算法 | 贪心算法 - Greedy Algorithm

Greedy Algorithm

Interval Scheduling

  • Job j starts at $s_j$ and finishes at $f_j$.
  • Two jobs compatible if they don't overlap.
  • Goal: find maximum subset of mutually compatible jobs.

Greedy algorithm.

Consider jobs in increasing order of finish time.Take each job provided it's compatible with the ones already taken.

按照结束时间升序排列: 先完成 DDL 早的


$$\text{Sort jobs by finish times so that }f_1 < f_2 < ... < f_n.\A \leftarrow \emptyset\for, j = 1\text{to n} { \qquad if (job j compatible with A)\qquad A \leftarrow A \cup {j}}\return A $$


Implementation. $O(n log n)$.

  • Remember job j* that was added last to A.
  • Job j is compatible with A if $s_j < f_{j^*}$.

Analysis:

Theorem. Greedy algorithm is optimal.

Pf. (by contradiction)

Assume greedy is not optimal, and let's see what happens.

Let $i_1, i_2, ... i_k$ denote set of jobs selected by greedy.

Let $j_1, j_2, ... j_m$ denote set of jobs in the optimal solution with $i_1 = j_1, i_2 = j_2, ..., i_r = j_r$ for the largest possible value of r.

Interval Partitioning

  • Lecture j starts at $s_j$ and finishes at $f_j$.
  • Goal: find minimum number of classrooms to schedule all lectures so that no two occur at the same time in the same room.

Def. The depth of a set of open intervals is the maximum number that contain any given time.

同一时间发生任务的最大数量是需要的教室数量的上限

Greedy algorithm

Consider lectures in increasing order of start time: assign lecture to any compatible classroom

按照开始时间升序排列


$$\text{Sort intervals by starting time so that} s1 < s2 < ... < sn.\d \leftarrow 0\for, j = 1, to, n, {\qquad \text{if (lecture j is compatible with some classroom k)}\qquad \qquad \text{schedule lecture j in classroom k}\qquad else\qquad \qquad \text{allocate a new classroom d + 1}\qquad \qquad \text{schedule lecture j in classroom d + 1}\qquad \qquad d \leftarrow d + 1} $$


Implementation. $O(n log n)$.

  • For each classroom k, maintain the finish time of the last job added.
  • Keep the classrooms in a priority queue.

Analysis:

Greedy algorithm never schedules two incompatible lectures in the same classroom.

**Theorem. Greedy algorithm is optimal.**Pf.

  • Let d = number of classrooms that the greedy algorithm allocates.
  • Classroom d is opened because we needed to schedule a job, say j, that is incompatible with all d-1 other classrooms.
  • These d jobs each end after $s_j$.
  • Since we sorted by start time, all these incompatibilities are caused by lectures that start no later than $s_j$.
  • Thus, we have d lectures overlapping at time $s_j$.
  • Key observation => all schedules use >= d classrooms. ▪

Scheduling to Minimize Lateness

前面描述的是如何合理安排好任务,这里主要关注,在不可避免会拖延的时候,如何减少拖延损失。

  • Single resource processes one job at a time.
  • Job j requires $t_j$ units of processing time and is due at time $d_j$.
  • If j starts at time $s_j$, it finishes at time $f_j= s_j + t_j$.
  • Lateness: $l_j = max{0, fj - dj}$.
  • Goal: schedule all jobs to minimize maximum lateness $L = max, l_j$.

Greedy Algorithms:Earliest deadline first. 仍然是按照 DDL 先后排序


$$\text{Sort n jobs by deadline so that} d_1 < d2 < … < d_n\t \leftarrow 0\for, j = 1, to, n\qquad \text{Assign job j to interval} [t, t + t_j]\qquad s_j \leftarrow t, f_j \leftarrow t + t_j\qquad t \leftarrow t + t_j\text{output intervals} [s_j, f_j]$$


Observation. There exists an optimal schedule with no idle time.

Observation. The greedy schedule has no idle time.

由于贪心算法尽可能直接完成任务,因此没有的间隙 (在不限制开始时间的情况下)

Def.

Inversion: Given a schedule S, an inversion is a pair of jobs i and j such that: i < j but j scheduled before i.

Observation. Greedy schedule has no inversions.Observation. If a schedule (with no idle time) has an inversion, it has one with a pair of inverted jobs scheduled consecutively.

已经按照结束时间排序了,因此贪心算法不会出现 Inversion

Claim. Swapping two consecutive, inverted jobs reduces the number of inversions by one and does not increase the max lateness

Pf. Let $l$ be the lateness before the swap, and let $l'$ be it afterwards.

  • $l'_k = l_k$ for all $k \neq i, j$
  • $l'_i \leq l_i$
  • If job j is late:

Optimal Caching

Caching.

  • Cache with capacity to store k items.
  • Sequence of m item requests $d_1, d_2, …, d_m$.
  • Cache hit: item already in cache when requested.
  • Cache miss: item not already in cache when requested: must bring requested item into cache, and evict some existing item, if full.

Goal. Eviction schedule that minimizes number of cache misses.

Farthest-in-future: Evict item in the cache that is not requested until farthest in the future

Theorem. [Bellady, 1960s] FF is optimal eviction schedule.Pf. Algorithm and theorem are intuitive; proof is subtle.

Def. A reduced schedule is a schedule that only inserts an item intothe cache in a step in which that item is requested.

Intuition. Can transform an unreduced schedule into a reduced onewith no more cache misses.

Claim. Given any unreduced schedule S, can transform it into a reducedschedule S' with no more cache misses.

Pf. (by induction on number of unreduced items)

Suppose S brings d into the cache at time t, without a request.

Let c be the item S evicts when i t brings d into the cache.

  • Case 1: d evicted at time t', before next request for d.
  • Case 2: d requested at time t' before d is evicted. ▪

Theorem. FF is optimal eviction algorithm

Pf: // TODO: p30 greedy

Shortest Paths in a Graph

Minimum Spanning Tree

Given a connected graph G = (V, E) with real-valued edge weights $c_e$, an MST is a subset of the edges T  E such that T is a spanning tree whose sum of edge weights is minimized.

本质是寻找权重最小且能使这个图保持连通性的 n-1 条边

Kruskal's algorithm.

Start with $T = \emptyset $. Consider edges in ascending order of cost.

Insert edge e in T unless doing so would create a cycle.

Implementation. Use the union-find data structure.

  • Build set T of edges in the MST.
  • Maintain set for each connected component.
  • O(m log n) for sorting and O(m  (m, n)) for union-find.

Reverse-Delete algorithm.

Start with T = E. Consider edges in descending order of cost.

Delete edge e from T unless doing so would disconnect T.

Prim's algorithm

Start with some root node s and greedily grow a tree T from s outward.

At each step, add the cheapest edge e to T that has exactly one endpoint in T.

Implementation. Use a priority queue ala Dijkstra.

  • Maintain set of explored nodes S.
  • For each unexplored node v, maintain attachment cost a[v] = cost of cheapest edge v to a node in S.
  • O(n2) with an array; O(m log n) with a binary heap.

Def:

Cut property:

Let S be any subset of nodes, and let e be the min cost edge with exactly one endpoint in S. Then the MST contains e.

Cycle property:

Let C be any cycle, and let f be the max cost edge belonging to C. Then the MST does not contain f.

Cycle

Set of edges the form a-b, b-c, c-d, …, y-z, z-a.

Cutset

A cut S is a subset of nodes. The corresponding cutset D is the subset of edges with exactly one endpoint in S

Claim. A cycle and a cutset intersect in an even number of edges.

证明:略 p10 MST

优化

Kruskal 和 Prim 都只能计算权重不同的情况,若存在权重相同的边,则在比较阶段优化即可:

Clustering

聚类的本质和最小生成树差不多,无非是最小生成树是一个类,需要 n-1 条边

对于聚类数k,需要 n-k 条边,及在最小生成树的基础上删去 (k-1) 条权重最高的边即可

k-clustering: Divide objects into k non-empty groups.

Distance function: Assume it satisfies several natural properties.

  • $d(p_i, p_j) = 0$ iff $p_i = p_j$ (identity of indiscernibles)
  • $d(p_i, p_j) \geq 0$ (nonnegativity)
  • $d(p_i, p_j) = d(p_j, p_i)$ (symmetry)

Spacing: Min distance between any pair of points in different clusters.

Clustering of maximum spacing: Given an integer k, find a k-clustering of maximum spacing.

Single-link k-clustering algorithm

Form a graph on the vertex set U, corresponding to n clusters.

Find the closest pair of objects such that each object is in a different cluster, and add an edge between them to merge these two clusters into a new cluster.

Repeat n-k times until there are exactly k clusters.

Key observation. This procedure is precisely Kruskal's algorithm (except we stop when there are k connected components).

Remark. Equivalent to finding an MST and deleting the k-1 most expensive edges.

证明:反证法

Huffman Codes

省略了证明部分

T(n) = T(n-1) + O(n)

so time complexity is $O(N^2)$

Using priority queue: $T(n) = T(n-1) + O(logn) => O(NlogN)$

Claim. Huffman code for S achieves the minimum ABL of any prefix code.

**Pf. (by induction)**Base: For n=2 there is no shorter code than root and two leaves.Hypothesis: Suppose Huffman tree T’ for S’ with $\omega$ instead of y and z is optimal. (IH)Step: (by contradiction)

  • Suppose Huffman tree T for S is not optimal.
  • So there is some tree Z such that ABL(Z) < ABL(T).
  • Then there is also a tree Z for which leaves y and z exist that aresiblings and have the lowest frequency (see observation).
  • Let Z’ be Z with y and z deleted, and their former parent labeled $\omega $.
  • Similar T’ is derived from S’ in our algorithm.
  • We know that $ABL(Z’)=ABL(Z)-f_\omega$, as well as $ABL(T’)=ABL(T)-f_\omega$.
  • But also ABL(Z) < ABL(T), so ABL(Z’) < ABL(T’).
  • Contradiction with IH.