# Greedy Algorithm

## Interval Scheduling

- Job j starts at $s_j$ and finishes at $f_j$.
- Two jobs compatible if they don't overlap.
- Goal: find maximum subset of mutually compatible jobs.

**Greedy algorithm**.

Consider jobs in increasing order of **finish time**.

Take each job provided it's compatible with the ones already taken.

按照结束时间升序排列： 先完成 DDL 早的

$$\text{Sort jobs by finish times so that }f_1 < f_2 < ... < f_n.\

A \leftarrow \emptyset\

for, j = 1\text{to n} { \

\qquad if (job j compatible with A)\

\qquad A \leftarrow A \cup {j}\

}\

return A $$

Implementation. $O(n log n)$.

- Remember job j* that was added last to A.
- Job j is compatible with A if $s_j < f_{j^*}$.

Analysis:

**Theorem**. **Greedy algorithm is optimal**.

Pf. (by contradiction)

Assume greedy is not optimal, and let's see what happens.

Let $i_1, i_2, ... i_k$ denote set of jobs selected by greedy.

Let $j_1, j_2, ... j_m$ denote set of jobs in the optimal solution with $i_1 = j_1, i_2 = j_2, ..., i_r = j_r$ for the largest possible value of r.

## Interval Partitioning

- Lecture j starts at $s_j$ and finishes at $f_j$.
- Goal: find minimum number of classrooms to schedule all lectures so that no two occur at the same time in the same room.

**Def**. The depth of a set of open intervals is the maximum number that contain any given time.

同一时间发生任务的最大数量是需要的教室数量的上限

**Greedy algorithm**：

Consider lectures in increasing order of start time: assign lecture to any compatible classroom

按照开始时间升序排列

$$\text{Sort intervals by starting time so that} s1 < s2 < ... < sn.\

d \leftarrow 0\

for, j = 1, to, n, {\

\qquad \text{if (lecture j is compatible with some classroom k)}\

\qquad \qquad \text{schedule lecture j in classroom k}\

\qquad else\qquad \qquad \text{allocate a new classroom d + 1}\

\qquad \qquad \text{schedule lecture j in classroom d + 1}\

\qquad \qquad d \leftarrow d + 1\

} $$

**Implementation**. $O(n log n)$.

- For each classroom k, maintain the finish time of the last job added.
- Keep the classrooms in a priority queue.

**Analysis**:

Greedy algorithm never schedules two incompatible lectures in the same classroom.

**Theorem. Greedy algorithm is optimal.**

Pf.

- Let d = number of classrooms that the greedy algorithm allocates.
- Classroom d is opened because we needed to schedule a job, say j, that is incompatible with all d-1 other classrooms.
- These d jobs each end after $s_j$.
- Since we sorted by start time, all these incompatibilities are caused by lectures that start no later than $s_j$.
- Thus, we have d lectures overlapping at time $s_j$.
- Key observation => all schedules use >= d classrooms. ▪

## Scheduling to Minimize Lateness

前面描述的是如何合理安排好任务，这里主要关注，在不可避免会拖延的时候，如何减少拖延损失。

- Single resource processes one job at a time.
- Job j requires $t_j$ units of processing time and is due at time $d_j$.
- If j starts at time $s_j$, it finishes at time $f_j= s_j + t_j$.
- Lateness: $l_j = max{0, fj - dj}$.
- Goal: schedule all jobs to minimize
**maximum**lateness $L = max, l_j$.

Greedy Algorithms：**Earliest deadline first.** 仍然是按照 DDL 先后排序

$$\text{Sort n jobs by deadline so that} d_1 < d2 < … < d_n\

t \leftarrow 0\

for, j = 1, to, n\

\qquad \text{Assign job j to interval} [t, t + t_j]\

\qquad s_j \leftarrow t, f_j \leftarrow t + t_j\

\qquad t \leftarrow t + t_j\

\text{output intervals} [s_j, f_j]$$

Observation. There exists an optimal schedule with no idle time.

Observation. The greedy schedule has no idle time.

由于贪心算法尽可能直接完成任务，因此没有的间隙 (在不限制开始时间的情况下)

**Def**.

**Inversion**: Given a schedule S, an inversion is a pair of jobs i and j such that: i < j but j scheduled before i.

Observation. Greedy schedule has no inversions.

Observation. If a schedule (with no idle time) has an inversion, it has one with a pair of inverted jobs scheduled consecutively.

已经按照结束时间排序了，因此贪心算法不会出现 Inversion

Claim. Swapping two consecutive, inverted jobs reduces the number of inversions by one and does not increase the max lateness

Pf. Let $l$ be the lateness before the swap, and let $l'$ be it afterwards.

- $l'_k = l_k$ for all $k \neq i, j$
- $l'_i \leq l_i$
- If job j is late:

## Optimal Caching

Caching.

- Cache with capacity to store k items.
- Sequence of m item requests $d_1, d_2, …, d_m$.
**Cache hit**: item already in cache when requested.**Cache miss**: item not already in cache when requested: must bring requested item into cache, and evict some existing item, if full.

Goal. Eviction schedule that **minimizes number of cache misses**.

**Farthest-in-future:** Evict item in the cache that is not requested until farthest in the future

Theorem. [Bellady, 1960s] FF is optimal eviction schedule.

Pf. Algorithm and theorem are intuitive; proof is subtle.

Def. A **reduced schedule** is a schedule that only inserts an item into

the cache in a step in which that item is requested.

Intuition. Can transform an unreduced schedule into a reduced one

with no more cache misses.

Claim. Given any unreduced schedule S, can transform it into a reduced

schedule S' with no more cache misses.

Pf. *(by induction on number of unreduced items)*

Suppose S brings d into the cache at time t, without a request.

Let c be the item S evicts when i t brings d into the cache.

- Case 1: d evicted at time t', before next request for d.
- Case 2: d requested at time t' before d is evicted. ▪

**Theorem. FF is optimal eviction algorithm**

Pf: // TODO: p30 greedy

## Shortest Paths in a Graph

略

## Minimum Spanning Tree

Given a connected graph G = (V, E) with real-valued edge weights $c_e$, an MST is a subset of the edges T E such that T is a spanning tree whose sum of edge weights is minimized.

本质是寻找权重最小且能使这个图保持连通性的 `n-1`

条边

### Kruskal's algorithm.

Start with $T = \emptyset $. Consider edges in ascending order of cost.

Insert edge e in T unless doing so would create a cycle.

Implementation. Use the **union-find** data structure.

- Build set T of edges in the MST.
- Maintain set for each connected component.
- O(m log n) for sorting and O(m (m, n)) for union-find.

### Reverse-Delete algorithm.

Start with T = E. Consider edges in descending order of cost.

Delete edge e from T unless doing so would disconnect T.

### Prim's algorithm

Start with some root node s and greedily grow a tree T from s outward.

At each step, add the cheapest edge e to T that has exactly one endpoint in T.

Implementation. Use a priority queue ala Dijkstra.

- Maintain set of explored nodes S.
- For each unexplored node v, maintain attachment cost a[v] = cost of cheapest edge v to a node in S.
- O(n2) with an array; O(m log n) with a binary heap.

Def:

Cut property:

Let S be any subset of nodes, and let e be the min cost edge with exactly one endpoint in S. Then the MST contains e.

Cycle property:

Let C be any cycle, and let f be the max cost edge belonging to C. Then the MST does not contain f.

Cycle

Set of edges the form a-b, b-c, c-d, …, y-z, z-a.

Cutset

A cut S is a subset of nodes. The corresponding cutset D is the subset of edges with exactly one endpoint in S

Claim. A cycle and a cutset intersect in an even number of edges.

证明：略 p10 MST

### 优化

Kruskal 和 Prim 都只能计算权重不同的情况，若存在权重相同的边，则在比较阶段优化即可：

## Clustering

聚类的本质和最小生成树差不多，无非是最小生成树是一个类，需要 n-1 条边

对于聚类数k，需要 n-k 条边，及在最小生成树的基础上删去 (k-1) 条权重最高的边即可

**k-clustering**: Divide objects into k non-empty groups.

**Distance function**: Assume it satisfies several natural properties.

- $d(p_i, p_j) = 0$ iff $p_i = p_j$ (identity of indiscernibles)
- $d(p_i, p_j) \geq 0$ (nonnegativity)
- $d(p_i, p_j) = d(p_j, p_i)$ (symmetry)

**Spacing**: Min distance between any pair of points in different clusters.

**Clustering of maximum spacing**: Given an integer k, find a k-clustering of maximum spacing.

**Single-link k-clustering algorithm**

Form a graph on the vertex set U, corresponding to n clusters.

Find the closest pair of objects such that each object is in a different cluster, and add an edge between them to merge these two clusters into a new cluster.

Repeat n-k times until there are exactly k clusters.

**Key observation**. This procedure is precisely Kruskal's algorithm (except we stop when there are k connected components).

**Remark**. Equivalent to finding an MST and deleting the k-1 most expensive edges.

证明：反证法

## Huffman Codes

省略了证明部分

T(n) = T(n-1) + O(n)

so time complexity is $O(N^2)$

Using priority queue: $T(n) = T(n-1) + O(logn) => O(NlogN)$

**Claim. Huffman code for S achieves the minimum ABL of any prefix code.**

**Pf. (by induction)****Base**: For n=2 there is no shorter code than root and two leaves.**Hypothesis**: Suppose Huffman tree T’ for S’ with $\omega$ instead of y and z is optimal. (IH)**Step**: (by contradiction)

- Suppose Huffman tree T for S is not optimal.
- So there is some tree Z such that ABL(Z) < ABL(T).
- Then there is also a tree Z for which leaves y and z exist that are

siblings and have the lowest frequency (see observation). - Let Z’ be Z with y and z deleted, and their former parent labeled $\omega $.
- Similar T’ is derived from S’ in our algorithm.
- We know that $ABL(Z’)=ABL(Z)-f_\omega$, as well as $ABL(T’)=ABL(T)-f_\omega$.
- But also ABL(Z) < ABL(T), so ABL(Z’) < ABL(T’).
- Contradiction with IH.