e2718281828459045

Month: August, 2013

Dynamic Programming in Practice: Optimal Data Segmentation

I have been reviewing dynamic programming recently, and today I saw a nice application of dynamic programming in astrophysics: Dynamic Programming in Python: Bayesian Blocks.

The following is the astrophysics problem phrased in a purely layman's way: given a set of data points in the plane, where each point is a pair (time point, observation value), we would like to extract as much information as possible from these observations, which, roughly speaking, involves computing a histogram of the observations. The difficulty lies in determining the bin size: if the bin size is too large, we lose important information; at the same time, we would like to suppress the noise in the observations. One approach used in the past is to adjust the bin size manually in a trial-and-error fashion: if the histogram is too jagged, increase the bin size; if it is too smooth, decrease it.

The novel idea in the paper is to first fix a fitness/cost function over partitions of the data, derived from a Bayesian likelihood framework, and then compute an optimum partition of the data points with respect to this function.

Here is the same problem phrased in the language of computer science:

Let P be a set of n points in the plane. A block B of points in P is a set of points consecutive along the x-axis. A partition of P into blocks is specified by a sequence of blocks B_1, B_2, \cdots, B_k, where the blocks are pairwise disjoint and \cup_{i=1}^k B_i = P. Let {\rm fit}(B) be the fitness of the block B. The goal is to find an optimum partition B_1, \cdots, B_k maximizing \sum_{i=1}^k {\rm fit}(B_i) (the value k is not given; it is also determined by the algorithm).

First renumber the points in P by their x-coordinates. The recursion formula is straightforward: let {\rm BestFit}(i) denote the value of an optimum partition of the points in P[1..i] (the leftmost i points in P), for 1 \leq i \leq n, and let {\rm BestFit}(0) = 0. Then {\rm BestFit}(i+1) = \max_{1 \leq j \leq i+1}\{{\rm BestFit}(j-1) + {\rm fit}(P[j..i+1])\}. There are n entries to compute in total, and we spend O(i) time on the i^{\rm th} entry, so the overall time complexity is O(n^2). Note that we assume that {\rm fit} can be evaluated in O(1) time for a block, which is usually true (possibly after preprocessing).
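To make the recursion concrete, here is a minimal C++ sketch of this O(n^2) DP (my own illustration, not code from the paper); the fitness function fit(j, i) is passed in as a callback and is assumed, as above, to evaluate {\rm fit}(P[j..i]) in O(1) time.

#include <vector>
#include <algorithm>
#include <functional>
using namespace std;

// Sketch of the O(n^2) segmentation DP. fit(j, i) is assumed to return
// fit(P[j..i]) (1-based, inclusive) in O(1) time after preprocessing.
double bestFit(int n, const function<double(int, int)> &fit)
{
    vector<double> best(n + 1);
    best[0] = 0.0;                        // empty prefix
    for (int i = 1; i <= n; ++i) {
        best[i] = -1e300;                 // we are maximizing
        for (int j = 1; j <= i; ++j)      // last block is P[j..i]
            best[i] = max(best[i], best[j - 1] + fit(j, i));
    }
    return best[n];                       // value of an optimum partition
}

Recording, for each i, the index j attaining the maximum would let us recover the actual blocks by backtracking.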

An almost identical problem is segmented least squares. Given a set P of n points in the plane, we would like to partition P into k blocks and fit each block with a line. The overall cost of the partition is the sum of the costs of the blocks.

This problem is a generalization of linear regression: in linear regression, the cost of fitting a point set P of size n to a line l_{a,b}: y = ax + b is
{\rm fit}(P, l_{a,b}) = \sum_{p \in P}(p_y - a p_x - b)^2, where p = (p_x, p_y). The "best" coefficients a and b, that is, those minimizing {\rm fit}(P, l_{a,b}), have closed forms:
a = \frac{n \sum_{p \in P}p_x p_y - (\sum_{p \in P}p_x)(\sum_{p \in P} p_y)}{n \sum_{p \in P}p_x^2 - (\sum_{p \in P}p_x)^2},

b = \frac{\sum_{p\in P} p_y - a\sum_{p \in P}p_x}{n}.

However, sometimes the data points lie on more than one line. Hence we would like to discover this fact (for example, we might be interested in (1) the number of lines, and (2) the points where the line changes direction; such points are called change points in the Bayesian block partition problem above) and find the optimum cost. A human can sometimes determine this easily, but naturally there are situations where the number of lines needed to best fit the point set is quite unclear.

We now formulate this as an optimization problem. Given a partition \mathcal{P} = \{B_1, \cdots, B_k\} of the point set into segments/blocks, where each segment consists of a subset of P contiguous along the x-axis, let l_i^{\ast} be the optimum line fitting B_i. The cost of the partition is
{\rm cost}(\mathcal{P}) = Ck + \sum_{i=1}^k {\rm fit}(B_i, l_i^{\ast}),
where C is a positive real number, which encodes the penalty for partitioning P into k parts (without this term, we could fit each single point with its own line, yielding a trivial and meaningless optimum solution of cost 0). The goal is to find an optimum partition \mathcal{P} minimizing {\rm cost}(\mathcal{P}).

The dynamic programming approach is the following. First, we renumber the points in P according to their x-coordinates (in ascending order). Define {\rm BestFit}(0) = 0, and define {\rm BestFit}(i) to be the cost of an optimum partition of the first i points of P. The recursion formula is almost identical to the one for the Bayesian block partition problem:
{\rm BestFit}(i+1) = \min_{1 \leq j \leq i+1}({\rm BestFit}(j-1) + C + {\rm fit}(P[j..i+1])), for 0 \leq i \leq n-1, where {\rm fit}(P[j..i+1]) is the optimal cost of fitting the points in P[j..i+1] with a single line (we are slightly abusing the notation here).

The time complexity is O(n^2), following the same reasoning as before. One thing to note is that we need to do some preprocessing on P, so that we can compute {\rm fit}(P[i..j]) in O(1) time.
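For segmented least squares, one standard way to get O(1)-time fit queries (a sketch under my own naming, not necessarily the exact preprocessing used in the lecture notes) is to store prefix sums of x, y, x^2, xy and y^2; the optimal a and b, and the fitting error of any segment, then follow from the closed forms above:

#include <vector>
using namespace std;

// Prefix sums enabling O(1) least-squares cost for any segment P[j..i],
// assuming the points are already sorted by x-coordinate.
struct SegmentFit {
    vector<double> sx, sy, sxx, sxy, syy;   // prefix sums; index 0 = empty

    SegmentFit(const vector<double> &x, const vector<double> &y)
        : sx(x.size()+1, 0), sy(x.size()+1, 0), sxx(x.size()+1, 0),
          sxy(x.size()+1, 0), syy(x.size()+1, 0)
    {
        for (size_t t = 0; t < x.size(); ++t) {
            sx[t+1]  = sx[t]  + x[t];
            sy[t+1]  = sy[t]  + y[t];
            sxx[t+1] = sxx[t] + x[t]*x[t];
            sxy[t+1] = sxy[t] + x[t]*y[t];
            syy[t+1] = syy[t] + y[t]*y[t];
        }
    }

    // Least-squares error of fitting P[j..i] (0-based, inclusive) to a line.
    double fit(size_t j, size_t i) const
    {
        double c   = double(i - j + 1);
        double Sx  = sx[i+1]  - sx[j],  Sy  = sy[i+1]  - sy[j];
        double Sxx = sxx[i+1] - sxx[j], Sxy = sxy[i+1] - sxy[j];
        double Syy = syy[i+1] - syy[j];
        double denom = c*Sxx - Sx*Sx;
        double a = (denom == 0) ? 0 : (c*Sxy - Sx*Sy) / denom;  // guard: all x equal
        double b = (Sy - a*Sx) / c;
        // For the optimal a and b, the normal equations simplify the error
        // sum((y - a x - b)^2) to Syy - a*Sxy - b*Sy.
        return Syy - a*Sxy - b*Sy;
    }
};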

Dynamic Programming Practice (4)

Shuffle of strings A shuffle of two strings X and Y is formed by interspersing their characters into a new string, keeping the characters of X and Y in the same order.

Example: the string “prodgyrnamammiincg” is a shuffle of “dynamic” and “programming”.

Given three strings S_1[1..m], S_2[1..n] and S[1..n+m], describe and analyze an algorithm to determine whether S is a shuffle of S_1 and S_2.

This problem is from these lecture notes.

This is a dynamic programming problem. The recursion formula is straightforward: define {\rm Shuffle}(i, j) to be true if S[1..i+j] is a shuffle of S_1[1..i] and S_2[1..j], and false otherwise, for 0 \leq i \leq m and 0 \leq j \leq n.

(1) {\rm Shuffle}(0,0) = {\rm true}. (An empty string is a shuffle of two empty strings).
(2) If j = 0, then {\rm Shuffle}(i,0) is true if and only if the string S_1[1..i] is a prefix of S, for 1 \leq i \leq m.
(3) If i = 0, then {\rm Shuffle}(0,j) is true if and only if the string S_2[1..j] is a prefix of the string S, for 1 \leq j \leq n.
(4) Now assume that both S_1[1..i] and S_2[1..j] are non-empty strings, then:

  • If S_1[i] \neq S_2[j], then {\rm Shuffle}(i,j) equals {\rm Shuffle}(i-1, j) if S[i+j] = S_1[i], and {\rm Shuffle}(i, j-1) if S[i+j] = S_2[j]; otherwise it is false.
  • If S_1[i] = S_2[j], then {\rm Shuffle}(i,j) is false if S[i+j] \neq S_1[i]; otherwise {\rm Shuffle}(i,j) = {\rm Shuffle}(i-1, j) || {\rm Shuffle}(i, j-1).

{\rm Shuffle}(m,n) is the final answer.

The time complexity of the algorithm is O(mn), and the space complexity is O(\min\{m,n\}). Here is the result of executing the algorithm on the example {\rm Shuffle}("{\rm prodgyrnamammiincg}", "{\rm dynamic}", "{\rm programming}").
[Figure: the dynamic programming table for Shuffle("prodgyrnamammiincg", "dynamic", "programming")]
Note that we do not need to maintain the whole table explicitly in the implementation: we fill the table column by column from left to right, and within each column, the entries are filled from top to bottom. So we only need to keep one column of the table at a time (the "sliding window" method).
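Here is a C++ sketch of this sliding-window implementation (0-based indices; dp[j] plays the role of one column of the table; names are my own):

#include <string>
#include <vector>
#include <iostream>
using namespace std;

// O(mn)-time, O(min(m,n))-space shuffle test: dp[j] holds Shuffle(i, j)
// for the current i, i.e., one column of the table.
bool isShuffle(const string &s1, const string &s2, const string &s)
{
    size_t m = s1.size(), n = s2.size();
    if (s.size() != m + n) return false;
    if (m < n) return isShuffle(s2, s1, s);   // keep the second string shorter

    vector<bool> dp(n + 1);
    dp[0] = true;
    for (size_t j = 1; j <= n; ++j)           // row i = 0: prefixes of s2
        dp[j] = dp[j-1] && s2[j-1] == s[j-1];
    for (size_t i = 1; i <= m; ++i) {
        dp[0] = dp[0] && s1[i-1] == s[i-1];   // column j = 0: prefixes of s1
        for (size_t j = 1; j <= n; ++j)
            dp[j] = (dp[j]   && s1[i-1] == s[i+j-1]) ||   // take from s1
                    (dp[j-1] && s2[j-1] == s[i+j-1]);     // take from s2
    }
    return dp[n];
}

int main()
{
    cout << boolalpha
         << isShuffle("dynamic", "programming", "prodgyrnamammiincg") << endl; // true
}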

One thing to note in this table is that most of the entries seem to be false, so is there a linear-time (that is, O(n+m)) algorithm for this problem?

A probability puzzle: loaded dice

I came across this question on Interview Street:

Random number generator
There is an ideal random number generator which, given a positive integer M, can generate any real number between 0 and M with equal probability.
Suppose we generate two numbers x and y via the generator by giving it two positive integers A and B. What is the probability that x + y is less than C, where C is a positive integer?

Input Format
The first line of the input is an integer N, the number of test cases.
N lines follow. Each line contains 3 positive integers A, B and C.
All the integers are no larger than 10000.

Output Format
For each test case, output a fraction that indicates the probability. The greatest common divisor of each numerator/denominator pair should be 1.

Input
3
1 1 1
1 1 2
1 1 3

Output
1/2
1/1
1/1

One way to think about this problem is via generating functions. Suppose the distribution is not uniform; is there an efficient way to compute the probability that the sum of the two randomly generated numbers is no greater than C?

Suppose that for the first random number generator, the probability that the randomly generated number X equals i is p_i, for 0 \leq i \leq A; similarly, for the second generator, the probability that the randomly generated number Y equals i is q_i, for 0 \leq i \leq B. Define two polynomials P(x) = \sum_{j=0}^A p_j x^j and Q(x) = \sum_{j = 0}^B q_j x^j. Then consider the product P(x)Q(x):
P(x)Q(x) = \sum_{k = 0}^{A+B}\left(\sum_{i+j = k,\, 0 \leq i \leq A,\, 0 \leq j \leq B} p_i q_j\right) x^k.
The k^{\rm th} coefficient on the right hand side of the above equation is precisely the probability that X+Y=k. So the probability that X+Y \leq k is the sum of the first k+1 coefficients. In the Interview Street problem, since the distribution is uniform, each p_i is 1/(A+1) and each q_i is 1/(B+1), so we can easily obtain the probability that X+Y < C.
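As an illustration (a sketch with hypothetical names, for the discrete setting just described), the coefficient vector of P(x)Q(x) is simply the convolution of the two probability vectors:

#include <vector>
using namespace std;

// Coefficients of P(x)Q(x): c[k] = Pr(X + Y = k), where p[i] = Pr(X = i)
// over 0..A and q[j] = Pr(Y = j) over 0..B.
vector<double> convolve(const vector<double> &p, const vector<double> &q)
{
    vector<double> c(p.size() + q.size() - 1, 0.0);
    for (size_t i = 0; i < p.size(); ++i)
        for (size_t j = 0; j < q.size(); ++j)
            c[i + j] += p[i] * q[j];
    return c;
}

// Pr(X + Y <= k) is the prefix sum of the first k+1 coefficients.
double prSumAtMost(const vector<double> &p, const vector<double> &q, size_t k)
{
    vector<double> c = convolve(p, q);
    double total = 0.0;
    for (size_t t = 0; t <= k && t < c.size(); ++t)
        total += c[t];
    return total;
}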

The probability puzzle (using generating functions) is the following:
Loaded dice: No matter what two loaded dice we have, it cannot happen that each of the sums 2,3, …, 12 comes up with the same probability.
(I first came across this puzzle in the book "The Art of Mathematics".)

We can reason as follows: let p_i denote the probability that the first die shows i, for 1 \leq i \leq 6, and similarly let q_i denote the probability that the second die shows i, for 1 \leq i \leq 6. Suppose each sum occurs with probability 1/11. The probability that the sum is 2 is p_1 q_1 = 1/11 (both dice must show 1), and the probability that the sum is 12 is p_6 q_6 = 1/11 (both dice must show 6). Now consider the probability that the sum is 7: it is at least p_1 q_6 + p_6 q_1 (the first die shows 1 and the second shows 6, or the reverse). But p_1 q_6 + p_6 q_1 = (1/11)(p_1/p_6 + p_6/p_1) \geq 2/11 by the AM-GM inequality, contradicting the assumption that the sum 7 occurs with probability 1/11.

This reasoning works, but the book provides a much nicer proof using generating functions. As before, let P(x) = \sum_{i = 1}^6 p_i x^{i-1} and Q(x) = \sum_{i=1}^6 q_i x^{i-1}. Now if the sums 2, 3, \cdots, 12 all come up with the same probability 1/11, then we have
11 P(x)Q(x) = 1 + x + x^2 + \cdots + x^{10} = (x^{11}-1)/(x-1). Since p_6 q_6 = 1/11 > 0, both P(x) and Q(x) are polynomials of degree 5, and a polynomial of odd degree has a real root, so 11 P(x)Q(x) has real roots; however, the right hand side is the cyclotomic polynomial 1 + x + \cdots + x^{10}, whose roots are the non-trivial 11th roots of unity, so it has no real roots. Thus the identity cannot hold.

Here is another application of generating functions for computing probabilities.

Counting heads Given integers n and k, along with p_1, ..., p_n \in [0,1], you want to determine the probability of obtaining exactly k heads when n biased coins are tossed independently at random, where p_i is the probability that the i^{\rm th} coin comes up heads. Assume that you can multiply and add two numbers in [0,1] in O(1) time.

(This problem is from this book.)

This problem can be solved using dynamic programming: let X_i be the indicator random variable for the event that the i^{\rm th} coin comes up heads, for 1 \leq i \leq n, and write q_i = 1 - p_i. Notice that
{\rm Pr}(\sum_{i=1}^n X_i = k) = {\rm Pr}(\sum_{i = 1}^{n-1} X_i = k-1 \text{ and } X_n = 1) + {\rm Pr}(\sum_{i=1}^{n-1} X_i = k \text{ and } X_n = 0) = p_n {\rm Pr}(\sum_{i=1}^{n-1} X_i = k-1) + q_n {\rm Pr}(\sum_{i=1}^{n-1} X_i = k).
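A minimal C++ sketch of this O(n^2) DP, processing one coin at a time (array and function names are my own):

#include <vector>
using namespace std;

// heads[k] holds Pr(exactly k heads among the coins processed so far);
// processing coin i applies the recursion above, in place.
vector<double> countHeads(const vector<double> &p)   // p[i] = Pr(coin i+1 is heads)
{
    int n = p.size();
    vector<double> heads(n + 1, 0.0);
    heads[0] = 1.0;                        // zero coins: zero heads, surely
    for (int i = 0; i < n; ++i)
        for (int k = i + 1; k >= 0; --k)   // descend so heads[k-1] is still old
            heads[k] = (k > 0 ? p[i] * heads[k - 1] : 0.0)
                     + (1.0 - p[i]) * heads[k];
    return heads;                          // heads[k] = Pr(exactly k heads)
}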

The time complexity is O(n^2). However, with the generating-function approach, we can get a faster algorithm using the fast Fourier transform:
Let P_i(x) = q_i + p_i x, for 1 \leq i \leq n. Then consider the product of these n polynomials of degree 1: \prod_{i = 1}^n P_i(x) = \sum_{j = 0}^n c_j x^j. The probability of obtaining k heads is the coefficient c_k. Therefore, the problem is reduced to how fast we can multiply polynomials. Using the fast Fourier transform, we can multiply two polynomials of degree at most n in O(n \log n) time. Hence, we can use divide-and-conquer to multiply these n polynomials:
\prod_{i=1}^n P_i(x) = \left(\prod_{i=1}^{n/2} P_i(x)\right) \times \left(\prod_{i=n/2+1}^n P_i(x)\right).
Let T(n) denote the time to multiply n polynomials of degree 1. Then we have the following recursion:
T(n) = 2 T(n/2) + O(n \log n),
as in the "conquer" step we are multiplying two polynomials of degree n/2. Hence T(n) = O(n \log^2 n).
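The divide-and-conquer structure might look like the following sketch; multiply() here is the naive product, and replacing it with an FFT-based O(d \log d) multiplication gives the O(n \log^2 n) bound:

#include <vector>
using namespace std;

// Naive O(ab) polynomial product; an FFT-based multiplication would be
// substituted here to achieve the bound discussed above.
vector<double> multiply(const vector<double> &a, const vector<double> &b)
{
    vector<double> c(a.size() + b.size() - 1, 0.0);
    for (size_t i = 0; i < a.size(); ++i)
        for (size_t j = 0; j < b.size(); ++j)
            c[i + j] += a[i] * b[j];
    return c;
}

// Product of the degree-1 polynomials q_i + p_i x for i in [lo, hi).
vector<double> product(const vector<double> &p, size_t lo, size_t hi)
{
    if (hi - lo == 1)
        return {1.0 - p[lo], p[lo]};          // q_i + p_i x
    size_t mid = lo + (hi - lo) / 2;
    return multiply(product(p, lo, mid), product(p, mid, hi));
}
// The coefficient of x^k in product(p, 0, n) is Pr(exactly k heads).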

Coin Exchange: Dynamic Programming Practice (3)

Coin Exchange: Given a set of m denominations \{d_0, \cdots, d_{m-1}\} and a target value k, find the smallest number of bills whose values sum to k (each denomination may be used multiple times). For example, if the denominations are \{1,4,7,13,28,52,91,365\} and the target is 122, then an optimum way to pick the bills is 3 \times 1 + 28 + 91, so the number of bills we need is 5.

This is a very common dynamic programming problem. To simplify things a bit, we assume that the smallest denomination is 1, which ensures that for any target value there is at least one way to reach it (a sequence of 1's). W.l.o.g., assume that d_0 < d_1 < \cdots < d_{m-1}. The recursion formula is the following: define {\rm MinBill}(x, i) to be the smallest number of bills needed to reach x, with the constraint that the largest denomination we may use is d_i, for 0 \leq i \leq m-1. The base cases are: {\rm MinBill}(x, 0) = x for x \in \mathbb{N}, and {\rm MinBill}(x, i) = 0 for x \leq 0. For x \geq 1 and 1 \leq i \leq m-1, if x < d_i, then {\rm MinBill}(x, i) = {\rm MinBill}(x, i'), where i' is the largest index satisfying d_{i'} \leq x; otherwise, {\rm MinBill}(x, i) = \min\{1 + {\rm MinBill}(x - d_i, i), {\rm MinBill}(x, i-1)\}.

Time complexity/space complexity: there are several ways to implement the dynamic programming algorithm. If we use the table-filling approach, the time complexity is O(m N), where m is the number of denominations and N is the target value, so it is a pseudo-polynomial time algorithm. The space complexity is O(N). In the code below, I used the "recursion + memoization" approach (the reason is given here).

#include <vector>
#include <algorithm>
#include <functional>
#include <unordered_map>
#include <utility>
#include <iostream>
using namespace std;

size_t hasher(const pair<int, int> &p)
{
    return hash<int>()(p.first) + hash<int>()(p.second);
}

class CoinExchange {
public:
    CoinExchange(const vector<int> &Denominations): Denom {Denominations}, cache(Denom.size(), hasher) {sort(Denom.begin(), Denom.end());}
    int operator()(int k);
    int MinCoin(int target, int largest);
private:
    vector<int> Denom;
    unordered_map<pair<int, int>, int, decltype(hasher) *> cache;
};

int CoinExchange::MinCoin(int target, int largest)
{
    if (target <= 0)               // nothing left to pay
        return 0;

    if (!largest)                  // only the smallest denomination (1) remains
        return target;

    if (target < Denom[largest]) {
        // Skip denominations that are larger than the remaining target.
        largest = upper_bound(Denom.begin(), Denom.end(), target) - Denom.begin() - 1;
        return MinCoin(target, largest);
    }

    if (cache.find({target, largest}) != cache.end())
        return cache[{target, largest}];
    // Either spend one bill of Denom[largest], or stop using that denomination.
    return cache[{target, largest}] = min(1 + MinCoin(target - Denom[largest], largest),
                                          MinCoin(target, largest - 1));
}

int CoinExchange::operator()(int target)
{
    return MinCoin(target, Denom.size()-1);
}

int main()
{
    vector<int> denominations {1,4,7,13,28,52,91,365};
    int target;
    cin >> target;
    CoinExchange coinexchange {denominations};
    cout << coinexchange(target) << endl;
}

Update:
See here for a very well-explained article for solving the coin exchange problem in O(1) time.

Dynamic Programming Practice (2)

Maximum set of non-intersecting lines Consider two horizontal lines l_1 and l_2 in the plane. Suppose we are given the x-coordinates of n distinct points a_1, a_2, \cdots, a_n on l_1 and n distinct points b_1, b_2, \cdots, b_n on l_2. Design an algorithm to compute the largest set S of non-intersecting line segments satisfying the following restrictions:

(a) Each segment in S connects some point a_i to the corresponding point b_i.
(b) No two segments in S intersect.

This problem is from the chapter of dynamic programming of these lecture notes.

It is quite easy to derive a recursion formula for this problem: let P denote the set of n segments s_1, \cdots, s_n, where s_i connects a_i to b_i. Then the size of a maximum set of non-intersecting segments of P is {\rm OPT}(P) = \max\{{\rm OPT}(P\setminus \{s_1\}), 1+{\rm OPT}(P\setminus {\rm Intersect}(s_1))\}, where {\rm Intersect}(s_i) denotes the set of segments in P intersecting s_i, together with s_i itself. However, in order to use the dynamic programming approach, it is not clear how the subproblems overlap and in what order we should evaluate them.

An alternative approach is to formulate the problem as a graph problem: each segment s_i corresponds to a vertex in the graph, and two vertices are connected by an edge if the segments corresponding to them intersect. Then the problem is equivalent to finding a maximum independent set in this graph. However, a maximum independent set can be computed easily in polynomial time (via dynamic programming) only when the graph has some special structure, for example, when the underlying graph is a tree. The intersection graph of the segments clearly need not have such a property.

So how can we use dynamic programming to solve this problem? The key observation is the following: suppose we renumber the indices of \{a_1, \cdots, a_n\} so that they are ordered from left to right, and we also renumber the corresponding indices of the b_i's (so when we change the index of some a_k to a_i, we also change b_k to b_i). Now look at the sequence of indices of the b_i's. The key observation is that a maximum set of non-intersecting segments corresponds to a longest increasing subsequence of this sequence. The renumbering step can be done in O(n \log n) time, and it takes O(n \log n) time to find a longest increasing subsequence, so the overall time complexity is O(n \log n).
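A C++ sketch of this reduction, using the standard patience-sorting computation of a longest increasing subsequence (names are my own):

#include <vector>
#include <algorithm>
#include <utility>
using namespace std;

// Sort the segments by a_i; the answer is the length of a longest
// increasing subsequence of the corresponding b_i's.
int maxNonIntersecting(vector<pair<double, double>> segments)  // (a_i, b_i)
{
    sort(segments.begin(), segments.end());     // renumbering by a_i
    // tail[t] = smallest possible last b-value of an increasing
    // subsequence of length t+1 (patience-sorting LIS)
    vector<double> tail;
    for (const auto &s : segments) {
        auto it = lower_bound(tail.begin(), tail.end(), s.second);
        if (it == tail.end()) tail.push_back(s.second);
        else *it = s.second;
    }
    return (int)tail.size();
}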

Largest k elements in a stream

Given a sequence of numbers, whose length is unknown, find the k largest elements.

This problem is an interesting variant of the largest k elements problem, which is usually phrased as follows: given a sequence of n numbers and a number k, find the k largest elements. There are two standard ways to solve this problem: (1) use the selection algorithm to find the element of rank k in linear time, and then do another scan to find all elements no smaller than this k^{\rm th} largest element. The time complexity is O(n), and the space complexity is O(n). (2) We can reduce the space complexity by maintaining a min-heap of size k. For each element after the k^{\rm th} one, if it is smaller than the top element of the heap, discard it; otherwise, pop the current smallest element from the heap and push the new element in. The time complexity is O(n \log k), and the space complexity is O(k).

How do we achieve O(n) time while using only O(k) space? The solution is the following: use an array of size 2k - 1. While the array is not full, add new elements to it. When the array fills up, use the linear-time median selection algorithm to find the median of the current 2k-1 elements, and discard the k-1 elements smaller than the median. This takes O(k) time. The array fills up O(n/k) times, so the overall time complexity is O(n/k) \cdot O(k) = O(n).
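Here is a sketch of this buffer trick in C++, with nth_element playing the role of the linear-time selection algorithm (function names are mine; assumes k \geq 1):

#include <vector>
#include <algorithm>
#include <functional>
#include <iostream>
using namespace std;

// O(n)-time, O(k)-space stream version: buffer up to 2k-1 elements; on
// overflow, select the k-th largest and drop the k-1 elements below it.
vector<int> largestK(istream &in, size_t k)
{
    vector<int> buf;
    buf.reserve(2 * k - 1);
    int x;
    while (in >> x) {
        buf.push_back(x);
        if (buf.size() == 2 * k - 1) {
            // Places the k-th largest at index k-1, larger elements before it.
            nth_element(buf.begin(), buf.begin() + (k - 1), buf.end(), greater<int>());
            buf.resize(k);               // keep the k largest seen so far
        }
    }
    if (buf.size() > k) {                // final cleanup of the buffer
        nth_element(buf.begin(), buf.begin() + (k - 1), buf.end(), greater<int>());
        buf.resize(k);
    }
    return buf;                          // the k largest (or fewer) elements
}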

I came across this solution from Algorithms Notes.

Breaking a string into words

You are given a string of n characters s[1..n], which you believe to be a corrupted text document in which all punctuation has vanished (so that it looks something like "itwasthebestoftime…"). You wish to reconstruct the document using a dictionary, which is available in the form of a Boolean function dict(\cdot): for any string w, dict(w) returns true if w is a valid word, and false otherwise.
Give a dynamic programming algorithm that determines whether the string s[\cdot] can be reconstituted as a sequence of valid words. The running time should be at most O(n^2), assuming calls to dict(\cdot) take unit time.

The question is an exercise from “Algorithms” by S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani. It is a typical dynamic programming problem.

The recursion is straightforward: define ValidSentence(S[0..n)) to be a predicate over strings, where ValidSentence(S[0..0)) = {\rm true} (the empty string is considered a valid word in the dictionary, and thus vacuously a valid sentence); for n \geq 1,
ValidSentence(S[0..n)) = {\rm OR}_{1 \leq i \leq n} (dict(S[0..i)) \&\& ValidSentence(S[i..n))).
(The string S[0..n) can be broken into valid words if and only if there exists some index i > 0 such that the prefix S[0..i) is a word in the dictionary and the remaining string can be broken into a sequence of valid words.) We can slightly simplify the notation by dropping the end index of the string:

ValidSentence(n) = {\rm true}
ValidSentence(i) = {\rm OR}_{i+1 \leq j \leq n} (dict(S[i..j)) \&\& ValidSentence(j)), for 0 \leq i \leq n-1

Then we only need to compute ValidSentence(0).

The code below uses "recursion + memoization" (note that "memoization" should not be confused with "memorization"). In the past I had always thought that "recursion + memoization" is roughly equivalent to iteration (it is also very straightforward to implement the recursion in an iterative fashion). I have always slightly preferred the iterative style over recursive calls, since the latter involves extra bookkeeping such as pushing parameters onto the stack and returning values, though recursion is indeed much easier to code, and mathematically more concise than the iterative form. But the authors of the book point out an interesting situation where recursion + memoization might be better:

In some cases, though, memoization pays off. Here's why: dynamic programming automatically solves every subproblem that could conceivably be needed, while memoization only ends up solving the ones that are actually used. … The memoized recursive algorithm will never look at these extraneous table entries.

bool validsentence(const string &s, size_t i, unordered_map<size_t, bool> &cache, const Dictionary &dictionary)
{
    if (i == s.size())
        return true;

    if (cache.find(i) != cache.end())
        return cache[i];

    bool res {};
    for (size_t j = i + 1; j <= s.size(); ++j)
        res = res || (dictionary.isword(string(s.cbegin() + i, s.cbegin() + j))
                      && validsentence(s, j, cache, dictionary));
    cache[i] = res;
    return res;
}

bool validsentence(const string &s, const Dictionary &dict)
{
    unordered_map<size_t, bool> cache;
    return validsentence(s, 0, cache, dict);
}

Longest subarray whose sum <= k

Longest subarray whose sum \leq k
Given an array A of n numbers and a key k, find the longest subarray of A for which the subarray sum is less than or equal to k.

This problem is from the book “Elements of Programming Interviews”.

This problem has a rather easy quadratic-time algorithm: check each pair of indices i \leq j, where 0 \leq i \leq j \leq n-1, to see whether the sum of the subarray A[i..j] is no greater than k, and return the pair maximizing the length j-i+1. To avoid repeatedly computing subarray sums, one can first preprocess the array to get an array S of prefix sums, where S[i] = \sum_{j = 0}^{i-1} A[j], for i = 1, \cdots, n, and, by definition, S[0] = 0. Then the sum of the subarray A[i..j] is S[j+1] - S[i]. The preprocessing takes linear time and linear space.

In the following we describe an algorithm with time complexity O(n \log n).

Algorithm LONGESTSUBARRAY(A[0..n-1], k)
Input: An array A[0..n-1] and a real number k
Output: A pair of indices i, j, maximizing j - i + 1, subject to \sum_{t = i}^j A[t] \leq k
1. Compute the array S[0..n] of partial sums, where S[i] is the prefix sum of the first i numbers in A. S[0] \leftarrow 0.
2. (l, r) \leftarrow (0,0)
3. M \leftarrow \emptyset
4. smallest \leftarrow +\infty
5. for i \leftarrow n down to 0
    if S[i] < {\rm smallest}
        M \leftarrow M \cup \{(S[i], i)\}
        {\rm smallest} \leftarrow S[i]
6. {\rm current} \leftarrow -\infty
7. for i \leftarrow 0 up to n-1
    if S[i] > {\rm current}
        Use binary search to look for the rightmost element (S[t], t) in M such that S[t] - S[i] \leq k. If such an element exists, update (l, r) to (i, t) if t-i \geq r - l
        {\rm current} \leftarrow S[i]
8. return (l, r-1) (the subarray is A[l..r-1], since l and r are indices into the prefix-sum array S)

Correctness of the Algorithm
The key is to observe the following two facts.
Claim: (1) If two indices r < r' satisfy S[r] \geq S[r'], then r cannot appear as the right endpoint in an optimum solution; (2) if two indices l < l' satisfy S[l] \geq S[l'], then l' cannot appear as the left endpoint in an optimum solution.
Proof: For any index i, 0 \leq i \leq n, note that the sum of the subarray A[i..r'-1] is at most the sum of the subarray A[i..r-1], while the length of A[i..r'-1] is greater than the length of A[i..r-1] (both lengths might even be negative, but the statement still holds). Therefore, (i, r') is preferable to (i, r). This is also the reason why M stores a sequence of strictly increasing prefix sums.
The second statement follows by similar reasoning.

Time Complexity and Space Complexity
Step 1, computing the prefix sums, takes O(n) time. Step 5, computing the list M, takes O(n) time. Step 7 takes O(n \log n) time, as each of its iterations takes O(\log n) time. The space used by S and M is O(n).
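For concreteness, a C++ sketch of LONGESTSUBARRAY (my own rendering; 0-based arrays, with M holding the candidate prefix-sum indices t, so the reported subarray is A[l..r] inclusive):

#include <vector>
#include <algorithm>
#include <climits>
#include <utility>
using namespace std;

// Returns (l, r) such that A[l..r] is a longest subarray with sum <= k,
// or (0, -1) if no single element qualifies.
pair<int, int> longestSubarray(const vector<long long> &A, long long k)
{
    int n = A.size();
    vector<long long> S(n + 1, 0);              // S[i] = A[0] + ... + A[i-1]
    for (int i = 0; i < n; ++i) S[i + 1] = S[i] + A[i];

    // Candidate right endpoints (exclusive): indices t whose prefix sum is
    // strictly smaller than every prefix sum to their right; after the
    // reversal, both the indices and the values S[t] are increasing.
    vector<int> M;
    long long smallest = LLONG_MAX;
    for (int t = n; t >= 0; --t)
        if (S[t] < smallest) { M.push_back(t); smallest = S[t]; }
    reverse(M.begin(), M.end());

    int l = 0, r = -1;                          // best subarray found so far
    long long current = LLONG_MIN;
    for (int i = 0; i < n; ++i) {               // candidate left endpoints
        if (S[i] <= current) continue;          // dominated by an earlier index
        current = S[i];
        // rightmost t in M with S[t] <= S[i] + k
        auto it = upper_bound(M.begin(), M.end(), S[i] + k,
                              [&S](long long v, int t) { return v < S[t]; });
        if (it != M.begin()) {
            int t = *(it - 1);
            if (t - i > r - l + 1) { l = i; r = t - 1; }
        }
    }
    return {l, r};
}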

Task Scheduling and Balanced Binary Search Tree

The Task Scheduling problem is from Hackerrank.

Task Scheduling
Given a list of n tasks \{(d_i, m_i)\}, where the i^{\rm th} task has deadline d_i and duration m_i, compute, for each i = 1, \cdots, n, the minimum possible maximum lateness of the first i tasks.

First we introduce some notation that will be used in the following discussion. Given a set of n tasks, a schedule of the tasks is a sequence t_{i_1}, t_{i_2}, \cdots, t_{i_n}, where i_1, \cdots, i_n is a permutation of [n]. For example, a schedule for t_1, t_2, t_3 could be t_2, t_3, t_1 (we first process task t_2, then t_3, and lastly t_1). In a given schedule, the start time s_i of task t_i is the sum of the durations of all the tasks before t_i in the schedule, and the finish time f_i is s_i + m_i (the start time of t_i plus the time needed to process t_i). The lateness l_i of t_i in the schedule is the difference between its finish time and its deadline, f_i - d_i. Here is an example (the input is from Hackerrank). Suppose we have 5 tasks, \{(2,2), (1,1), (4,3), (10,1), (2,1)\}. The first task t_1 has a deadline at 2, and we need 2 time units to process it; similarly, the second task t_2 has a deadline at 1 and duration 1, and so on. A valid schedule for these 5 tasks might be t_2, t_1, t_5, t_3, t_4. The start time s_2 of t_2 is 0, its finish time f_2 is 1, and its lateness is l_2 = f_2 - d_2 = 1 - 1 = 0; the start time s_1 of task t_1 is 1 (the finish time of the task just preceding it), and the finish time f_1 is s_1 + m_1 = 1 + 2 = 3, so its lateness is f_1 - d_1 = 3 - 2 = 1. Similarly, we can compute the start time/finish time/lateness of the remaining tasks: s_5 = 3, f_5 = 4, l_5 = 2; s_3 = 4, f_3 = 7, l_3 = 3; s_4 = 7, f_4 = 8, l_4 = -2. So the maximum lateness, \max_{1 \leq i \leq 5} l_i, is 3, which is the lateness of t_3.

A Greedy Algorithm to Process a Set of Tasks
If we are given n tasks and our goal is to find an optimum schedule for the n tasks, we can solve this problem with a greedy algorithm: sort the tasks by deadline in non-decreasing order; that order is a schedule minimizing the maximum lateness. The time complexity is O(n \log n).

The algorithm, which we will refer to as {\rm GREEDYSCHEDULE}, is the following:

\underline {\rm GREEDYSCHEDULE}
Input: a set of n tasks, each task t_i specified by (d_i, m_i)
Output: a schedule t_{i_1}, t_{i_2}, \cdots, t_{i_n}, where i_j \in [n], and the maximum lateness l^{\ast} = \max_{1 \leq i \leq n} l_i
1. Sort the tasks by deadline and renumber them. Let \mathcal{T} be the new sequence of tasks.
2. f \leftarrow 0
3. l^{\ast} \leftarrow -\infty
4. for each t_i in \mathcal{T}
    f \leftarrow f + m_i
    if f - d_i > l^{\ast}
        l^{\ast} \leftarrow f - d_i
5. Output l^{\ast} and \mathcal{T}

The time complexity of the above algorithm is O(n \log n): the first step, sorting, takes O(n \log n), and the rest takes O(n).

An Incremental Algorithm to Process n Sets of Tasks
Let us now consider the Task Scheduling problem from Hackerrank. Here we need to compute the maximum lateness for n sets of tasks, namely the first i tasks, for each i = 1, \cdots, n. For notational convenience, in the following we use T_i to denote the set of the first i tasks. Note that T_1 \subset T_2 \subset \cdots \subset T_n. The possible solutions might be:

  • The most naive method: repeat {\rm GREEDYSCHEDULE} for each T_i, for i = 1, \cdots, n. The time complexity is \sum_{i = 1}^n O(i \log i), which is O(n^2 \log n).
  • The relatively less naive method: note that we do not need to repeat the sorting step. Once we have the optimum schedule for T_i, we can use binary search to find the position to insert t_{i+1}, which takes O(\log i) time. But we still have to update the lateness. How can we do this? Since all the tasks after the newly inserted task are delayed by a further m_{i+1} time units, their lateness might increase. So if we scan all i+1 tasks to find the new maximum lateness, this takes O(i) time, and the overall time complexity is \sum_{i = 1}^n (O(\log i) + O(i)), which is O(n \log n) + O(n^2), i.e., O(n^2).

In the following, we describe an algorithm that takes O(n \log n) time and O(n) space. The main difficulty is updating the lateness, which in turn involves two subproblems: (1) compute the finish time of the newly inserted task t_{i+1}, so that we can compute its lateness l_{i+1}; and (2) compute the maximum lateness of all the tasks that are pushed m_{i+1} time units further to the right along the time line because of the insertion of t_{i+1}.

Let us first consider the first subproblem, which can be solved in logarithmic time in many ways. The easiest method to think of is probably a balanced binary search tree (note, though, that below we will use a simpler approach with the same time complexity). Given a list of tasks, we can build a balanced binary search tree \mathcal{B} over the tasks, keyed by deadline (this allows us to find the correct position for the new task t_{i+1} in O(\log i) time, as the height of a balanced binary search tree with n nodes is O(\log n)). We augment \mathcal{B} in the following way: store in each node not only d_i and m_i, but also the sum of the durations over all tasks in the subtree rooted at the node, that is, the time needed to process all of them. So each node v in \mathcal{B} has three values: v.deadline, v.duration and v.overallduration. When we insert a new task t_{i+1} into \mathcal{B}, we can update this field for all ancestors of t_{i+1} and simultaneously compute the finish time of t_{i+1} by the time we reach the bottom of \mathcal{B}: initialize f with 0; we will accumulate in f the time needed to process all tasks scheduled before t_{i+1}. Starting from the root of \mathcal{B}, for each node v we touch, add m_{i+1} to v.overallduration. If the deadline of the new task, d_{i+1}, is greater than or equal to v.deadline, descend to the right child, and add the overallduration of the left child of v plus v.duration to f, as the tasks in the left subtree of v, and v itself, are processed before t_{i+1} in the schedule; otherwise (d_{i+1} is less than v.deadline), descend to the left child and add nothing to f. When we reach the bottom of \mathcal{B}, we insert a new node u for t_{i+1}, with u.deadline = d_{i+1}, u.duration = m_{i+1}, and u.overallduration = m_{i+1} (as u has no descendant other than itself). Now f is the partial sum \sum m_j over all tasks processed before t_{i+1} in the current schedule, so the finish time of t_{i+1} is f + m_{i+1}, and its lateness in the current schedule is l_{i+1} = f + m_{i+1} - d_{i+1}. The time complexity is linear in the height of \mathcal{B}, as we only spend O(1) time per node as we walk down the tree. Since \mathcal{B} is balanced, the time complexity is thus O(\log i).

Let us now consider the second subproblem: how can we compute the maximum lateness of the current schedule? Suppose the maximum lateness is achieved at t_{i+1} or at a task after it; then we can easily get the new maximum lateness. Consider any task in T_i: if it is before t_{i+1} in the new schedule, the insertion of t_{i+1} has no effect on its lateness, while the insertion shifts all the tasks in T_i that come after t_{i+1} by m_{i+1} time units, so their lateness increases by m_{i+1}. Hence, in this case the new maximum lateness is \max\{l^{\ast}_i + m_{i+1}, l_{i+1}\}, where l^{\ast}_i is the maximum lateness of T_i. What about the case where the maximum lateness is achieved at a task before t_{i+1}? Because the lateness of all the tasks after t_{i+1} has increased, it is possible that one of them now assumes the maximum lateness. In the worst case, if t_{i+1} is inserted at the very front of the schedule, there would be i tasks after t_{i+1}; even though we do not have to recompute the lateness of each of them, it is quite inefficient to get the maximum lateness by scanning all of them.

There are two optimizations we can make here. The first is an observation about the lateness values; the other is the use of a Fenwick tree. Let us first consider the following question: in a schedule of m tasks t_1, \cdots, t_m, ordered by their deadlines, which of these tasks might assume the maximum lateness after the insertion of a new task? Suppose we have a sequence of lateness values l_1, \cdots, l_m, where l_i is the lateness of task t_i in the schedule. We make the following claim:

Claim: Consider a pair of tasks t_i and t_j, where i < j (that is, task t_i precedes task t_j in the schedule). If l_i \leq l_j, then task t_i can never have a larger lateness than task t_j.

Proof: The reason is that (1) if the new task is inserted after t_j, neither t_i nor t_j is affected, and their lateness values stay the same after the insertion; (2) if the new task is inserted after t_i and before t_j, the lateness l_i of t_i remains the same, while l_j increases; (3) if the new task is inserted before t_i, then both t_i and t_j are shifted further to the right by m, the duration of the new task, hence if l_j is no smaller than l_i, it remains so after the insertion.

This observation allows us to store only a subsequence of the lateness sequence \{l_1, \cdots, l_m\}, namely \{l_{i_1}, l_{i_2}, \cdots, l_{i_k}\}, where l_{i_1} > l_{i_2} > \cdots > l_{i_k} and i_1, \cdots, i_k is a subsequence of 1, \cdots, m. Let us denote this list of lateness values by \mathcal{L}. Now we can compute the maximum lateness after the insertion of the new task: find the position in \mathcal{L} where the lateness of t_{i+1} would be placed; suppose it falls between the entries with indices i_j and i_{j'}. The new maximum lateness is the maximum of the lateness values before and including l_{i_j} (all of these tasks precede t_{i+1}, so their lateness remains the same after the insertion), the lateness l_{i+1} of the new task, and the maximum lateness of all the tasks after and including t_{i_{j'}}, plus m_{i+1}, the duration of the new task. Since the elements of \mathcal{L} are in strictly decreasing order, the maximum of each of the two parts of \mathcal{L} is simply its first element, so determining the new maximum lateness takes O(1) time. However, we need to update the list \mathcal{L}: namely, we need to add the value m_{i+1} to each entry after and including l_{i_{j'}}, possibly insert l_{i+1} (the lateness of the new task) before l_{i_{j'}}, and remove those elements of the left part of \mathcal{L} that are no longer larger than the maximum of the updated right part, in order to maintain the strictly decreasing order of \mathcal{L}. (By the way, even if we do not optimize further and just naively update \mathcal{L}, the program already passes all the tests on Hackerrank.)

In the worst case, the size of \mathcal{L} can still be on the order of the total number of tasks, so the overall time to update \mathcal{L} after an insertion could be large (for example, if t_{i+1} is inserted before all the tasks in T_i and the tasks in the optimum schedule of T_i happen to have strictly decreasing lateness values). While practically fast enough, the time complexity is still not O(n \log n) yet. To avoid many updates to \mathcal{L} after an insertion, we use a simple trick: instead of explicitly storing the lateness values in the list, we store the deadlines. That is, if l_i appears in \mathcal{L}, we store in the same position d_i, the deadline of task t_i. The deadline serves as an id for each task: using the deadline, we can find the finish time of the task with deadline d_i, and hence compute its lateness in the current schedule, in O(\log i) time.

Now, during an insertion there might be several queries on the balanced binary search tree \mathcal{B} for finish times: among the tasks that precede t_{i+1} in the schedule and also appear in \mathcal{L}, we have to find those whose lateness is smaller than the current maximum lateness and throw them out of \mathcal{L}. This can be done by scanning the elements of \mathcal{L} backwards from the point where t_{i+1} is inserted, until we find a task preceding t_{i+1} with a larger lateness, or the list is exhausted. Since each task can only be thrown away once (if it is evicted from \mathcal{L}, it never appears in \mathcal{L} again), the overall cost of all insertions is O(n \log n).

Afterthoughts/Alternatives
Instead of a balanced binary search tree, we could also use a Fenwick tree. A Fenwick tree allows one to update an array and query partial sums on the array efficiently: each operation takes O(\log n) time. In this sense it is equivalent to a balanced binary search tree, but it is much easier to implement: although conceptually not too hard, an augmented balanced binary search tree takes many more lines of code than a Fenwick tree. There are balanced binary search trees in the STL (std::set and std::map), but currently I do not know any good way to augment them easily. In contrast, a Fenwick tree can easily be coded using an array or a hash table (unordered_map in the STL). Of course, we could also use a regular augmented binary search tree, without the rotation operations that keep the tree balanced, and rely on the fact that if the keys used to build the tree arrive in random order, the expected tree height is O(\log n).
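For reference, here is a minimal Fenwick tree sketch in C++ (my own illustration): add(i, v) adds v at position i, and prefix(i) returns the sum of positions 1..i, both in O(\log n). In the scheduling problem, position i could hold the duration of the task with the i^{\rm th} smallest deadline, so that prefix(i) is that task's finish time.

#include <vector>
using namespace std;

// Minimal Fenwick (binary indexed) tree over 1-based positions.
struct Fenwick {
    vector<long long> t;
    Fenwick(int n) : t(n + 1, 0) {}
    void add(int i, long long v)             // point update: position i += v
    {
        for (; i < (int)t.size(); i += i & (-i)) t[i] += v;
    }
    long long prefix(int i) const            // sum of positions 1..i
    {
        long long s = 0;
        for (; i > 0; i -= i & (-i)) s += t[i];
        return s;
    }
};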