e2718281828459045

Month: August, 2013

Dynamic Programming in Practice: Optimal Data Segmentation

I have been reviewing dynamic programming recently, and today I saw a nice application of dynamic programming in astrophysics: Dynamic Programming in Python: Bayesian Blocks.

The following is the astrophysics problem phrased in a purely layman's way: given a set of data points in the plane, where each point is a pair (time point, observation value), we would like to extract as much information as possible from these observations, which, roughly speaking, involves computing a histogram of the observations. The difficulty lies in determining the bin size: if the bin size is too large, we lose important information; at the same time, we would like to suppress the noise in the observations. One approach used in the past is to adjust the bin size manually in a trial-and-error fashion: if the histogram is too jagged, increase the bin size; if it is too smooth, decrease it.

The novel idea in the paper is to first fix a fitness/cost function over partitions of the data, derived from a Bayesian likelihood framework, and then compute an optimum partition of the data points with respect to this function.

Here is the same problem phrased in the language of computer science:

Let P be a set of n points in the plane. A block B of points in P is a set of points consecutive along the x-axis. A partition of P into blocks is specified by a sequence of blocks B_1, B_2, \cdots, B_k, where the blocks are pairwise disjoint and \cup_{i=1}^k B_i = P. Let {\rm fit}(B) be the fitness of the block B. The goal is to find an optimum partition B_1, \cdots, B_k maximizing \sum_{i=1}^k {\rm fit}(B_i) (the value k is not given; it is also determined by the algorithm).

First renumber the points in P by their x-coordinates. The recursion formula is straightforward: let {\rm BestFit}(i) denote the value of an optimum partition of the points in P[1..i] (the leftmost i points in P), for 1 \leq i \leq n, and let {\rm BestFit}(0) = 0. Then {\rm BestFit}(i+1) = \max_{1 \leq j \leq i+1}\{{\rm BestFit}(j-1) + {\rm fit}(P[j..i+1])\}. There are n entries to compute in total, and we spend O(i) time on the i^{\rm th} entry, so the overall time complexity is O(n^2). Note that we assume that {\rm fit} can be evaluated in O(1) time for a block, which is usually true (possibly after preprocessing).
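To make the recursion concrete, here is a minimal C++ sketch of this O(n^2) DP (my own illustration, not code from the paper); the fitness function fit(j, i) is passed in as a callback and is assumed, as above, to evaluate {\rm fit}(P[j..i]) in O(1) time.

#include <vector>
#include <algorithm>
#include <functional>
using namespace std;

// Sketch of the O(n^2) segmentation DP. fit(j, i) is assumed to return
// fit(P[j..i]) (1-based, inclusive) in O(1) time after preprocessing.
double bestFit(int n, const function<double(int, int)> &fit)
{
    vector<double> best(n + 1);
    best[0] = 0.0;                        // empty prefix
    for (int i = 1; i <= n; ++i) {
        best[i] = -1e300;                 // we are maximizing
        for (int j = 1; j <= i; ++j)      // last block is P[j..i]
            best[i] = max(best[i], best[j - 1] + fit(j, i));
    }
    return best[n];                       // value of an optimum partition
}

Recording, for each i, the index j attaining the maximum would let us recover the actual blocks by backtracking.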

An almost identical problem is segmented least squares. Given a set P of n points in the plane, we would like to partition P into k blocks and fit each block with a line. The overall cost of the partition is the sum of the costs of the blocks.

This problem is a generalization of linear regression: in linear regression, the cost of fitting a point set P of size n to a line l_{a,b}: y = ax + b is
{\rm fit}(P, l_{a,b}) = \sum_{p \in P}(p_y - a p_x - b)^2, where p = (p_x, p_y). The "best" coefficients a and b, that is, those minimizing {\rm fit}(P, l_{a,b}), have closed forms:
a = \frac{n \sum_{p \in P}p_x p_y - (\sum_{p \in P}p_x)(\sum_{p \in P} p_y)}{n \sum_{p \in P}p_x^2 - (\sum_{p \in P}p_x)^2},

b = \frac{\sum_{p\in P} p_y - a\sum_{p \in P}p_x}{n}.

However, sometimes the data points lie on more than one line. Hence we would like to discover this fact (for example, we might be interested in (1) the number of lines, and (2) the points where the line changes direction; such points are called change points in the Bayesian block partition problem above) and find the optimum cost. A human can sometimes determine this easily, but naturally there are situations where the number of lines needed to best fit the point set is quite unclear.

We now formulate this as an optimization problem. Given a partition \mathcal{P} = \{B_1, \cdots, B_k\} of the point set into segments/blocks, where each segment consists of a subset of P contiguous along the x-axis, let l_i^{\ast} be the optimum line fitting B_i. The cost of the partition is
{\rm cost}(\mathcal{P}) = Ck + \sum_{i=1}^k {\rm fit}(B_i, l_i^{\ast}),
where C is a positive real number, which encodes the penalty for partitioning P into k parts (without this term, we could fit each single point with its own line, yielding a trivial and meaningless optimum solution of cost 0). The goal is to find an optimum partition \mathcal{P} minimizing {\rm cost}(\mathcal{P}).

The dynamic programming approach is the following. First, we renumber the points in P according to their x-coordinates (in ascending order). Define {\rm BestFit}(0) = 0, and define {\rm BestFit}(i) to be the cost of an optimum partition of the first i points of P. The recursion formula is almost identical to the one for the Bayesian block partition problem:
{\rm BestFit}(i+1) = \min_{1 \leq j \leq i+1}({\rm BestFit}(j-1) + C + {\rm fit}(P[j..i+1])), for 0 \leq i \leq n-1, where {\rm fit}(P[j..i+1]) is the optimal cost of fitting the points in P[j..i+1] with a single line (we are slightly abusing the notation here).

The time complexity is O(n^2), following the same reasoning as before. One thing to note is that we need to do some preprocessing on P, so that we can compute {\rm fit}(P[i..j]) in O(1) time.
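For segmented least squares, one standard way to get O(1)-time fit queries (a sketch under my own naming, not necessarily the exact preprocessing used in the lecture notes) is to store prefix sums of x, y, x^2, xy and y^2; the optimal a and b, and the fitting error of any segment, then follow from the closed forms above:

#include <vector>
using namespace std;

// Prefix sums enabling O(1) least-squares cost for any segment P[j..i],
// assuming the points are already sorted by x-coordinate.
struct SegmentFit {
    vector<double> sx, sy, sxx, sxy, syy;   // prefix sums; index 0 = empty

    SegmentFit(const vector<double> &x, const vector<double> &y)
        : sx(x.size()+1, 0), sy(x.size()+1, 0), sxx(x.size()+1, 0),
          sxy(x.size()+1, 0), syy(x.size()+1, 0)
    {
        for (size_t t = 0; t < x.size(); ++t) {
            sx[t+1]  = sx[t]  + x[t];
            sy[t+1]  = sy[t]  + y[t];
            sxx[t+1] = sxx[t] + x[t]*x[t];
            sxy[t+1] = sxy[t] + x[t]*y[t];
            syy[t+1] = syy[t] + y[t]*y[t];
        }
    }

    // Least-squares error of fitting P[j..i] (0-based, inclusive) to a line.
    double fit(size_t j, size_t i) const
    {
        double c   = double(i - j + 1);
        double Sx  = sx[i+1]  - sx[j],  Sy  = sy[i+1]  - sy[j];
        double Sxx = sxx[i+1] - sxx[j], Sxy = sxy[i+1] - sxy[j];
        double Syy = syy[i+1] - syy[j];
        double denom = c*Sxx - Sx*Sx;
        double a = (denom == 0) ? 0 : (c*Sxy - Sx*Sy) / denom;  // guard: all x equal
        double b = (Sy - a*Sx) / c;
        // For the optimal a and b, the normal equations simplify the error
        // sum((y - a x - b)^2) to Syy - a*Sxy - b*Sy.
        return Syy - a*Sxy - b*Sy;
    }
};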

Dynamic Programming Practice (4)

Shuffle of strings A shuffle of two strings X and Y is formed by interspersing their characters into a new string, keeping the characters of X and Y in the same order.

Example: the string “prodgyrnamammiincg” is a shuffle of “dynamic” and “programming”.

Given three strings S_1[1..m], S_2[1..n] and S[1..n+m], describe and analyze an algorithm to determine whether S is a shuffle of S_1 and S_2.

This problem is from these lecture notes.

This is a dynamic programming problem. The recursion formula is straightforward: define {\rm Shuffle}(i, j) to be true if S[1..i+j] is a shuffle of S_1[1..i] and S_2[1..j], and false otherwise, for 0 \leq i \leq m and 0 \leq j \leq n.

(1) {\rm Shuffle}(0,0) = {\rm true}. (An empty string is a shuffle of two empty strings).
(2) If j = 0, then {\rm Shuffle}(i,0) is true if and only if the string S_1[1..i] is a prefix of S, for 1 \leq i \leq m.
(3) If i = 0, then {\rm Shuffle}(0,j) is true if and only if the string S_2[1..j] is a prefix of the string S, for 1 \leq j \leq n.
(4) Now assume that both S_1[1..i] and S_2[1..j] are non-empty strings, then:

  • If S_1[i] \neq S_2[j], then {\rm Shuffle}(i,j) equals {\rm Shuffle}(i-1, j) if S[i+j] = S_1[i], and {\rm Shuffle}(i, j-1) if S[i+j] = S_2[j]; otherwise it is false.
  • If S_1[i] = S_2[j], then {\rm Shuffle}(i,j) is false if S[i+j] \neq S_1[i]; otherwise {\rm Shuffle}(i,j) = {\rm Shuffle}(i-1, j) || {\rm Shuffle}(i, j-1).

{\rm Shuffle}(m,n) is the final answer.

The time complexity of the algorithm is O(mn), and the space complexity is O(\min\{m,n\}). Here is the result of executing the algorithm on the example {\rm Shuffle}("{\rm prodgyrnamammiincg}", "{\rm dynamic}", "{\rm programming}").
[Figure: the dynamic programming table for Shuffle("prodgyrnamammiincg", "dynamic", "programming")]
Note that we do not need to maintain the whole table explicitly in the implementation: we fill the table column by column from left to right, and within each column, the entries are filled from top to bottom. So we only need to keep one column of the table at a time (the "sliding window" method).
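Here is a C++ sketch of this sliding-window implementation (0-based indices; dp[j] plays the role of one column of the table; names are my own):

#include <string>
#include <vector>
#include <iostream>
using namespace std;

// O(mn)-time, O(min(m,n))-space shuffle test: dp[j] holds Shuffle(i, j)
// for the current i, i.e., one column of the table.
bool isShuffle(const string &s1, const string &s2, const string &s)
{
    size_t m = s1.size(), n = s2.size();
    if (s.size() != m + n) return false;
    if (m < n) return isShuffle(s2, s1, s);   // keep the second string shorter

    vector<bool> dp(n + 1);
    dp[0] = true;
    for (size_t j = 1; j <= n; ++j)           // row i = 0: prefixes of s2
        dp[j] = dp[j-1] && s2[j-1] == s[j-1];
    for (size_t i = 1; i <= m; ++i) {
        dp[0] = dp[0] && s1[i-1] == s[i-1];   // column j = 0: prefixes of s1
        for (size_t j = 1; j <= n; ++j)
            dp[j] = (dp[j]   && s1[i-1] == s[i+j-1]) ||   // take from s1
                    (dp[j-1] && s2[j-1] == s[i+j-1]);     // take from s2
    }
    return dp[n];
}

int main()
{
    cout << boolalpha
         << isShuffle("dynamic", "programming", "prodgyrnamammiincg") << endl; // true
}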

One thing to note in this table is that most of the entries seem to be false, so is there a linear-time (that is, O(n+m)) algorithm for this problem?

A probability puzzle: loaded dice

I came across this question on Interview Street:

Random number generator
There is an ideal random number generator which, given a positive integer M, can generate any real number between 0 and M with equal probability.
Suppose we generate two numbers x and y via the generator by giving it two positive integers A and B. What is the probability that x + y is less than C, where C is a positive integer?

Input Format
The first line of the input is an integer N, the number of test cases.
N lines follow. Each line contains 3 positive integers A, B and C.
All the integers are no larger than 10000.

Output Format
For each test case, output a fraction that indicates the probability. The greatest common divisor of each numerator/denominator pair should be 1.

Input
3
1 1 1
1 1 2
1 1 3

Output
1/2
1/1
1/1

One way to think about this problem is via generating functions. Suppose the distribution is not uniform; is there an efficient way to compute the probability that the sum of the two randomly generated numbers is no greater than C?

Suppose that for the first random number generator, the probability that the randomly generated number X equals i is p_i, for 0 \leq i \leq A; similarly, for the second generator, the probability that the randomly generated number Y equals i is q_i, for 0 \leq i \leq B. Define two polynomials P(x) = \sum_{j=0}^A p_j x^j and Q(x) = \sum_{j = 0}^B q_j x^j. Then consider the product P(x)Q(x):
P(x)Q(x) = \sum_{k = 0}^{A+B}\left(\sum_{i+j = k,\, 0 \leq i \leq A,\, 0 \leq j \leq B} p_i q_j\right) x^k.
The k^{\rm th} coefficient on the right hand side of the above equation is precisely the probability that X+Y=k. So the probability that X+Y \leq k is the sum of the first k+1 coefficients. In the Interview Street problem, since the distribution is uniform, each p_i is 1/(A+1) and each q_i is 1/(B+1), so we can easily obtain the probability that X+Y < C.
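As an illustration (a sketch with hypothetical names, for the discrete setting just described), the coefficient vector of P(x)Q(x) is simply the convolution of the two probability vectors:

#include <vector>
using namespace std;

// Coefficients of P(x)Q(x): c[k] = Pr(X + Y = k), where p[i] = Pr(X = i)
// over 0..A and q[j] = Pr(Y = j) over 0..B.
vector<double> convolve(const vector<double> &p, const vector<double> &q)
{
    vector<double> c(p.size() + q.size() - 1, 0.0);
    for (size_t i = 0; i < p.size(); ++i)
        for (size_t j = 0; j < q.size(); ++j)
            c[i + j] += p[i] * q[j];
    return c;
}

// Pr(X + Y <= k) is the prefix sum of the first k+1 coefficients.
double prSumAtMost(const vector<double> &p, const vector<double> &q, size_t k)
{
    vector<double> c = convolve(p, q);
    double total = 0.0;
    for (size_t t = 0; t <= k && t < c.size(); ++t)
        total += c[t];
    return total;
}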

The probability puzzle (using generating functions) is the following:
Loaded dice: No matter what two loaded dice we have, it cannot happen that each of the sums 2,3, …, 12 comes up with the same probability.
(I first came across this puzzle in the book "The Art of Mathematics".)

We can reason as follows: let p_i denote the probability that the first die shows i, for 1 \leq i \leq 6, and similarly let q_i denote the probability that the second die shows i, for 1 \leq i \leq 6. Suppose each sum occurs with probability 1/11. The probability that the sum is 2 is p_1 q_1 = 1/11 (both dice must show 1), and the probability that the sum is 12 is p_6 q_6 = 1/11 (both dice must show 6). Now consider the probability that the sum is 7: it is at least p_1 q_6 + p_6 q_1 (the first die shows 1 and the second shows 6, or the reverse). But p_1 q_6 + p_6 q_1 = (1/11)(p_1/p_6 + p_6/p_1) \geq 2/11 by the AM-GM inequality, contradicting the assumption that the sum 7 occurs with probability 1/11.

This reasoning works, but the book provides a much nicer proof using generating functions. As before, let P(x) = \sum_{i = 1}^6 p_i x^{i-1} and Q(x) = \sum_{i=1}^6 q_i x^{i-1}. Now if the sums 2, 3, \cdots, 12 all come up with the same probability 1/11, then we have
11 P(x)Q(x) = 1 + x + x^2 + \cdots + x^{10} = (x^{11}-1)/(x-1). Since p_6 q_6 = 1/11 > 0, both P(x) and Q(x) are polynomials of degree 5, and a polynomial of odd degree has a real root, so 11 P(x)Q(x) has real roots; however, the right hand side is the cyclotomic polynomial 1 + x + \cdots + x^{10}, whose roots are the non-trivial 11th roots of unity, so it has no real roots. Thus the identity cannot hold.

Here is another application of generating functions for computing probabilities.

Counting heads Given integers n and k, along with p_1, ..., p_n \in [0,1], you want to determine the probability of obtaining exactly k heads when n biased coins are tossed independently at random, where p_i is the probability that the i^{\rm th} coin comes up heads. Assume that you can multiply and add two numbers in [0,1] in O(1) time.

(This problem is from this book.)

This problem can be solved using dynamic programming: let X_i be the indicator random variable for the event that the i^{\rm th} coin comes up heads, for 1 \leq i \leq n, and write q_i = 1 - p_i. Notice that
{\rm Pr}(\sum_{i=1}^n X_i = k) = {\rm Pr}(\sum_{i = 1}^{n-1} X_i = k-1 \text{ and } X_n = 1) + {\rm Pr}(\sum_{i=1}^{n-1} X_i = k \text{ and } X_n = 0) = p_n {\rm Pr}(\sum_{i=1}^{n-1} X_i = k-1) + q_n {\rm Pr}(\sum_{i=1}^{n-1} X_i = k).
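A minimal C++ sketch of this O(n^2) DP, processing one coin at a time (array and function names are my own):

#include <vector>
using namespace std;

// heads[k] holds Pr(exactly k heads among the coins processed so far);
// processing coin i applies the recursion above, in place.
vector<double> countHeads(const vector<double> &p)   // p[i] = Pr(coin i+1 is heads)
{
    int n = p.size();
    vector<double> heads(n + 1, 0.0);
    heads[0] = 1.0;                        // zero coins: zero heads, surely
    for (int i = 0; i < n; ++i)
        for (int k = i + 1; k >= 0; --k)   // descend so heads[k-1] is still old
            heads[k] = (k > 0 ? p[i] * heads[k - 1] : 0.0)
                     + (1.0 - p[i]) * heads[k];
    return heads;                          // heads[k] = Pr(exactly k heads)
}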

The time complexity is O(n^2). However, with the generating-function approach, we can get a faster algorithm using the fast Fourier transform:
Let P_i(x) = q_i + p_i x, for 1 \leq i \leq n. Then consider the product of these n polynomials of degree 1: \prod_{i = 1}^n P_i(x) = \sum_{j = 0}^n c_j x^j. The probability of obtaining k heads is the coefficient c_k. Therefore, the problem is reduced to how fast we can multiply polynomials. Using the fast Fourier transform, we can multiply two polynomials of degree at most n in O(n \log n) time. Hence, we can use divide-and-conquer to multiply these n polynomials:
\prod_{i=1}^n P_i(x) = \left(\prod_{i=1}^{n/2} P_i(x)\right) \times \left(\prod_{i=n/2+1}^n P_i(x)\right).
Let T(n) denote the time to multiply n polynomials of degree 1. Then we have the following recursion:
T(n) = 2 T(n/2) + O(n \log n),
as in the "conquer" step we are multiplying two polynomials of degree n/2. Hence T(n) = O(n \log^2 n).
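The divide-and-conquer structure might look like the following sketch; multiply() here is the naive product, and replacing it with an FFT-based O(d \log d) multiplication gives the O(n \log^2 n) bound:

#include <vector>
using namespace std;

// Naive O(ab) polynomial product; an FFT-based multiplication would be
// substituted here to achieve the bound discussed above.
vector<double> multiply(const vector<double> &a, const vector<double> &b)
{
    vector<double> c(a.size() + b.size() - 1, 0.0);
    for (size_t i = 0; i < a.size(); ++i)
        for (size_t j = 0; j < b.size(); ++j)
            c[i + j] += a[i] * b[j];
    return c;
}

// Product of the degree-1 polynomials q_i + p_i x for i in [lo, hi).
vector<double> product(const vector<double> &p, size_t lo, size_t hi)
{
    if (hi - lo == 1)
        return {1.0 - p[lo], p[lo]};          // q_i + p_i x
    size_t mid = lo + (hi - lo) / 2;
    return multiply(product(p, lo, mid), product(p, mid, hi));
}
// The coefficient of x^k in product(p, 0, n) is Pr(exactly k heads).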

Coin Exchange: Dynamic Programming Practice (3)

Coin Exchange: Given a set of m denominations \{d_0, \cdots, d_{m-1}\} and a target value k, find the smallest number of bills whose values sum to k (each denomination may be used multiple times). For example, if the denominations are \{1,4,7,13,28,52,91,365\} and the target is 122, then an optimum way to pick the bills is 3 \times 1 + 28 + 91, so the number of bills we need is 5.

This is a very common dynamic programming problem. To simplify things a bit, we assume that the smallest denomination is 1, which ensures that for any target value there is at least one way to reach it (a sequence of 1's). W.l.o.g., assume that d_0 < d_1 < \cdots < d_{m-1}. The recursion formula is the following: define {\rm MinBill}(x, i) to be the smallest number of bills needed to reach x, with the constraint that the largest denomination we may use is d_i, for 0 \leq i \leq m-1. The base cases are: {\rm MinBill}(x, 0) = x for x \in \mathbb{N}, and {\rm MinBill}(x, i) = 0 for x \leq 0. For x \geq 1 and 1 \leq i \leq m-1, if x < d_i, then {\rm MinBill}(x, i) = {\rm MinBill}(x, i'), where i' is the largest index satisfying d_{i'} \leq x; otherwise, {\rm MinBill}(x, i) = \min\{1 + {\rm MinBill}(x - d_i, i), {\rm MinBill}(x, i-1)\}.

Time complexity/space complexity: there are several ways to implement the dynamic programming algorithm. If we use the table-filling approach, the time complexity is O(m N), where m is the number of denominations and N is the target value, so it is a pseudo-polynomial time algorithm. The space complexity is O(N). In the code below, I used the "recursion + memoization" approach (the reason is given here).

#include <vector>
#include <algorithm>
#include <functional>
#include <unordered_map>
#include <utility>
#include <iostream>
using namespace std;

size_t hasher(const pair<int, int> &p)
{
    return hash<int>()(p.first) + hash<int>()(p.second);
}

class CoinExchange {
public:
    CoinExchange(const vector<int> &Denominations): Denom {Denominations}, cache(Denom.size(), hasher) {sort(Denom.begin(), Denom.end());}
    int operator()(int k);
    int MinCoin(int target, int largest);
private:
    vector<int> Denom;
    unordered_map<pair<int, int>, int, decltype(hasher) *> cache;
};

int CoinExchange::MinCoin(int target, int largest)
{
    if (target <= 0)               // nothing left to pay
        return 0;

    if (!largest)                  // only the smallest denomination (1) remains
        return target;

    if (target < Denom[largest]) {
        // Skip denominations that are larger than the remaining target.
        largest = upper_bound(Denom.begin(), Denom.end(), target) - Denom.begin() - 1;
        return MinCoin(target, largest);
    }

    if (cache.find({target, largest}) != cache.end())
        return cache[{target, largest}];
    // Either spend one bill of Denom[largest], or stop using that denomination.
    return cache[{target, largest}] = min(1 + MinCoin(target - Denom[largest], largest),
                                          MinCoin(target, largest - 1));
}

int CoinExchange::operator()(int target)
{
    return MinCoin(target, Denom.size()-1);
}

int main()
{
    vector<int> denominations {1,4,7,13,28,52,91,365};
    int target;
    cin >> target;
    CoinExchange coinexchange {denominations};
    cout << coinexchange(target) << endl;
}

Update:
See here for a very well-explained article for solving the coin exchange problem in O(1) time.

Dynamic Programming Practice (2)

Maximum set of non-intersecting lines Consider two horizontal lines l_1 and l_2 in the plane. Suppose we are given the x-coordinates of n distinct points a_1, a_2, \cdots, a_n on l_1 and n distinct points b_1, b_2, \cdots, b_n on l_2. Design an algorithm to compute the largest set S of non-intersecting line segments satisfying the following restrictions:

(a) Each segment in S connects some point a_i to the corresponding point b_i.
(b) No two segments in S intersect.

This problem is from the chapter of dynamic programming of these lecture notes.

It is quite easy to derive a recursion formula for this problem: let P denote the set of n segments s_1, \cdots, s_n, where s_i connects a_i to b_i. Then the size of a maximum set of non-intersecting segments of P is {\rm OPT}(P) = \max\{{\rm OPT}(P\setminus \{s_1\}), 1+{\rm OPT}(P\setminus {\rm Intersect}(s_1))\}, where {\rm Intersect}(s_i) denotes the set of segments in P intersecting s_i, together with s_i itself. However, in order to use the dynamic programming approach, it is not clear how the subproblems overlap and in what order we should evaluate them.

An alternative approach is to formulate the problem as a graph problem: each segment s_i corresponds to a vertex in the graph, and two vertices are connected by an edge if the segments corresponding to them intersect. Then the problem is equivalent to finding a maximum independent set in this graph. However, a maximum independent set can be computed easily in polynomial time (via dynamic programming) only when the graph has some special structure, for example, when the underlying graph is a tree. The intersection graph of the segments clearly need not have such a property.

So how can we use dynamic programming to solve this problem? The key observation is the following: suppose we renumber the indices of \{a_1, \cdots, a_n\} so that they are ordered from left to right, and we also renumber the corresponding indices of the b_i's (so when we change the index of some a_k to a_i, we also change b_k to b_i). Now look at the sequence of indices of the b_i's. The key observation is that a maximum set of non-intersecting segments corresponds to a longest increasing subsequence of this sequence. The renumbering step can be done in O(n \log n) time, and it takes O(n \log n) time to find a longest increasing subsequence, so the overall time complexity is O(n \log n).
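A C++ sketch of this reduction, using the standard patience-sorting computation of a longest increasing subsequence (names are my own):

#include <vector>
#include <algorithm>
#include <utility>
using namespace std;

// Sort the segments by a_i; the answer is the length of a longest
// increasing subsequence of the corresponding b_i's.
int maxNonIntersecting(vector<pair<double, double>> segments)  // (a_i, b_i)
{
    sort(segments.begin(), segments.end());     // renumbering by a_i
    // tail[t] = smallest possible last b-value of an increasing
    // subsequence of length t+1 (patience-sorting LIS)
    vector<double> tail;
    for (const auto &s : segments) {
        auto it = lower_bound(tail.begin(), tail.end(), s.second);
        if (it == tail.end()) tail.push_back(s.second);
        else *it = s.second;
    }
    return (int)tail.size();
}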

Largest k elements in a stream

Given a sequence of numbers, whose length is unknown, find the k largest elements.

This problem is an interesting variant of the largest k elements problem, which is usually phrased as follows: given a sequence of n numbers and a number k, find the k largest elements. There are two standard ways to solve this problem: (1) use the selection algorithm to find the element of rank k in linear time, and then do another scan to find all elements no smaller than this k^{\rm th} largest element. The time complexity is O(n), and the space complexity is O(n). (2) We can reduce the space complexity by maintaining a min-heap of size k. For each element after the k^{\rm th} one, if it is smaller than the top element of the heap, discard it; otherwise, pop the current smallest element from the heap and push the new element in. The time complexity is O(n \log k), and the space complexity is O(k).

How do we achieve O(n) time while using only O(k) space? The solution is the following: use an array of size 2k - 1. While the array is not full, add new elements to it. When the array fills up, use the linear-time median selection algorithm to find the median of the current 2k-1 elements, and discard the k-1 elements smaller than the median. This takes O(k) time. The array fills up O(n/k) times, so the overall time complexity is O(n/k) \cdot O(k) = O(n).
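Here is a sketch of this buffer trick in C++, with nth_element playing the role of the linear-time selection algorithm (function names are mine; assumes k \geq 1):

#include <vector>
#include <algorithm>
#include <functional>
#include <iostream>
using namespace std;

// O(n)-time, O(k)-space stream version: buffer up to 2k-1 elements; on
// overflow, select the k-th largest and drop the k-1 elements below it.
vector<int> largestK(istream &in, size_t k)
{
    vector<int> buf;
    buf.reserve(2 * k - 1);
    int x;
    while (in >> x) {
        buf.push_back(x);
        if (buf.size() == 2 * k - 1) {
            // Places the k-th largest at index k-1, larger elements before it.
            nth_element(buf.begin(), buf.begin() + (k - 1), buf.end(), greater<int>());
            buf.resize(k);               // keep the k largest seen so far
        }
    }
    if (buf.size() > k) {                // final cleanup of the buffer
        nth_element(buf.begin(), buf.begin() + (k - 1), buf.end(), greater<int>());
        buf.resize(k);
    }
    return buf;                          // the k largest (or fewer) elements
}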

I came across this solution from Algorithms Notes.

Breaking a string into words

You are given a string of n characters s[1..n], which you believe to be a corrupted text document in which all punctuation has vanished (so that it looks something like "itwasthebestoftime…"). You wish to reconstruct the document using a dictionary, which is available in the form of a Boolean function dict(\cdot): for any string w, dict(w) returns true if w is a valid word, and false otherwise.
Give a dynamic programming algorithm that determines whether the string s[\cdot] can be reconstituted as a sequence of valid words. The running time should be at most O(n^2), assuming calls to dict(\cdot) take unit time.

The question is an exercise from “Algorithms” by S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani. It is a typical dynamic programming problem.

The recursion is straightforward: define ValidSentence(S[0..n)) to be a predicate over strings, where ValidSentence(S[0..0)) = {\rm true} (the empty string is considered a valid word in the dictionary, and thus vacuously a valid sentence); for n \geq 1,
ValidSentence(S[0..n)) = {\rm OR}_{1 \leq i \leq n} (dict(S[0..i)) \&\& ValidSentence(S[i..n))).
(The string S[0..n) can be broken into valid words if and only if there exists some index i > 0 such that the prefix S[0..i) is a word in the dictionary and the remaining string can be broken into a sequence of valid words.) We can slightly simplify the notation by dropping the end index of the string:

ValidSentence(n) = {\rm true}
ValidSentence(i) = {\rm OR}_{i+1 \leq j \leq n} (dict(S[i..j)) \&\& ValidSentence(j)), for 0 \leq i \leq n-1

Then we only need to compute ValidSentence(0).

The code below uses "recursion + memoization" (note that "memoization" should not be confused with "memorization"). In the past I had always thought that "recursion + memoization" is roughly equivalent to iteration (it is also very straightforward to implement the recursion in an iterative fashion). I have always slightly preferred the iterative style over recursive calls, since the latter involves extra bookkeeping such as pushing parameters onto the stack and returning values, though recursion is indeed much easier to code, and mathematically more concise than the iterative form. But the authors of the book point out an interesting situation where recursion + memoization might be better:

In some cases, though, memoization pays off. Here's why: dynamic programming automatically solves every subproblem that could conceivably be needed, while memoization only ends up solving the ones that are actually used. … The memoized recursive algorithm will never look at these extraneous table entries.

bool validsentence(const string &s, size_t i, unordered_map<size_t, bool> &cache, const Dictionary &dictionary)
{
    if (i == s.size())
        return true;

    if (cache.find(i) != cache.end())
        return cache[i];

    bool res {};
    for (size_t j = i + 1; j <= s.size(); ++j)
        res = res || (dictionary.isword(string(s.cbegin() + i, s.cbegin() + j))
                      && validsentence(s, j, cache, dictionary));
    cache[i] = res;
    return res;
}

bool validsentence(const string &s, const Dictionary &dict)
{
    unordered_map<size_t, bool> cache;
    return validsentence(s, 0, cache, dict);
}

Longest subarray whose sum <= k

Longest subarray whose sum \leq k
Given an array A of n numbers and a key k, find the longest subarray of A for which the subarray sum is less than or equal to k.

This problem is from the book “Elements of Programming Interviews”.

This problem has a rather easy quadratic-time algorithm: check each pair of indices i \leq j, where 0 \leq i \leq j \leq n-1, to see whether the sum of the subarray A[i..j] is no greater than k, and return the pair maximizing the length j-i+1. To avoid repeatedly computing subarray sums, one can first preprocess the array to get an array S of prefix sums, where S[i] = \sum_{j = 0}^{i-1} A[j], for i = 1, \cdots, n, and, by definition, S[0] = 0. Then the sum of the subarray A[i..j] is S[j+1] - S[i]. The preprocessing takes linear time and linear space.

In the following we describe an algorithm with time complexity O(n \log n).

Algorithm LONGESTSUBARRAY(A[0..n-1], k)
Input: An array A[0..n-1] and a real number k
Output: A pair of indices i, j, maximizing j - i + 1, subject to \sum_{t = i}^j A[t] \leq k
1. Compute the array S[0..n] of partial sums, where S[i] is the prefix sum of the first i numbers in A. S[0] \leftarrow 0.
2. (l, r) \leftarrow (0,0)
3. M \leftarrow \emptyset
4. smallest \leftarrow +\infty
5. for i \leftarrow n down to 0
    if S[i] < {\rm smallest}
        M \leftarrow M \cup \{(S[i], i)\}
        {\rm smallest} \leftarrow S[i]
6. {\rm current} \leftarrow -\infty
7. for i \leftarrow 0 up to n-1
    if S[i] > {\rm current}
        Use binary search to look for the rightmost element (S[t], t) in M such that S[t] - S[i] \leq k. If such an element exists, update (l, r) to (i, t) if t-i \geq r - l
        {\rm current} \leftarrow S[i]
8. return (l, r-1) (the subarray is A[l..r-1], since l and r are indices into the prefix-sum array S)

Correctness of the Algorithm
The key is to observe the following two facts.
Claim: (1) If two indices r < r' satisfy S[r] \geq S[r'], then r cannot appear as the right endpoint in an optimum solution; (2) if two indices l < l' satisfy S[l] \geq S[l'], then l' cannot appear as the left endpoint in an optimum solution.
Proof: For any index i, 0 \leq i \leq n, note that the sum of the subarray A[i..r'-1] is at most the sum of the subarray A[i..r-1], while the length of A[i..r'-1] is greater than the length of A[i..r-1] (both lengths might even be negative, but the statement still holds). Therefore, (i, r') is preferable to (i, r). This is also the reason why M stores a sequence of strictly increasing prefix sums.
The second statement follows by similar reasoning.

Time Complexity and Space Complexity
Step 1, computing the prefix sums, takes O(n) time. Step 5, computing the list M, takes O(n) time. Step 7 takes O(n \log n) time, as each of its iterations takes O(\log n) time. The space used by S and M is O(n).
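For concreteness, a C++ sketch of LONGESTSUBARRAY (my own rendering; 0-based arrays, with M holding the candidate prefix-sum indices t, so the reported subarray is A[l..r] inclusive):

#include <vector>
#include <algorithm>
#include <climits>
#include <utility>
using namespace std;

// Returns (l, r) such that A[l..r] is a longest subarray with sum <= k,
// or (0, -1) if no single element qualifies.
pair<int, int> longestSubarray(const vector<long long> &A, long long k)
{
    int n = A.size();
    vector<long long> S(n + 1, 0);              // S[i] = A[0] + ... + A[i-1]
    for (int i = 0; i < n; ++i) S[i + 1] = S[i] + A[i];

    // Candidate right endpoints (exclusive): indices t whose prefix sum is
    // strictly smaller than every prefix sum to their right; after the
    // reversal, both the indices and the values S[t] are increasing.
    vector<int> M;
    long long smallest = LLONG_MAX;
    for (int t = n; t >= 0; --t)
        if (S[t] < smallest) { M.push_back(t); smallest = S[t]; }
    reverse(M.begin(), M.end());

    int l = 0, r = -1;                          // best subarray found so far
    long long current = LLONG_MIN;
    for (int i = 0; i < n; ++i) {               // candidate left endpoints
        if (S[i] <= current) continue;          // dominated by an earlier index
        current = S[i];
        // rightmost t in M with S[t] <= S[i] + k
        auto it = upper_bound(M.begin(), M.end(), S[i] + k,
                              [&S](long long v, int t) { return v < S[t]; });
        if (it != M.begin()) {
            int t = *(it - 1);
            if (t - i > r - l + 1) { l = i; r = t - 1; }
        }
    }
    return {l, r};
}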

Task Scheduling and Balanced Binary Search Tree

The Task Scheduling problem is from Hackerrank.

Task Scheduling
Given a list of n tasks \{(d_i, m_i)\}, where the i^{\rm th} task has deadline d_i and duration m_i, compute, for each i = 1, \cdots, n, the minimum possible maximum lateness of the first i tasks.

First we introduce some notation that will be used in the following discussion. Given a set of n tasks, a schedule of the tasks is a sequence t_{i_1}, t_{i_2}, \cdots, t_{i_n}, where i_1, \cdots, i_n is a permutation of [n]. For example, a schedule for t_1, t_2, t_3 could be t_2, t_3, t_1 (we first process task t_2, then t_3, and lastly t_1). In a given schedule, the start time s_i of task t_i is the sum of the durations of all the tasks before t_i in the schedule, and the finish time f_i is s_i + m_i (the start time of t_i plus the time needed to process t_i). The lateness l_i of t_i in the schedule is the difference between its finish time and its deadline, f_i - d_i. Here is an example (the input is from Hackerrank). Suppose we have 5 tasks, \{(2,2), (1,1), (4,3), (10,1), (2,1)\}. The first task t_1 has a deadline at 2, and we need 2 time units to process it; similarly, the second task t_2 has a deadline at 1 and duration 1, and so on. A valid schedule for these 5 tasks might be t_2, t_1, t_5, t_3, t_4. The start time s_2 of t_2 is 0, its finish time f_2 is 1, and its lateness is l_2 = f_2 - d_2 = 1 - 1 = 0; the start time s_1 of task t_1 is 1 (the finish time of the task just preceding it), and the finish time f_1 is s_1 + m_1 = 1 + 2 = 3, so its lateness is f_1 - d_1 = 3 - 2 = 1. Similarly, we can compute the start time/finish time/lateness of the remaining tasks: s_5 = 3, f_5 = 4, l_5 = 2; s_3 = 4, f_3 = 7, l_3 = 3; s_4 = 7, f_4 = 8, l_4 = -2. So the maximum lateness, \max_{1 \leq i \leq 5} l_i, is 3, which is the lateness of t_3.

A Greedy Algorithm to Process a Set of Tasks
If we are given n tasks and our goal is to find an optimum schedule for the n tasks, we can solve this problem with a greedy algorithm: sort the tasks by deadline in non-decreasing order; that order is a schedule minimizing the maximum lateness. The time complexity is O(n \log n).

The algorithm, which we will refer to as {\rm GREEDYSCHEDULE}, is the following:

\underline {\rm GREEDYSCHEDULE}
Input: a set of n tasks, each task t_i specified by (d_i, m_i)
Output: a schedule t_{i_1}, t_{i_2}, \cdots, t_{i_n}, where i_j \in [n], and the maximum lateness l^{\ast} = \max_{1 \leq i \leq n} l_i
1. Sort the tasks by deadline and renumber them. Let \mathcal{T} be the new sequence of tasks.
2. f \leftarrow 0
3. l^{\ast} \leftarrow -\infty
4. for each t_i in \mathcal{T}
    f \leftarrow f + m_i
    if f - d_i > l^{\ast}
        l^{\ast} \leftarrow f - d_i
5. Output l^{\ast} and \mathcal{T}

The time complexity of the above algorithm is O(n \log n): the first step, sorting, takes O(n \log n), and the rest takes O(n).

An Incremental Algorithm to Process n Sets of Tasks
Let us now consider the Task Scheduling problem from Hackerrank. Here we need to compute the maximum lateness for n sets of tasks, namely the first i tasks, for each i = 1, \cdots, n. For notational convenience, in the following we use T_i to denote the set of the first i tasks. Note that T_1 \subset T_2 \subset \cdots \subset T_n. The possible solutions might be:

  • The most naive method: repeat {\rm GREEDYSCHEDULE} for each T_i, for i = 1, \cdots, n. The time complexity is \sum_{i = 1}^n O(i \log i), which is O(n^2 \log n).
  • The relatively less naive method: note that we do not need to repeat the sorting step. Once we have the optimum schedule for T_i, we can use binary search to find the position to insert t_{i+1}, which takes O(\log i) time. But we still have to update the lateness. How can we do this? Since all the tasks after the newly inserted task are delayed by a further m_{i+1} time units, their lateness might increase. So if we scan all i+1 tasks to find the new maximum lateness, this takes O(i) time, and the overall time complexity is \sum_{i = 1}^n (O(\log i) + O(i)), which is O(n \log n) + O(n^2), i.e., O(n^2).

In the following, we describe an algorithm that takes O(n \log n) time and O(n) space. The main difficulty is updating the lateness, which in turn involves two subproblems: (1) compute the finish time of the newly inserted task t_{i+1}, so that we can compute its lateness l_{i+1}; and (2) compute the maximum lateness of all the tasks that are pushed m_{i+1} time units further to the right along the time line because of the insertion of t_{i+1}.

Let us first consider the first subproblem, which can be solved in logarithmic time in many ways. The easiest method to think of is probably a balanced binary search tree (note, though, that below we will use a simpler approach with the same time complexity). Given a list of tasks, we can build a balanced binary search tree \mathcal{B} over the tasks, keyed by deadline (this allows us to find the correct position for the new task t_{i+1} in O(\log i) time, as the height of a balanced binary search tree with n nodes is O(\log n)). We augment \mathcal{B} in the following way: store in each node not only d_i and m_i, but also the sum of the durations over all tasks in the subtree rooted at the node, that is, the time needed to process all of them. So each node v in \mathcal{B} has three values: v.deadline, v.duration and v.overallduration. When we insert a new task t_{i+1} into \mathcal{B}, we can update this field for all ancestors of t_{i+1} and simultaneously compute the finish time of t_{i+1} by the time we reach the bottom of \mathcal{B}: initialize f with 0; we will accumulate in f the time needed to process all tasks scheduled before t_{i+1}. Starting from the root of \mathcal{B}, for each node v we touch, add m_{i+1} to v.overallduration. If the deadline of the new task, d_{i+1}, is greater than or equal to v.deadline, descend to the right child, and add the overallduration of the left child of v plus v.duration to f, as the tasks in the left subtree of v, and v itself, are processed before t_{i+1} in the schedule; otherwise (d_{i+1} is less than v.deadline), descend to the left child and add nothing to f. When we reach the bottom of \mathcal{B}, we insert a new node u for t_{i+1}, with u.deadline = d_{i+1}, u.duration = m_{i+1}, and u.overallduration = m_{i+1} (as u has no descendant other than itself). Now f is the partial sum \sum m_j over all tasks processed before t_{i+1} in the current schedule, so the finish time of t_{i+1} is f + m_{i+1}, and its lateness in the current schedule is l_{i+1} = f + m_{i+1} - d_{i+1}. The time complexity is linear in the height of \mathcal{B}, as we only spend O(1) time per node as we walk down the tree. Since \mathcal{B} is balanced, the time complexity is thus O(\log i).

Let us now consider the second subproblem: how can we compute the maximum lateness of the current schedule? Suppose the maximum lateness is achieved at t_{i+1} or at a task after it; then we can easily get the new maximum lateness. Consider any task in T_i: if it is before t_{i+1} in the new schedule, the insertion of t_{i+1} has no effect on its lateness, while the insertion shifts all the tasks in T_i that come after t_{i+1} by m_{i+1} time units, so their lateness increases by m_{i+1}. Hence, in this case the new maximum lateness is \max\{l^{\ast}_i + m_{i+1}, l_{i+1}\}, where l^{\ast}_i is the maximum lateness of T_i. What about the case where the maximum lateness is achieved at a task before t_{i+1}? Because the lateness of all the tasks after t_{i+1} has increased, it is possible that one of them now assumes the maximum lateness. In the worst case, if t_{i+1} is inserted at the very front of the schedule, there would be i tasks after t_{i+1}; even though we do not have to recompute the lateness of each of them, it is quite inefficient to get the maximum lateness by scanning all of them.

There are two optimizations we can make here. The first is an observation about the lateness values; the other is the use of a Fenwick tree. Let us first consider the following question: in a schedule of m tasks t_1, \cdots, t_m, ordered by their deadlines, which of these tasks might assume the maximum lateness after the insertion of a new task? Suppose we have a sequence of lateness values l_1, \cdots, l_m, where l_i is the lateness of task t_i in the schedule. We make the following claim:

Claim: Consider a pair of tasks t_i and t_j, where i < j (that is, task t_i precedes task t_j in the schedule). If l_i \leq l_j, then task t_i can never have a larger lateness than task t_j.

Proof: The reason is that (1) if the new task is inserted after t_j, neither t_i nor t_j is affected, and their lateness values stay the same after the insertion; (2) if the new task is inserted after t_i and before t_j, the lateness l_i of t_i remains the same, while l_j increases; (3) if the new task is inserted before t_i, then both t_i and t_j are shifted further to the right by m, the duration of the new task, hence if l_j is no smaller than l_i, it remains so after the insertion.

This observation allows us to store only a subsequence of the lateness sequence \{l_1, \cdots, l_m\}, namely \{l_{i_1}, l_{i_2}, \cdots, l_{i_k}\}, where l_{i_1} > l_{i_2} > \cdots > l_{i_k} and i_1, \cdots, i_k is a subsequence of 1, \cdots, m. Let us denote this list of lateness values by \mathcal{L}. Now we can compute the maximum lateness after the insertion of the new task: find the position in \mathcal{L} where the lateness of t_{i+1} would be placed; suppose it falls between the entries with indices i_j and i_{j'}. The new maximum lateness is the maximum of the lateness values before and including l_{i_j} (all of these tasks precede t_{i+1}, so their lateness remains the same after the insertion), the lateness l_{i+1} of the new task, and the maximum lateness of all the tasks after and including t_{i_{j'}}, plus m_{i+1}, the duration of the new task. Since the elements of \mathcal{L} are in strictly decreasing order, the maximum of each of the two parts of \mathcal{L} is simply its first element, so determining the new maximum lateness takes O(1) time. However, we need to update the list \mathcal{L}: namely, we need to add the value m_{i+1} to each entry after and including l_{i_{j'}}, possibly insert l_{i+1} (the lateness of the new task) before l_{i_{j'}}, and remove those elements of the left part of \mathcal{L} that are no longer larger than the maximum of the updated right part, in order to maintain the strictly decreasing order of \mathcal{L}. (By the way, even if we do not optimize further and just naively update \mathcal{L}, the program already passes all the tests on Hackerrank.)

In the worst case, the size of \mathcal{L} can still be on the order of the total number of tasks, so the overall time to update \mathcal{L} after an insertion could be large (for example, if t_{i+1} is inserted before all the tasks in T_i and the tasks in the optimum schedule of T_i happen to have strictly decreasing lateness values). While practically fast enough, the time complexity is still not O(n \log n) yet. To avoid many updates to \mathcal{L} after an insertion, we use a simple trick: instead of explicitly storing the lateness values in the list, we store the deadlines. That is, if l_i appears in \mathcal{L}, we store in the same position d_i, the deadline of task t_i. The deadline serves as an id for each task: using the deadline, we can find the finish time of the task with deadline d_i, and hence compute its lateness in the current schedule, in O(\log i) time.

Now, during an insertion there might be several queries on the balanced binary search tree \mathcal{B} for finish times: among the tasks that precede t_{i+1} in the schedule and also appear in \mathcal{L}, we have to find those whose lateness is smaller than the current maximum lateness and throw them out of \mathcal{L}. This can be done by scanning the elements of \mathcal{L} backwards from the point where t_{i+1} is inserted, until we find a task preceding t_{i+1} with a larger lateness, or the list is exhausted. Since each task can only be thrown away once (if it is evicted from \mathcal{L}, it never appears in \mathcal{L} again), the overall cost of all insertions is O(n \log n).

Afterthoughts/Alternatives
Instead of a balanced binary search tree, we could also use a Fenwick tree. A Fenwick tree allows one to update an array and query partial sums on the array efficiently: each operation takes O(\log n) time. In this sense it is equivalent to a balanced binary search tree, but it is much easier to implement: although conceptually not too hard, an augmented balanced binary search tree takes many more lines of code than a Fenwick tree. There are balanced binary search trees in the STL (std::set and std::map), but currently I do not know any good way to augment them easily. In contrast, a Fenwick tree can easily be coded using an array or a hash table (unordered_map in the STL). Of course, we could also use a regular augmented binary search tree, without the rotation operations that keep the tree balanced, and rely on the fact that if the keys used to build the tree arrive in random order, the expected tree height is O(\log n).
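For reference, here is a minimal Fenwick tree sketch in C++ (my own illustration): add(i, v) adds v at position i, and prefix(i) returns the sum of positions 1..i, both in O(\log n). In the scheduling problem, position i could hold the duration of the task with the i^{\rm th} smallest deadline, so that prefix(i) is that task's finish time.

#include <vector>
using namespace std;

// Minimal Fenwick (binary indexed) tree over 1-based positions.
struct Fenwick {
    vector<long long> t;
    Fenwick(int n) : t(n + 1, 0) {}
    void add(int i, long long v)             // point update: position i += v
    {
        for (; i < (int)t.size(); i += i & (-i)) t[i] += v;
    }
    long long prefix(int i) const            // sum of positions 1..i
    {
        long long s = 0;
        for (; i > 0; i -= i & (-i)) s += t[i];
        return s;
    }
};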