Breaking a string into words

by e2718281828459045

You are given a string of n characters s[1..n], which you believe to be a corrupted text document in which all punctuation has vanished (so that it looks something like “itwasthebestoftime…”). You wish to reconstruct the document using a dictionary, which is available in the form of an Boolean function dict(\cdot): for any string w, dict(w) returns true if w is a valid word, and false otherwise.
Given a dynamic programming algorithm that determines whether the string s[\cdot] can be reconstituted as a sequence of valid words. The running time should be at most O(n^2), assuming calls to dict(\cdot) takes unit time.

The question is an exercise from “Algorithms” by S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani. It is a typical dynamic programming problem.

The recursion is straightforward: define ValidSentence(S[0..n)) to be a function over strings, where ValidSentence(S[0..0)) = {\rm true} (the empty string is considered as a valid word in dictionary, and thus a valid sentence vacuously); for n \geq 1,
ValidSentence(S[0..n)) = {\rm OR}_{1 \leq i \leq n} (dict(S[0..i)) \&\& ValidSentence(S[i..n)).
(The string S[0..n) can only be broken into valid words if and only if there exists some index j > 0, such that the string S[0..j) is a word in the dictionary, and the remaining string can be broken into a sequence of valid words). We can slightly simplify the notation by dropping the end index of the string:

ValidSentence(n) = {\rm true}
ValidSentence(i) = {\rm OR}_{i+1 \leq j \leq n} (dict(i..j) \&\& ValidSentence(j)) for 0 \leq i \leq n-1

Then we only need to compute ValidSentence(0).

The code below is “recursion + memoization” (note that “memoization” should not be confused with “memorization”). In the past I have always thought that “recursion + memoization” is roughly equivalent to iteration (since it is also very straightforward to implement the recursion in an iterative fashion). I have always slightly preferred the iterative fashion over recursion calls, since the latter involve extra bookings such as  pushing parameters on the stack and returning value, etc., though it is indeed much easier to code, and mathematically more concise than the iterative form. But the authors in the book point out an interesting situation where recursion + memoization might be better:

In some cases, though, memoization pays off. Here’s why: dynamic programming automatically solves every subproblem that could conceivably be needed, while memoization only ends up solving the ones that are actually used. …..The memoization recursive algorithm will never look at these extraneous table entries.

bool validsentence(const string &s, size_t i, unordered_map<size_t, bool> &cache, const Dictionary &dictionary)
{
if (i == s.size())
return true;

if (cache.find(i) != cache.end())
return cache[i];

bool res {};
for (size_t j = i+1; j <= s.size(); ++j)
res = res || dictionary.isword(string(s.cbegin() + i, s.cbegin() + j)) && validsentence(s, j, cache, dictionary);
cache[i] = res;
return res;
}

bool validsentence(const string &s, const Dictionary &dict)
{
unordered_map<size_t, bool> cache;
return validsentence(s, 0, cache, dict);
}