Breaking a string into words

by e2718281828459045

You are given a string of $n$ characters $s[1..n]$ , which you believe to be a corrupted text document in which all punctuation has vanished (so that it looks something like “itwasthebestoftime…”). You wish to reconstruct the document using a dictionary, which is available in the form of an Boolean function $dict(\cdot)$ : for any string $w$ , $dict(w)$ returns true if $w$ is a valid word, and false otherwise.
Given a dynamic programming algorithm that determines whether the string $s[\cdot]$ can be reconstituted as a sequence of valid words. The running time should be at most $O(n^2)$ , assuming calls to $dict(\cdot)$ takes unit time.

The question is an exercise from “Algorithms” by S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani. It is a typical dynamic programming problem.

The recursion is straightforward: define $ValidSentence(S[0..n))$ to be a function over strings, where $ValidSentence(S[0..0)) = {\rm true}$ (the empty string is considered as a valid word in dictionary, and thus a valid sentence vacuously); for $n \geq 1$ ,
$ValidSentence(S[0..n)) = {\rm OR}_{1 \leq i \leq n} (dict(S[0..i)) \&\& ValidSentence(S[i..n))$ .
(The string $S[0..n)$ can only be broken into valid words if and only if there exists some index $j > 0$ , such that the string $S[0..j)$ is a word in the dictionary, and the remaining string can be broken into a sequence of valid words). We can slightly simplify the notation by dropping the end index of the string:

$ValidSentence(n) = {\rm true}$
$ValidSentence(i) = {\rm OR}_{i+1 \leq j \leq n} (dict(i..j) \&\& ValidSentence(j))$ for $0 \leq i \leq n-1$

Then we only need to compute $ValidSentence(0)$ .

The code below is “recursion + memoization” (note that “memoization” should not be confused with “memorization”). In the past I have always thought that “recursion + memoization” is roughly equivalent to iteration (since it is also very straightforward to implement the recursion in an iterative fashion). I have always slightly preferred the iterative fashion over recursion calls, since the latter involve extra bookings such as pushing parameters on the stack and returning value, etc., though it is indeed much easier to code, and mathematically more concise than the iterative form. But the authors in the book point out an interesting situation where recursion + memoization might be better:

In some cases, though, memoization pays off. Here’s why: dynamic programming automatically solves every subproblem that could conceivably be needed, while memoization only ends up solving the ones that are actually used. …..The memoization recursive algorithm will never look at these extraneous table entries.

bool validsentence(const string &s, size_t i, unordered_map<size_t, bool> &cache, const Dictionary &dictionary)
{
if (i == s.size())
return true;

if (cache.find(i) != cache.end())
return cache[i];

bool res {};
for (size_t j = i+1; j <= s.size(); ++j)
res = res || dictionary.isword(string(s.cbegin() + i, s.cbegin() + j)) && validsentence(s, j, cache, dictionary);
cache[i] = res;
return res;
}

bool validsentence(const string &s, const Dictionary &dict)
{
unordered_map<size_t, bool> cache;
return validsentence(s, 0, cache, dict);
}

e2718281828459045

Breaking a string into words

by e2718281828459045

Share this:

Related

Leave a comment Cancel reply