The DC Clojure Study Group is digging into Clojure. One of our members, Luke VanderHart, contributed a program that parses a text file to build a Markov Chain generator: it will produce an arbitrarily long text that should mimic the style of the file you gave it by basing its output on the proximity of words or characters in the file. Cool stuff.
What I’m posting here is a close reading and interpretation of Luke’s code. I’d like to understand it myself and elucidate it for others, and I want to identify places in the code where future lessons about Clojure could take root. For example, how could, or should, concurrency be used to increase the efficiency or modularity of the creation of the index at the heart of the generator? As it stands now, the code makes excellent use of lazy sequences and Java interop for reading the file. So let’s roll up our sleeves and dig in.
The Code
First of all, Luke’s code is available from our Google group: http://tinyurl.com/markov. Go ahead and get it. I’m not going to reproduce it all here, since this post will be long enough as it is. The code is what he posted and updated on December 14th, 2008.
Luke has done us a big favor by making his code quite readable. The program is divided into sections: File IO, index generation, chain generation, and execution examples. Furthermore, definitions of values and functions are well-commented.
First, the program defines two sets of characters, word-break-chars and semantic-punct-chars, that will help define where words in the “corpus,” or the text file you supply to the program, begin and end: the text will be split into words at any of the word-break-chars, whereas chars in semantic-punct-chars will be considered separate words themselves.
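To make that concrete, here is a sketch of the idea with illustrative members (the members shown are my guesses, not Luke’s exact definitions). A Clojure set is a function of its elements, returning the element if present and nil otherwise, which makes membership tests read cleanly:

```clojure
;; Illustrative character sets (not Luke's exact definitions):
;; word-break-chars separate words and are discarded;
;; semantic-punct-chars separate words but count as words themselves.
(def word-break-chars #{\space \tab \newline \return})
(def semantic-punct-chars #{\. \, \; \: \! \?})

;; A set used as a function tests membership:
(word-break-chars \space)   ;=> \space (truthy: a break character)
(semantic-punct-chars \;)   ;=> \; (truthy: punctuation kept as its own "word")
(word-break-chars \a)       ;=> nil (part of a word)
```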
Next, two functions that produce sequences from the corpus are defined. I’m only going to focus on word-seq. It takes a Java reader as an argument and recursively takes characters from the reader until it identifies a complete word or the reader signals it is finished with a -1. Notice that word-seq is lazy: it supplies one word at a time, and it does not read to the end of the reader’s stream unless asked to.
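To illustrate the shape of such a function, here is a simplified sketch (mine, not Luke’s code): it splits only on whitespace, ignoring semantic punctuation, but it shows the lazy recursion over a java.io.Reader and the -1 end-of-stream check:

```clojure
;; A simplified sketch of a lazy word sequence over a java.io.Reader:
;; accumulate characters until whitespace or end of stream (-1), then
;; cons the word onto a lazily computed rest of the sequence.
(defn sketch-word-seq [^java.io.Reader rdr]
  (lazy-seq
    (loop [word []]
      (let [c (.read rdr)]
        (cond
          ;; end of stream: emit any word in progress, then stop
          (= c -1) (when (seq word) (list (apply str word)))
          ;; whitespace ends a word (or is skipped if no word started)
          (Character/isWhitespace (char c))
            (if (seq word)
              (cons (apply str word) (sketch-word-seq rdr))
              (recur word))
          ;; otherwise keep accumulating the current word
          :else (recur (conj word (char c))))))))

(take 3 (sketch-word-seq (java.io.StringReader. "YOU don't know about me")))
;=> ("YOU" "don't" "know")
```

Because the recursive call sits inside lazy-seq, only the words you actually consume are ever read from the stream.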
Take a look at piece-file, a convenience function that takes a file name to use as a corpus and a sequencing function (e.g. word-seq):
(defn piece-file
  "returns a file as a lazy seq of pieces, using the supplied piecing function"
  [fname piece-function]
  (piece-function (java.io.BufferedReader. (java.io.FileReader. fname))))
You’ve probably seen this before, but it’s worth noting how easily Java interop is accomplished in Clojure. A new FileReader, taking the file name as the argument to its constructor, is instantiated and passed as an argument to the BufferedReader constructor, and presto — you’ve got a BufferedReader ready to go. Specifically, it’s ready to be passed to the sequencing function as a reader.
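As a self-contained illustration of the same interop pattern (using a StringReader in place of a FileReader, so no file is needed), notice how the nested constructor calls read inside-out:

```clojure
;; Same constructor-nesting pattern as piece-file, but over an in-memory
;; string rather than a file: the inner reader is built first, then
;; wrapped by the BufferedReader.
(let [rdr (java.io.BufferedReader.
            (java.io.StringReader. "a line of text"))]
  (.readLine rdr))
;=> "a line of text"
```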
A sequencing example
Here’s what the product of piece-file looks like when used with word-seq and a file containing a certain amount of the beginning of The Adventures of Huckleberry Finn, available from Project Gutenberg.
> (def corpus (piece-file "/arbitrary_path/huck-finn.txt" word-seq))
#'user/corpus
> (take 21 corpus)
("YOU" "don't" "know" "about" "me" "without" "you" "have" "read" "a" "book" "by" "the" "name" "of" "The" "Adventures" "of" "Tom" "Sawyer" \;)
Indexing
The next section of Luke’s code contains a trio of indexing functions that produce a hash-map: the keys are word sequences found in the corpus, and the values are maps of “completions” to counts. Each word (or word sequence) that was seen following the key maps to the number of times it was seen in that position.
That last bit was a mouthful, so let me show you an example using my Huck Finn corpus:
> (def indx (build-index corpus 2 1))
Here is the “signature” of build-index:
(defn build-index [full-corpus key-length completion-length] ...)
It takes a corpus, called full-corpus to distinguish the complete corpus from the steadily shrinking portion of the corpus on which an internal loop will recur. The key-length and completion-length parameters are integers that determine how many words will be grouped into each key of the index and how many words will be grouped into “completions”, that is, sequences that follow the key in the corpus. In my example, and by default in Luke’s program, the keys will be two words and the completions one word.
So build-index is instructed to read the sequence of words from my Huck Finn file, reading two words at a time and then noting the word, aka the completion, that follows those two.
- If build-index has seen this two-word sequence before (i.e. there is a key matching the sequence in the index hash-map), it checks whether it has already seen the following word after the two-word sequence.
  - If so, it increments the integer value in the occurrence that matches the word.
  - Otherwise, it adds an occurrence: a map of the word to the number 1, for one occurrence.
- If build-index has not seen this two-word sequence before, it adds a new key-value pair to the index, with the two-word sequence as the key and a single occurrence map as the value.
- Then the process begins again on Huck Finn minus the word at the “front”. This is what (recur index (nthrest corpus completion-length)) does in this case.
- When the corpus has been processed, the index is returned.
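The steps above can be sketched as a standalone function (my reconstruction, not Luke’s exact code). update-in with a default covers all three cases at once: new key, known key with a new completion, and known key with an already-seen completion. (fnil arrived in Clojure 1.2; in 2008-era Clojure you would write #(inc (or % 0)) instead.)

```clojure
;; Bump the count for this key/completion pair, starting from 0 if the
;; key or completion has never been seen.
(defn index-step [index key-words completion]
  (update-in index [key-words completion] (fnil inc 0)))

;; Walk the corpus: take key-length words as the key and the next
;; completion-length words as the completion, record the pair, then
;; slide forward by completion-length words and recur.
(defn sketch-build-index [words key-length completion-length]
  (loop [index {} corpus words]
    (let [key-words  (take key-length corpus)
          completion (take completion-length (drop key-length corpus))]
      (if (seq completion)
        (recur (index-step index key-words completion)
               (nthrest corpus completion-length))
        index))))

(sketch-build-index ["I" "was" "glad" "I" "was" "there"] 2 1)
;; here the key ("I" "was") maps to {("glad") 1, ("there") 1}
```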
Here’s what I get from having build-index work on my Huck Finn snippet, with some formatting added to my REPL output to make it more readable:
> (def indx (build-index corpus 2 1))
#'user/indx
> (take 5 indx)
([("only" "everything") {("was") 2}]
 [("was" "no") {("kin") 1}]
 [("over" "the") {("victuals") 2}]
 [("was" "all") {("right") 1}]
 [("I" "was") {("glad") 1, ("there") 1, ("fidgety") 1, ("dead") 1, ("in") 1}])
This tells us that the index has captured certain information about my Huck Finn snippet — for example, the word “was” follows “only everything” twice, and the sequence “I was” is followed by “glad”, “there”, “fidgety”, “dead”, and “in”, each one time.
If you’d like to poke around with the index, you can take advantage of Clojure’s gorgeous treatment of maps as functions: pass a key to a map as though the map were an operator, and you get the value, if any:
> (take 5 (keys indx)) (("only" "everything") ("was" "no") ("over" "the") ("was" "all") ("I" "was"))
> (indx '("I" "was")) {("glad") 1, ("there") 1, ("fidgety") 1, ("dead") 1, ("in") 1}
> (keys (indx '("I" "was"))) (("glad") ("there") ("fidgety") ("dead") ("in"))
Generation
Now we want to take our index and generate some eerily Huck-Finn-like text with it. This is where Markov comes in. As the Wikipedia article about Markov chains explains, there are situations where a future state can be predicted, in this case probabilistically, solely from the present state: to wit, the to-be-generated text’s contents can be predicted from the index of the corpus. We are building the new text solely out of pieces of Huck Finn.
To implement the probability part of the generator, we have markov-next, which, as Luke succinctly explains in his documentation, “Given an index and a key, picks a statistically consistent completion”. I want to look at how it does that.
(let [completions (index key)
probabilities (vals completions)
total-probabilities (reduce + probabilities)
random (inc (rand-int total-probabilities))] ...
The index is our index of sequences and occurrences of completions in Huck Finn. The key is one of those sequences, maybe '("I" "was"). In that case, here is how the bindings of the let evaluate:
completions (index key) ;=> {("glad") 1, ("there") 1, ("fidgety") 1,
("dead") 1, ("in") 1}
probabilities (vals completions) ;=> (1 1 1 1 1)
total-probabilities (reduce + probabilities) ;=> 5
random (inc (rand-int total-probabilities)) ;=> a number between 1 and 5, inclusive
After a defensive check to make sure completions isn’t somehow nil, the markov-next function enters this loop:
(loop [current 0 remaining (seq completions)]
(let [next (+ current (last (first remaining)))]
(if (<= current random next)
(ffirst remaining)
(recur next (rest remaining)))))
The loop counts up from zero: on each pass it adds the frequency number of the first remaining completion to its counter current, binding the sum to next, and then checks whether the number random falls between, or is equal to either of, current and next.
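The same walk can be written as a standalone function (the names here are mine, not Luke’s), using key and val on map entries in place of ffirst and (last (first …)); passing random in as an argument makes the behavior easy to test:

```clojure
;; Walk the completion map, accumulating frequencies; return the
;; completion whose cumulative frequency range contains `random`.
(defn weighted-pick [completions random]
  (loop [current 0, remaining (seq completions)]
    (let [next (+ current (val (first remaining)))]
      (if (<= current random next)
        (key (first remaining))
        (recur next (rest remaining))))))

;; With three equally likely completions, random = 3 lands on the third
;; entry in the map's iteration order:
(weighted-pick {'("glad") 1, '("there") 1, '("fidgety") 1} 3)
;=> ("fidgety")
```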
If you’ve never seen the <= operator used with more than two arguments (I hadn’t), try doing what I did: use doc at the REPL to get the documentation for Clojure’s <= operator:
> (doc <=)
-------------------------
clojure.core/<=
([x] [x y] [x y & more])
Returns non-nil if nums are in monotonically non-decreasing order,
otherwise false.
That’s the Lisp world for you. Somewhere in 50 years of hacking, somebody decided <= shouldn’t just compare two numbers but rather check the order of a whole sequence of numbers. Bless their heart.
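A few quick REPL experiments show what that docstring means:

```clojure
(<= 1 2 2 3)  ;=> true  (the numbers never decrease)
(<= 1 3 2)    ;=> false (3 followed by 2 breaks the order)
(< 1 2 2 3)   ;=> false (strict < rejects the repeated 2)
```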
So the loop recurs until it can place its random number in the range defined by the frequency of a completion: the more frequent the completion, the more likely that the loop will choose it. When that happens, ffirst is applied to the remaining completions, with the chosen one at the front. (ffirst x) is an abbreviated form for (first (first x)). So, if random came out to 3, and remaining has been recursively set to ([("fidgety") 1] [("dead") 1] [("in") 1]), then ffirst will give us ("fidgety"), a fine Twainian word, if not quite as good as “fantods”.
The markov-chain function calls markov-next recursively to build our new text one completion at a time. It too is lazy, so it will only build as long a text as you ask it to. get-random-key chooses a starting key for the process.
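To see the lazy recursion in miniature, here is a sketch (mine, not Luke’s exact code) with the picking function passed in as an argument; a deterministic picker makes the chain’s behavior easy to follow:

```clojure
;; Each step looks up the current key, emits a completion, and slides
;; the key window forward over the newly emitted words. The chain ends
;; when a key has no entry in the index.
(defn sketch-markov-chain [index current-key pick-fn]
  (lazy-seq
    (when (contains? index current-key)
      (let [completion (pick-fn index current-key)]
        (concat completion
                (sketch-markov-chain index
                                     (concat (drop (count completion) current-key)
                                             completion)
                                     pick-fn))))))

;; A deterministic "picker" for demonstration: always the first entry.
(defn pick-first [index k] (key (first (index k))))

(def tiny-index {'("I" "was")    {'("glad") 1}
                 '("was" "glad") {'("I") 1}
                 '("glad" "I")   {'("was") 1}})

(take 6 (sketch-markov-chain tiny-index '("I" "was") pick-first))
;=> ("glad" "I" "was" "glad" "I" "was")
```

In the real generator the picker would be markov-next’s weighted random choice; the laziness means the (potentially infinite) chain is only realized as far as you take it.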
Ta da!
Let’s try it out:
> (def chain (markov-chain indx (get-random-key indx)))
> (println (apply str (interpose \space (take 50 chain))))
was rough living in the house all the year round --more than a body could tell
what to do with . The Widow Douglas is all told about in that book , which was
no kin to her , and it fetched us a dollar a day apiece all the
That’s pretty good.
Tweaks and future directions
A get-random-start-word function might ensure that the first random key begins with a capital letter:
(defn get-random-start-word
  "Determines a random key, whose first word must begin with a capital
   letter, from the supplied index with which to start generation"
  [index]
  (loop [random-key (get-random-key index)]
    (if (re-seq #"^[A-Z]" (str (first random-key)))
      random-key
      (recur (get-random-key index)))))
> (get-random-start-word indx)
("I" "warn't")
> (get-random-start-word indx)
("She" "put")
> (def chain (markov-chain indx (get-random-start-word indx)))
> (println (apply str (interpose \space (take 50 chain))))
Mark Twain , and you had to wait for the whole world ;
she was going to start a band of robbers , and I was dead . ...
This of course assumes that there is at least one capitalized word in the corpus.
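An alternative (hypothetical) formulation sidesteps that assumption: collect the capitalized keys once, then pick from them, returning nil instead of looping forever when there are none. (rand-nth arrived in Clojure 1.2.)

```clojure
;; Keys whose first word starts with a capital letter.
(defn capitalized-keys [index]
  (filter #(re-find #"^[A-Z]" (str (first %))) (keys index)))

;; nil when no key qualifies, instead of an infinite loop.
(defn get-random-capitalized-key [index]
  (when-let [ks (seq (capitalized-keys index))]
    (rand-nth (vec ks))))
```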
Of course, it would be fun to put this on the web, behind a form for uploading texts or linking to online texts. A Clojure program should be able to create and wrap a servlet, and there’s some buzz about a web library for Clojure called compojure.
Concurrency
It might be interesting to look at introducing some concurrency into the generator. The index hash-map gets passed around and operated on a lot. Would it be beneficial to make a ref to it? Then concurrent indexing or generating would become possible. If we wanted index keys and values of more than one word length, the index could be built up roughly in parallel, with one process parsing and adding one-word keys and values, another handling two-word keys, etc. Is this the right way to think about concurrency? If anyone has some ideas, I’d love to hear them.
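One concrete (and entirely speculative) direction, sketched below: instead of sharing one index through a ref, index overlapping chunks of the corpus independently and in parallel with pmap, then combine the partial indexes with merge-with, summing counts for keys that appear in more than one chunk. The names and the fixed completion length of one are my simplifications, not anything in Luke’s program.

```clojure
;; Index a single chunk with sliding windows: key-length words of key,
;; one following word as the completion (completion length fixed at 1).
(defn index-chunk [key-length chunk]
  (reduce (fn [idx window]
            (update-in idx
                       [(take key-length window) (list (nth window key-length))]
                       #(inc (or % 0))))
          {}
          (partition (inc key-length) 1 chunk)))

;; Merge partial indexes, summing counts for shared key/completion pairs.
(defn merge-indexes [indexes]
  (apply merge-with (fn [a b] (merge-with + a b)) {} indexes))

;; Chunks overlap by key-length words so no window is lost at a chunk
;; boundary; pmap indexes the chunks on separate threads.
(defn parallel-build-index [words key-length chunk-size]
  (let [chunks (partition (+ chunk-size key-length) chunk-size [] words)]
    (merge-indexes (pmap (partial index-chunk key-length) chunks))))

;; Sanity check: the parallel result matches indexing the whole corpus
;; in one pass.
(= (index-chunk 2 ["I" "was" "glad" "I" "was" "there"])
   (parallel-build-index ["I" "was" "glad" "I" "was" "there"] 2 2))
;=> true
```

Since the partial indexes are plain immutable maps combined at the end, no ref or coordination is needed during the parallel phase at all, which may be the more Clojure-ish answer to the question above.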