The DC Clojure Study Group is digging into Clojure. One of our members, Luke VanderHart, contributed a program that parses a text file to build a Markov Chain generator: it will produce an arbitrarily long text that should mimic the style of the file you gave it by basing its output on the proximity of words or characters in the file. Cool stuff.
What I’m posting here is a close reading and interpretation of Luke’s code. I’d like to understand it myself and elucidate it for others, and I want to identify places in the code where future lessons about Clojure could take root. For example, how could, or should, concurrency be used to increase the efficiency or modularity of the creation of the index at the heart of the generator? As it stands now, the code makes excellent use of lazy sequences and Java interop for reading the file. So let’s roll up our sleeves and dig in.
The Code
First of all, Luke’s code is available from our Google group: http://tinyurl.com/markov. Go ahead and get it. I’m not going to reproduce it all here, since this post will be long enough as it is. The code is what he posted and updated on December 14th, 2008.
Luke has done us a big favor by making his code quite readable. The program is divided into sections: File IO, index generation, chain generation, and execution examples. Furthermore, definitions of values and functions are well-commented.
First, the program defines two sets of characters, word-break-chars and semantic-punct-chars, that will help define where words in the “corpus,” or the text file you supply to the program, begin and end: the text will be split into words at any of the word-break-chars, whereas chars in semantic-punct-chars will be considered separate words themselves.
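To make that concrete, here is a sketch of the idea with illustrative members (the members shown are my guesses, not Luke’s exact definitions). A Clojure set is a function of its elements, returning the element if present and nil otherwise, which makes membership tests read cleanly:

```clojure
;; Illustrative character sets (not Luke's exact definitions):
;; word-break-chars separate words and are discarded;
;; semantic-punct-chars separate words but count as words themselves.
(def word-break-chars #{\space \tab \newline \return})
(def semantic-punct-chars #{\. \, \; \: \! \?})

;; A set used as a function tests membership:
(word-break-chars \space)   ;=> \space (truthy: a break character)
(semantic-punct-chars \;)   ;=> \; (truthy: punctuation kept as its own "word")
(word-break-chars \a)       ;=> nil (part of a word)
```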
Next, two functions that produce sequences from the corpus are defined. I’m only going to focus on word-seq. It takes a Java reader as an argument and recursively takes characters from the reader until it identifies a complete word or the reader signals it is finished with a -1. Notice that word-seq is lazy: it supplies one word at a time, and it does not read to the end of the reader’s stream unless asked to.
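To illustrate the shape of such a function, here is a simplified sketch (mine, not Luke’s code): it splits only on whitespace, ignoring semantic punctuation, but it shows the lazy recursion over a java.io.Reader and the -1 end-of-stream check:

```clojure
;; A simplified sketch of a lazy word sequence over a java.io.Reader:
;; accumulate characters until whitespace or end of stream (-1), then
;; cons the word onto a lazily computed rest of the sequence.
(defn sketch-word-seq [^java.io.Reader rdr]
  (lazy-seq
    (loop [word []]
      (let [c (.read rdr)]
        (cond
          ;; end of stream: emit any word in progress, then stop
          (= c -1) (when (seq word) (list (apply str word)))
          ;; whitespace ends a word (or is skipped if no word started)
          (Character/isWhitespace (char c))
            (if (seq word)
              (cons (apply str word) (sketch-word-seq rdr))
              (recur word))
          ;; otherwise keep accumulating the current word
          :else (recur (conj word (char c))))))))

(take 3 (sketch-word-seq (java.io.StringReader. "YOU don't know about me")))
;=> ("YOU" "don't" "know")
```

Because the recursive call sits inside lazy-seq, only the words you actually consume are ever read from the stream.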
Take a look at piece-file, a convenience function that takes a file name to use as a corpus and a sequencing function (e.g. word-seq):
(defn piece-file
  "returns a file as a lazy seq of pieces, using the supplied piecing function"
  [fname piece-function]
  (piece-function (java.io.BufferedReader. (java.io.FileReader. fname))))
You’ve probably seen this before, but it’s worth noting how easily Java interop is accomplished in Clojure. A new FileReader, taking the file name as the argument to its constructor, is instantiated and passed as an argument to the BufferedReader constructor, and presto — you’ve got a BufferedReader ready to go. Specifically, it’s ready to be passed to the sequencing function as a reader.
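As a self-contained illustration of the same interop pattern (using a StringReader in place of a FileReader, so no file is needed), notice how the nested constructor calls read inside-out:

```clojure
;; Same constructor-nesting pattern as piece-file, but over an in-memory
;; string rather than a file: the inner reader is built first, then
;; wrapped by the BufferedReader.
(let [rdr (java.io.BufferedReader.
            (java.io.StringReader. "a line of text"))]
  (.readLine rdr))
;=> "a line of text"
```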
A sequencing example
Here’s what the product of piece-file looks like when used with word-seq and a file containing a certain amount of the beginning of The Adventures of Huckleberry Finn, available from Project Gutenberg.
> (def corpus (piece-file "/arbitrary_path/huck-finn.txt" word-seq))
#'user/corpus
> (take 21 corpus)
("YOU" "don't" "know" "about" "me" "without" "you" "have" "read" "a" "book" "by" "the" "name" "of" "The" "Adventures" "of" "Tom" "Sawyer" \;)
Indexing
The next section of Luke’s code contains a trio of indexing functions that produce a hash-map: the keys are word sequences found in the corpus, and the values are maps of “completions” to counts. Each word (or word sequence) that was seen following the key maps to the number of times it was seen in that position.
That last bit was a mouthful, so let me show you an example using my Huck Finn corpus:
> (def indx (build-index corpus 2 1))
Here is the “signature” of build-index:
(defn build-index [full-corpus key-length completion-length] ...)
It takes a corpus, called full-corpus to distinguish the complete corpus from the steadily shrinking portion of the corpus on which an internal loop will recur. The key-length and completion-length parameters are integers that determine how many words will be grouped into each key of the index and how many words will be grouped into “completions”, that is, sequences that follow the key in the corpus. In my example, and by default in Luke’s program, the keys will be two words and the completions one word.
So build-index is instructed to read the sequence of words from my Huck Finn file, reading two words at a time and then noting the word, aka the completion, that follows those two.
- If build-index has seen this two-word sequence before (i.e. there is a key matching the sequence in the index hash-map), it checks whether it has already seen the following word after the two-word sequence.
  - If so, it increments the integer value in the occurrence that matches the word.
  - Otherwise, it adds an occurrence: a map of the word to the number 1, for one occurrence.
- If build-index has not seen this two-word sequence before, it adds a new key-value pair to the index, with the two-word sequence as the key and a single occurrence map as the value.
- Then the process begins again on Huck Finn minus the word at the “front”. This is what (recur index (nthrest corpus completion-length)) does in this case.
- When the corpus has been processed, the index is returned.
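The steps above can be sketched as a standalone function (my reconstruction, not Luke’s exact code). update-in with a default covers all three cases at once: new key, known key with a new completion, and known key with an already-seen completion. (fnil arrived in Clojure 1.2; in 2008-era Clojure you would write #(inc (or % 0)) instead.)

```clojure
;; Bump the count for this key/completion pair, starting from 0 if the
;; key or completion has never been seen.
(defn index-step [index key-words completion]
  (update-in index [key-words completion] (fnil inc 0)))

;; Walk the corpus: take key-length words as the key and the next
;; completion-length words as the completion, record the pair, then
;; slide forward by completion-length words and recur.
(defn sketch-build-index [words key-length completion-length]
  (loop [index {} corpus words]
    (let [key-words  (take key-length corpus)
          completion (take completion-length (drop key-length corpus))]
      (if (seq completion)
        (recur (index-step index key-words completion)
               (nthrest corpus completion-length))
        index))))

(sketch-build-index ["I" "was" "glad" "I" "was" "there"] 2 1)
;; here the key ("I" "was") maps to {("glad") 1, ("there") 1}
```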
Here’s what I get from having build-index work on my Huck Finn snippet, with some formatting added to my REPL output to make it more readable:
> (def indx (build-index corpus 2 1))
#'user/indx
> (take 5 indx)
([("only" "everything") {("was") 2}]
 [("was" "no") {("kin") 1}]
 [("over" "the") {("victuals") 2}]
 [("was" "all") {("right") 1}]
 [("I" "was") {("glad") 1, ("there") 1, ("fidgety") 1, ("dead") 1, ("in") 1}])
This tells us that the index has captured certain information about my Huck Finn snippet — for example, the word “was” follows “only everything” twice, and the sequence “I was” is followed by “glad”, “there”, “fidgety”, “dead”, and “in”, each one time.
If you’d like to poke around with the index, you can take advantage of Clojure’s gorgeous treatment of maps as functions: pass a key to a map as though the map were an operator, and you get the value, if any:
> (take 5 (keys indx)) (("only" "everything") ("was" "no") ("over" "the") ("was" "all") ("I" "was"))
> (indx '("I" "was")) {("glad") 1, ("there") 1, ("fidgety") 1, ("dead") 1, ("in") 1}
> (keys (indx '("I" "was"))) (("glad") ("there") ("fidgety") ("dead") ("in"))
Generation
Now we want to take our index and generate some eerily Huck-Finn-like text with it. This is where Markov comes in. As the Wikipedia article about Markov chains explains, there are situations where a future state can be predicted, in this case probabilistically, solely from the present state: to wit, the to-be-generated text’s contents can be predicted from the index of the corpus. We are building the new text solely out of pieces of Huck Finn.
To implement the probability part of the generator, we have markov-next, which, as Luke succinctly explains in his documentation, “Given an index and a key, picks a statistically consistent completion”. I want to look at how it does that.
(let [completions (index key)
probabilities (vals completions)
total-probabilities (reduce + probabilities)
random (inc (rand-int total-probabilities))] ...
The index is our index of sequences and occurrences of completions in Huck Finn. The key is one of those sequences, maybe '("I" "was"). In that case, here is how the bindings of the let evaluate:
completions (index key) ;=> {("glad") 1, ("there") 1, ("fidgety") 1,
("dead") 1, ("in") 1}
probabilities (vals completions) ;=> (1 1 1 1 1)
total-probabilities (reduce + probabilities) ;=> 5
random (inc (rand-int total-probabilities)) ;=> a number between 1 and 5, inclusive
After a defensive check to make sure completions isn’t somehow nil, the markov-next function enters this loop:
(loop [current 0 remaining (seq completions)]
(let [next (+ current (last (first remaining)))]
(if (<= current random next)
(ffirst remaining)
(recur next (rest remaining)))))
The loop counts up from zero: on each pass it adds the frequency number of the first remaining completion to its counter current, binding the sum to next, and then checks whether the number random falls between, or is equal to either of, current and next.
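The same walk can be written as a standalone function (the names here are mine, not Luke’s), using key and val on map entries in place of ffirst and (last (first …)); passing random in as an argument makes the behavior easy to test:

```clojure
;; Walk the completion map, accumulating frequencies; return the
;; completion whose cumulative frequency range contains `random`.
(defn weighted-pick [completions random]
  (loop [current 0, remaining (seq completions)]
    (let [next (+ current (val (first remaining)))]
      (if (<= current random next)
        (key (first remaining))
        (recur next (rest remaining))))))

;; With three equally likely completions, random = 3 lands on the third
;; entry in the map's iteration order:
(weighted-pick {'("glad") 1, '("there") 1, '("fidgety") 1} 3)
;=> ("fidgety")
```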
If you’ve never seen the <= operator used with more than two arguments (I hadn’t), try doing what I did: use doc at the REPL to get the documentation for Clojure’s <= operator:
> (doc <=)
-------------------------
clojure.core/<=
([x] [x y] [x y & more])
Returns non-nil if nums are in monotonically non-decreasing order,
otherwise false.
That’s the Lisp world for you. Somewhere in 50 years of hacking, somebody decided <= shouldn’t just compare two numbers but rather check the order of a whole sequence of numbers. Bless their heart.
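A few quick REPL experiments show what that docstring means:

```clojure
(<= 1 2 2 3)  ;=> true  (the numbers never decrease)
(<= 1 3 2)    ;=> false (3 followed by 2 breaks the order)
(< 1 2 2 3)   ;=> false (strict < rejects the repeated 2)
```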
So the loop recurs until it can place its random number in the range defined by the frequency of a completion: the more frequent the completion, the more likely that the loop will choose it. When that happens, ffirst is applied to the remaining completions, with the chosen one at the front. (ffirst x) is an abbreviated form for (first (first x)). So, if random came out to 3, and remaining has been recursively set to ([("fidgety") 1] [("dead") 1] [("in") 1]), then ffirst will give us ("fidgety"), a fine Twainian word, if not quite as good as “fantods”.
The markov-chain function calls markov-next recursively to build our new text one completion at a time. It too is lazy, so it will only build as long a text as you ask it to. get-random-key chooses a starting key for the process.
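To see the lazy recursion in miniature, here is a sketch (mine, not Luke’s exact code) with the picking function passed in as an argument; a deterministic picker makes the chain’s behavior easy to follow:

```clojure
;; Each step looks up the current key, emits a completion, and slides
;; the key window forward over the newly emitted words. The chain ends
;; when a key has no entry in the index.
(defn sketch-markov-chain [index current-key pick-fn]
  (lazy-seq
    (when (contains? index current-key)
      (let [completion (pick-fn index current-key)]
        (concat completion
                (sketch-markov-chain index
                                     (concat (drop (count completion) current-key)
                                             completion)
                                     pick-fn))))))

;; A deterministic "picker" for demonstration: always the first entry.
(defn pick-first [index k] (key (first (index k))))

(def tiny-index {'("I" "was")    {'("glad") 1}
                 '("was" "glad") {'("I") 1}
                 '("glad" "I")   {'("was") 1}})

(take 6 (sketch-markov-chain tiny-index '("I" "was") pick-first))
;=> ("glad" "I" "was" "glad" "I" "was")
```

In the real generator the picker would be markov-next’s weighted random choice; the laziness means the (potentially infinite) chain is only realized as far as you take it.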
Ta da!
Let’s try it out:
> (def chain (markov-chain indx (get-random-key indx)))
> (println (apply str (interpose \space (take 50 chain))))
was rough living in the house all the year round --more than a body could tell
what to do with . The Widow Douglas is all told about in that book , which was
no kin to her , and it fetched us a dollar a day apiece all the
That’s pretty good.
Tweaks and future directions
A get-random-start-word function might ensure that the first random key begins with a capital letter:
(defn get-random-start-word
  "Determines a random key, whose first word must begin with a capital
   letter, from the supplied index with which to start generation"
  [index]
  (loop [random-key (get-random-key index)]
    (if (re-seq #"^[A-Z]" (str (first random-key)))
      random-key
      (recur (get-random-key index)))))
> (get-random-start-word indx)
("I" "warn't")
> (get-random-start-word indx)
("She" "put")
> (def chain (markov-chain indx (get-random-start-word indx)))
> (println (apply str (interpose \space (take 50 chain))))
Mark Twain , and you had to wait for the whole world ;
she was going to start a band of robbers , and I was dead . ...
This of course assumes that there is at least one capitalized word in the corpus.
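An alternative (hypothetical) formulation sidesteps that assumption: collect the capitalized keys once, then pick from them, returning nil instead of looping forever when there are none. (rand-nth arrived in Clojure 1.2.)

```clojure
;; Keys whose first word starts with a capital letter.
(defn capitalized-keys [index]
  (filter #(re-find #"^[A-Z]" (str (first %))) (keys index)))

;; nil when no key qualifies, instead of an infinite loop.
(defn get-random-capitalized-key [index]
  (when-let [ks (seq (capitalized-keys index))]
    (rand-nth (vec ks))))
```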
Of course, it would be fun to put this on the web, behind a form for uploading texts or linking to online texts. A Clojure program should be able to create and wrap a servlet, and there’s some buzz about a web library for Clojure called compojure.
Concurrency
It might be interesting to look at introducing some concurrency into the generator. The index hash-map gets passed around and operated on a lot. Would it be beneficial to make a ref to it? Then concurrent indexing or generating would become possible. If we wanted index keys and values of more than one word length, the index could be built up roughly in parallel, with one process parsing and adding one-word keys and values, another handling two-word keys, etc. Is this the right way to think about concurrency? If anyone has some ideas, I’d love to hear them.
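One concrete (and entirely speculative) direction, sketched below: instead of sharing one index through a ref, index overlapping chunks of the corpus independently and in parallel with pmap, then combine the partial indexes with merge-with, summing counts for keys that appear in more than one chunk. The names and the fixed completion length of one are my simplifications, not anything in Luke’s program.

```clojure
;; Index a single chunk with sliding windows: key-length words of key,
;; one following word as the completion (completion length fixed at 1).
(defn index-chunk [key-length chunk]
  (reduce (fn [idx window]
            (update-in idx
                       [(take key-length window) (list (nth window key-length))]
                       #(inc (or % 0))))
          {}
          (partition (inc key-length) 1 chunk)))

;; Merge partial indexes, summing counts for shared key/completion pairs.
(defn merge-indexes [indexes]
  (apply merge-with (fn [a b] (merge-with + a b)) {} indexes))

;; Chunks overlap by key-length words so no window is lost at a chunk
;; boundary; pmap indexes the chunks on separate threads.
(defn parallel-build-index [words key-length chunk-size]
  (let [chunks (partition (+ chunk-size key-length) chunk-size [] words)]
    (merge-indexes (pmap (partial index-chunk key-length) chunks))))

;; Sanity check: the parallel result matches indexing the whole corpus
;; in one pass.
(= (index-chunk 2 ["I" "was" "glad" "I" "was" "there"])
   (parallel-build-index ["I" "was" "glad" "I" "was" "there"] 2 2))
;=> true
```

Since the partial indexes are plain immutable maps combined at the end, no ref or coordination is needed during the parallel phase at all, which may be the more Clojure-ish answer to the question above.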