One of the challenges of our exile from New York City has been coming to terms with the street grid (or lack thereof) in our new neighborhood. We’re basically one big cul-de-sac off Route 1. Furthermore, instead of a rational system of straight (and continuous) streets running along cardinal directions, we have something that resembles a model of the intestines, or a particularly complicated plumbing diagram. In short, we live in a labyrinth.
At breakfast this morning, I asked Elizabeth what her impressions of Rosecroft were. I hadn’t felt like my own thoughts, expressed in the previous post, were either perceptive or interesting. Not surprisingly, she had excellent points to make.
My wife’s uncle is in the DC area this weekend. He doesn’t travel much at all outside Florida, where he retired early some time ago. But he comes up to DC at least once a year to visit his longtime G.P. and to attend the horseraces at Rosecroft Raceway. Elizabeth pointed out we’ve moved to the only part of the country outside the Sunshine State’s Gulf Coast where we’re ever going to see Bill. Tonight we took advantage of that silver lining of our exile from New York City to meet Bill and a number of his friends at Rosecroft.
I’ve never been to a racetrack before. I expected some seediness. I anticipated there would be sad- and rough-looking men there. I was correct. But I soon appreciated how easy it was to forget about what other people looked like, indeed that they were even there, once I got my hands on the racing program and began pulling dollar bills out of my wallet and holding them folded in my fist as I pondered the advantages of trainer-driven trotters and looked for horses that had been given over-long odds. In short, I can see why people like betting on horses. If you’re perceptive, clever, and persistent enough, you ought to be able to make money just for sitting at a table or in a chair in front of a closed-circuit television.
Background: A Screen-Scraper
I’ve posted about how I needed to build a little app to scrape product pages for information that could then be summarized and presented in an easy-to-copy manner. I’ll write a little about that, and then I’ll get on to the related Ruby project I’m now undertaking.
So our ideal solution was a tool that would accept a lot of ID numbers, go out and retrieve the web pages, and use Regular Expressions to pick the treasure — product name and information — from the dross. I chose to treat the page’s source as one long string, not as an XML document. I had my reasons: the code our client produced was not (remotely) XHTML-compliant, and XML parsing threatened a lot of processor overhead, XPath, API investigation, lions and tigers and bears…. I chose Java, then my language of choice. I used HttpUnit with its built-in page-fetching abilities and, with JUnit, the ability to write assertions as I went to test things out. I built a primitive Swing/GUI with a nice table, and I found a helpful article by Ashok Banerjee and Jignesh Mehta with cut-and-paste code to enable copying results from the Swing table right into Excel.
The hardest part of the effort, by far, was getting the regexes to match the pathologically weird markup of the client’s page. There were divs with id attributes to use as signposts, but the product info was unstructured and full of whitespace. It took hours of trial and error until the happy conclusion. Some time later, the client tweaked their templates, and adjusting the tool required more regular expression tweaking. At that time, I thought, “Gosh, it would be great to have a way to just run a regex against the HTML at any URL to see whether it matched the right stuff.” That need began to itch even more after I got a call last month about the tool breaking again: another tweak to the client’s page template, and another couple of hours trying out regular expressions.
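To give a flavor of what those patterns had to cope with (this sketch is in Ruby rather than the Java the tool actually used, and the markup and div id are invented for illustration, not the client’s real template), a regex that uses an id attribute as a signpost and shrugs off the whitespace might look like this:
# Hypothetical, deliberately messy markup standing in for the client's page.
source = <<HTML
<div id="prodInfo">
      <span class=name>  Acme Widget, 3-pack </span>
   </div>
HTML
# MULTILINE lets . cross newlines; the lazy quantifiers (.*?) keep the match
# from running away to the end of the page.
pattern = Regexp.new('<div id="prodInfo">.*?<span[^>]*>\s*(.*?)\s*</span>', Regexp::MULTILINE)
if (m = pattern.match(source))
  puts m[1]   # prints: Acme Widget, 3-pack
end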
The Ruby Pattern Spy: Basics
So, I want a tool that lets me specify a regular expression and the URL of a page to examine. I want the tool to tell me whether the regex matches. I also need to be able to specify match groups and have the tool display those groups, to see whether the right stuff is being harvested from the page. Those are the basics.
It would be great to have other stuff in the tool, like the ability to save patterns, URLs, results, etc. I could see setting up regression tests of specific patterns to confirm that a target page was still matching those patterns. As a web tool, the pattern spy could report what people were looking for on what URLs. That’s for when I’m bigger than Google, though. Right now I’d like the basics.
So, how to screen scrape in Ruby? First, I consulted Google and found Scott Laird’s article on screen scraping using HTMLtools to grab the source code and output an REXML document.
Test First
I set to writing some code. I have a continuing interest in eXtreme Programming, as I’ve described in an essay on XP here on my site. I particularly like writing tests first. In fact, when I’m exploring new topics, it’s constructive to write tests just to learn how things work. Mike Clark wrote a great description of how he taught himself Ruby by writing an extensive test suite. This interrogative approach sounded good, so I set out to write tests that interacted with Ruby and its libraries before I even thought about classes.
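For instance, a learning test might do nothing more than pin down how Ruby’s regex options behave (this little test class is my own illustration, not one of Mike Clark’s):
require 'test/unit'

# A learning test: it asserts what I *think* the language does, and running
# it tells me whether I'm right.
class RegexpLearningTest < Test::Unit::TestCase
  def test_m_option_lets_dot_match_newlines
    assert_nil(/<b>(.*)<\/b>/.match("<b>line one\nline two</b>"))
    assert_not_nil(/<b>(.*)<\/b>/m.match("<b>line one\nline two</b>"))
  end
end
That /m option turns out to matter below, since real page source is full of newlines.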
For now, I was using my own website’s index as the target page.
require 'http-access2'     # provides HTTPAccess2::Client
require 'html/xmltree'     # htmltools, provides HTMLTree::XMLParser
require 'rexml/document'

# Fetch the page, parse the (non-XHTML) source into an REXML document, and pull the title out with XPath.
client = HTTPAccess2::Client.new
url = "http://www.michaelharrison.ws"
parser = HTMLTree::XMLParser.new(false, true)
parser.feed(client.get_content(url))
doc = parser.document
title = REXML::XPath.match(doc, "html/head/title[1]/text()")[0].to_s
assert_equal("Michael Harrison: www.michaelharrison.ws", title)
OK. This works pretty well, but it’s leading back to XPath and potentially lots of problems with noncompliant source, unless I dump out the HTMLTree to a string. But that seemed to me like a complicated way to get a string: using this souped-up XML parser to parse and then reserialize the source code. Inelegant. So I shelved it. It would probably be great, especially as we move into the future age of XHTML compliance (You’re only a day away), to be able to specify XPath searches as well as regexes. I added that to the extras list. We’ll come back to it, I promise.
Now, to find something simpler. Pulling out Ruby in a Nutshell, I peruse the standard library, and there’s Net::HTTP. Much better for me. Now I can use regexes, like so:
require 'net/http'
require 'test/unit'
...
# Grab the page over plain HTTP and run a multiline regex against the raw source.
pattern = "<title>(.*)</title>"
url_host = "www.michaelharrison.ws"
url_path = "/index.html"
new_regex = Regexp.new(pattern, Regexp::MULTILINE)
h = Net::HTTP.new(url_host)
# The two-value assignment relies on the old (1.8-era) Net::HTTP API; on newer
# Rubies, use resp = h.get(url_path) and then resp.body instead.
resp, source_code = h.get(url_path)
matcher = new_regex.match(source_code)
assert_not_nil(matcher)
title = matcher[1]
assert_not_nil(title, "--> variable 'title' is nil --")
assert_equal("Michael Harrison: www.michaelharrison.ws", title)
Next Time: Getting Minimally Structured
Fun stuff, given that these product pages had lots of widgets on them and took seconds to load. There were often 100-200 products in a promotion, so repetitive browser location bar manipulation and copy-and-paste into Excel was time-consuming. It gets better, too: the client didn’t have up-to-date product info either; we had to look that up on these pages and harvest it too, for the promotion pages we were developing. In all, it could take an hour just to hunt this info down, and with several promos coming in each week during busy times, this ate up a lot of time I should have been spending looming over the developers’ shoulders (i.e., managing).
In the old days, we would have asked the client for database access, so we could hit their product table(s) for the information. After all, the information on the web pages is centrally managed in the DB. But in the real world, a corporate client simply doesn’t do that. It takes 4 weeks to find out who manages that database, another 2 weeks to set up a conference call, and then legal or the CTO issues a refusal to grant access to outsiders. Today, however, Internet technology allows us to treat websites as data stores, and seemingly decentralized information can be wrangled (or “aggregated” if you want to sound serious), even if there’s no single database, or single owner behind all of it.
To make a long story short (and I admit I am overfond of typing away), I built a Java app that went out and grabbed the products’ pages, used Regular Expressions to identify and capture key information, and displayed the info in a table for easy copy and paste. I won’t describe the Java app in detail here. What’s important about this little project is that I came to wish it were easier to identify the right regexes: there was a lot of trial and error, and I had to rerun the application each time to see whether it would return good results or garbage. I wanted to be able to test a regex instantly.
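That kind of instant check is easy enough to fake by hand against a saved copy of one product page; in Ruby the trial-and-error loop might look like this (the filename and candidate patterns are placeholders, not the client’s real markup):
# Try several candidate patterns against a locally saved page and see which
# ones capture anything sensible.
source = File.read("product_page.html")
candidates = [
  '<div id="prodInfo">\s*(.+?)\s*</div>',
  '<div id="prodInfo">.*?<span[^>]*>\s*(.+?)\s*</span>'
]
candidates.each do |pat|
  m = Regexp.new(pat, Regexp::MULTILINE).match(source)
  puts "#{pat} => #{m ? m[1].inspect : 'no match'}"
end
Point the same loop at a live URL instead of a saved file, and that’s more or less the tool I have in mind.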
So, now, it’s time to scratch the itch: time to build a tool to run regexes against a URL. And I’m going to set out to do it with Ruby, which is impressing me every time I use it. I’m going to post here about interesting or frustrating things I find as I work. The project-in-progress itself will be described on separate pages.