Background: A Screen-Scraper
I’ve posted about how I needed to build a little app to scrape product pages for information that could then be summarized and presented in an easy-to-copy manner. I’ll write a little about that, and then I’ll get on to the related Ruby project I’m now undertaking.
So our ideal solution was a tool that would accept a batch of ID numbers, go out and retrieve the web pages, and use regular expressions to pick the treasure (product name and information) from the dross. I chose to treat the page’s source as one long string, not as an XML document. I had my reasons: the code our client produced was not (remotely) XHTML-compliant, and XML parsing threatened a lot of processor overhead, XPath, API investigation, lions and tigers and bears…. I chose Java, then my language of choice. I used HttpUnit for its built-in page-fetching abilities and, with JUnit, the ability to write assertions as I went to test things out. I built a primitive Swing GUI with a nice table, and I found a helpful article by Ashok Banerjee and Jignesh Mehta with cut-and-paste code for copying results from the Swing table right into Excel.
The hardest part of the effort, by far, was getting the regexes to match the pathologically weird markup of the client’s page. There were divs with id attributes to use as signposts, but the product info was unstructured and full of whitespace. It took hours of trial and error to reach the happy conclusion. Some time later, the client tweaked their templates, and adjusting the tool required more regular expression fiddling. At that point I thought, “Gosh, it would be great to have a way to just run a regex against the HTML at any URL to see whether it matched the right stuff.” That need began to itch even more after I got a call last month about the tool breaking again: another tweak to the client’s page template, and another couple of hours trying out regular expressions.
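To give a flavor of the fight, here’s a contrived example in Ruby (the markup and pattern are my own inventions for illustration, not the client’s actual page): an id attribute serves as a signpost, and the product info inside is unstructured and riddled with whitespace.
# Hypothetical markup loosely imitating the client's pages
source = <<HTML
<div id="prodInfo">
    <b>  Widget Deluxe  </b><br>
  SKU:
     12345
</div>
HTML

# MULTILINE lets . match newlines; the two groups grab name and SKU
pattern = Regexp.new('<div id="prodInfo">\s*<b>\s*(.*?)\s*</b>.*?SKU:\s*(\d+)',
                     Regexp::MULTILINE)
match = pattern.match(source)
puts match[1]   #=> Widget Deluxe
puts match[2]   #=> 12345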
The Ruby Pattern Spy: Basics
So, I want a tool that lets me specify a regular expression and a URL to a page to examine. I want the tool to tell me whether the regex matches. I also need to be able to specify match groups and have the tool display those groups to see whether the right stuff is being harvested from the page. Those are the basics.
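Stripped to essentials, the core behavior might look something like this sketch (the method name and output format are placeholders of mine, and I’m leaving the page-fetching part aside for the moment):
# Run a pattern against page source and report the match groups.
# Provisional names; a sketch, not final code.
def report_match(source, pattern)
  match = Regexp.new(pattern, Regexp::MULTILINE).match(source)
  if match
    puts "Matched."
    match.captures.each_with_index { |g, i| puts "group #{i + 1}: #{g}" }
  else
    puts "No match."
  end
end

report_match("<html><title>A Page</title></html>", "<title>(.*)</title>")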
It would be great to have other stuff in the tool, like the ability to save patterns, URLs, results, etc. I could see setting up regression tests of specific patterns to confirm that a target page was still matching those patterns. As a web tool, the pattern spy could report what people were looking for on what URLs. That’s for when I’m bigger than Google, though. Right now I’d like the basics.
So, how to screen scrape in Ruby? First, I consulted Google and found Scott Laird’s article on screen scraping, which uses HTMLtools to parse fetched source code into an REXML document.
Test First
I set to writing some code. I have a continuing interest in eXtreme Programming, as I’ve described in an essay on XP here on my site. I particularly like writing tests first. In fact, when I’m exploring new topics, it’s constructive to write tests just to learn how things work. Mike Clark wrote a great description of how he taught himself Ruby by writing an extensive test suite. This interrogative approach sounded good, so I set out to write tests that interacted with Ruby and its libraries before I even thought about classes.
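To illustrate the idea with a trivial example of my own (not one of Mike’s): a learning test asserts what you believe the language or a library does, so the suite becomes a record of your exploration.
require 'test/unit'

# A learning test: it exercises Ruby's regex machinery rather than
# any code of mine, just to pin down how match groups behave
class RegexpLearningTest < Test::Unit::TestCase
  def test_match_groups
    match = /(\w+), (\w+)/.match("Harrison, Michael")
    assert_equal("Harrison", match[1])
    assert_equal("Michael", match[2])
  end
end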
For now, I was using my own website’s index as the target page.
require 'http-access2'   # the http-access2 library (HTTPAccess2::Client)
require 'html/xmltree'   # htmltools, home of HTMLTree::XMLParser
require 'rexml/document'

# Fetch the page and parse its decidedly non-strict HTML into an REXML document
client = HTTPAccess2::Client.new
url = "http://www.michaelharrison.ws"
parser = HTMLTree::XMLParser.new(false, true)
parser.feed(client.get_content(url))
doc = parser.document

# Pull the title text out with XPath (the assertion runs inside a Test::Unit case)
title = REXML::XPath.match(doc, "html/head/title[1]/text()")[0].to_s
assert_equal("Michael Harrison: www.michaelharrison.ws", title)
OK. This works pretty well, but it’s leading back to XPath and potentially lots of problems with noncompliant source, unless I dump the HTMLTree out to a string. But that seemed to me like a complicated way to get a string: using this souped-up XML parser to parse and then reserialize the source code. Inelegant. So I shelved it. It would probably be great, especially as we move into the future age of XHTML compliance (You’re only a day away), to be able to specify XPath searches as well as regexes. I added that to the extras list. We’ll come back to it, I promise.
Now, to find something simpler. Pulling out Ruby in a Nutshell, I peruse the standard library, and there’s Net::HTTP. Much better for me. Now I can use regexes, like so:
require 'net/http'
require 'test/unit'
...
# The pattern's one group should capture the page title
pattern = "<title>(.*)</title>"
url_host = "www.michaelharrison.ws"
url_path = "/index.html"
# MULTILINE lets the pattern match across line breaks in the source
new_regex = Regexp.new(pattern, Regexp::MULTILINE)

# Fetch the page; get returns the response and the body as a string
h = Net::HTTP.new(url_host)
resp, source_code = h.get(url_path)

# Assert the pattern matched and that its group captured the expected title
matcher = new_regex.match(source_code)
assert_not_nil(matcher)
title = matcher[1]
assert_not_nil(title, "--> variable 'title' is nil --")
assert_equal("Michael Harrison: www.michaelharrison.ws", title)
Next Time: Getting Minimally Structured