Using Ruby for Screen-Scraping and Pattern-Matching

In my last full-time gig, I automated a recurring management task: looking up the “official” names of a client’s products when said products were included in a promotion. This client was actually a holding company — it didn’t own or manage most of its products — and the actual corporate entities that did manage the products liked to change the names of the products from time to time. In other words, it was a decentralized system. At my office, we only had product IDs, and we were told the only way to find out the proper name of the product was to go to the client’s website and look up the product by that ID: www.thiscompany.com/a/long/path/to/productpage?productID=xxxx

Fun stuff, given that these product pages were loaded with widgets and took seconds to load. There were often 100-200 products in a promotion, so the repetitive routine of browser location-bar manipulation and copy-and-paste into Excel was time-consuming. It gets better, too: the client didn't have up-to-date product info either, so we had to look that up on these pages and harvest it as well, for the promotion pages we were developing. In all, it could take an hour just to hunt this info down, and with several promos coming in each week during busy times, this ate up a lot of time I should have been spending looming over the developers' shoulders (i.e., managing).

In the old days, we would have asked the client for database access, so we could hit their product table(s) for the information. After all, the information on the web pages is centrally managed in the DB. But in the real world, a corporate client simply doesn’t do that. It takes 4 weeks to find out who manages that database, another 2 weeks to set up a conference call, and then legal or the CTO issues a refusal to grant access to outsiders. Today, however, Internet technology allows us to treat websites as data stores, and seemingly decentralized information can be wrangled (or “aggregated” if you want to sound serious), even if there’s no single database, or single owner behind all of it.

To make a long story short (and I admit I am overfond of typing away), I built a Java app that went out and grabbed the products' pages, used regular expressions to identify and capture the key information, and displayed it in a table for easy copy-and-paste. I won't describe the Java app in detail here. What's important about this little project is that I came to wish it were easier to identify the right regexes: there was a lot of trial and error, and I had to rerun the application each time to see whether it would return good results or garbage. I wanted to be able to test a regex instantly.
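The fetch-and-match core of that app can be sketched in a few lines of Ruby. Everything below is illustrative: the URL follows the client's pattern from above, and the `<h1 class="product-name">` markup is invented, standing in for whatever structure the real pages actually had.

```ruby
require 'net/http'
require 'uri'

# Extract the product name from a page's HTML.
# The <h1 class="product-name"> markup is hypothetical -- on real pages,
# finding the right pattern takes trial and error.
def extract_product_name(html)
  match = html.match(%r{<h1 class="product-name">\s*(.+?)\s*</h1>}m)
  match && match[1]
end

# Fetch a product page by its ID and scrape the name out of it.
def product_name(product_id)
  uri = URI("http://www.thiscompany.com/a/long/path/to/productpage?productID=#{product_id}")
  extract_product_name(Net::HTTP.get(uri))
end
```

The app itself just looped this over a list of product IDs and printed the results as rows, but the regex in `extract_product_name` is where all the fragility lives.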

So, now, it’s time to scratch the itch: time to build a tool to run regexes against a URL. And I’m going to do it with Ruby, which impresses me every time I use it. I’m going to post here about interesting or frustrating things I find as I work. The project-in-progress itself will be described on separate pages.