
Of Halo2 and SCRAPIs

So I spent the better part of last night putting together a little AppleScript. The intent is to scrape a given Halo 2 player's stats page on Bungie.net for their 'Last Active' time (the date and time they were last on Xbox Live, which is useful for knowing when to jump online and catch someone for some party play).

[Image: a Bungie stats page displaying the Last Active time]

A little curl-ing, a little grep-ing and a little AppleScript date manipulation (which is weak in AppleScript, btw - no timezones!) and I had it working all right. I'd planned on polishing it up tonight -- Bungie returns all times in Pacific Standard Time, which annoys me, here in the heart of Ohio.

So I'm tying in a quick side trip to the Current Time web service at xmlrpc.com, then doing a comparison so I can give an 'x hours ago' or 'n minutes ago' estimate. (Why the query to a time server? 'Cause I want to make the script available for download, and I don't want to bother asking people to choose a timezone to do the date/time comparisons.)
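For the curious, the 'x hours ago' math can be sketched in a few lines of Python (not the AppleScript itself, just an illustration). The assumptions are loud ones: Pacific Standard Time is treated as a fixed UTC-8 offset, the scraped date format is made up, the helper name time_ago is hypothetical, and the machine's own UTC clock stands in for the web time service.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the site reports Pacific Standard Time as a fixed UTC-8 offset
# (daylight saving is ignored in this sketch).
PST = timezone(timedelta(hours=-8), "PST")


def time_ago(last_active: datetime, now: datetime) -> str:
    """Describe the gap between two timezone-aware datetimes."""
    minutes = int((now - last_active).total_seconds()) // 60
    if minutes < 0:
        return "in the future?"
    if minutes < 60:
        count, unit = minutes, "minute"
    else:
        count, unit = minutes // 60, "hour"
    suffix = "" if count == 1 else "s"
    return f"{count} {unit}{suffix} ago"


# Hypothetical scraped value; the format string is an assumption too.
scraped = "03/14/2005 6:45 PM"
last_active = datetime.strptime(scraped, "%m/%d/%Y %I:%M %p").replace(tzinfo=PST)

# Real use would call datetime.now(timezone.utc); a fixed value keeps the example repeatable.
now = datetime(2005, 3, 15, 5, 15, tzinfo=timezone.utc)
print(time_ago(last_active, now))
```

Because both values carry their own offset, the subtraction comes out the same wherever the script runs, so nobody has to pick a timezone.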

Anyway, I am digressing...

So last night the script was working fine. Tonight, however, my text munging was coming back all wrong. At first, I suspected that Bungie had made a configuration change to their servers. For a couple of minutes earlier tonight, hitting the stats pages unauthenticated (that is, without signing in via .NET Passport) was yielding a different display (it was omitting Clan Name and Last Active) than when you were signed in.

I actually got a little paranoid: did someone at Bungie.net see my curl calls in the logfiles, and suspect some scraping? Did they password-protect the Last Active date to discourage me? Puh-lease.. I made like 30 attempts last night - barely a blip on their radar, I'm sure. And plenty of other folks are scraping. (On a far larger scale, I'm sure.)

I was already researching ways to get around the authentication problem (wget has cookie- and browser-auth switches, right?) when the problem stopped and the non-auth and auth versions of the page synced up again. I think what happened is that they actually changed the displayed information for the Last Active date, and it caused a temporary behavioral blip (probably while the changes propagated out to various servers).

And... sure enough.. they now display the timezone, PST, alongside the date. (Probably after getting a couple of 'why is this time so wrong?' complaints.) And my crude data parsing methods actually involve finding the end of that line and counting backwards to extract the pieces I need. (Hacky, I know, but my perl chops are nonexistent and I just wanted to cobble this thing together in a night. So there!)
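To make the breakage concrete, here's a rough Python sketch of why counting back from the end of the line falls over when a timezone gets appended, next to a pattern-based version that shrugs it off. The two sample lines are hypothetical (the real page markup isn't reproduced here), and the function names are made up for illustration.

```python
import re

# Hypothetical lines as grep might return them, before and after the site change.
old_line = "Last Active: 03/14/2005 6:45 PM"
new_line = "Last Active: 03/14/2005 6:45 PM PST"


def by_counting_back(line: str) -> str:
    """Fragile: assumes the line always ends with the two-letter AM/PM marker."""
    return line[-2:]


# Date, time, AM/PM marker, then an optional trailing zone abbreviation.
STAMP = re.compile(
    r"(\d{1,2}/\d{1,2}/\d{4})\s+(\d{1,2}:\d{2})\s*([AP]M)(?:\s+([A-Z]{2,4}))?"
)


def by_pattern(line: str):
    """Sturdier: match the pieces by shape instead of by position."""
    match = STAMP.search(line)
    return match.groups() if match else None


print(by_counting_back(old_line), by_counting_back(new_line))
print(by_pattern(old_line))
print(by_pattern(new_line))
```

The pattern version still breaks if the page changes its date format outright, so it only softens the problem; it doesn't make scraping safe.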

Which all goes to point up the well-known danger of scraping data from a website for use by some format-sensitive script: it's a website. It will change frequently, and frequently without any warning. (And, to be perfectly honest, it seems to be against the site's Terms of Service. Read that bit about 'Personal and Non-Commercial Use'.)

Now Bungie does provide RSS feeds of player stats. (Which is cool.) All you'd have to do is look at the pubDate of the first entry (chronologically, the most recent), and you're good as gold. (btw, I'll point out that all dates in the RSS file are GMT which, while no better for me here in Columbus, at least doesn't come off as quite so Redmond-centric as Pacific time.)
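Reading that date out of a feed might look something like this in Python. It's a sketch under assumptions: the sample feed below is invented and trimmed down to the bare RSS 2.0 structure, the function name is hypothetical, and a real script would fetch the XML over HTTP first.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical feed contents; only the item/pubDate structure matters here.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example player stats</title>
    <item>
      <title>Most recent game</title>
      <pubDate>Tue, 15 Mar 2005 02:45:00 GMT</pubDate>
    </item>
    <item>
      <title>Earlier game</title>
      <pubDate>Mon, 14 Mar 2005 23:10:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def last_active_from_rss(xml_text: str):
    """Return the first item's pubDate as an aware datetime, or None if absent."""
    root = ET.fromstring(xml_text)
    first_item = root.find("./channel/item")
    if first_item is None:
        return None
    stamp = first_item.findtext("pubDate")
    # RSS dates use the RFC 822 style, which the email module already parses.
    return parsedate_to_datetime(stamp) if stamp else None


print(last_active_from_rss(SAMPLE_FEED))
```

This leans on the feed listing the newest entry first, which is how the entries are described above; if that ordering ever changed, taking the maximum pubDate across all items would be the safer bet.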


The added benefits of getting it from the RSS? You can be sure the format won't change quite so arbitrarily (no presentation dust-ups every couple weeks) and I wouldn't feel quite so ... dirty, getting the data this way. I can't say that the TOS necessarily encourages data manipulation of RSS vs. the HTML way, but one would think that Bungie reasonably expects people to take the RSS and have fun with it. I suspect that the data-scrapers (like, the serious ones that are building full-on stats-package websites for Halo 2) give them headaches. They'd certainly explain why the site feels so damn sluggish at times. I've also read that those scrapers don't bother with the RSS, 'cause it doesn't include as much raw data as you can get from the scraped site.

So I'd like to get the Last Active time via the RSS feed, but I'll be danged if I can figure out how to derive a given player's RSS URL from nothing more than their gamertag. Has anyone out there found a way to do this? (It's probably some brain-dead simple scheme, but remember - I wanted to do this in an hour!!)

Okay, I've probably taken more time writing this entry than I wanted to spend on the script itself. Carry on... (Oh, and I'll post up that AppleScript once I've got the kinks worked out.)

Comments (1)


As the human behind that blurred-out GamerTag above ... I so want this script.



This page contains a single entry from the blog posted on March 14, 2005 10:19 PM.



This weblog is licensed under a Creative Commons License.