Monday, September 6, 2010

A Buzz network crawler

In the last few hours before summer unofficially ends, I've updated the Buzz network crawler that I first released back at the end of May. The crawler is written in Python and follows symmetric ties on Buzz networks to an arbitrary depth, using Google's Buzz REST API.

The new code is a major update with the following features:

  • The user may specify the initial seed network of Buzz users. Typically, this seed network will be some group of people who are socially connected; in my case it has always been students in a class, but it can be anyone. Members of the seed network may be specified by email address or numeric ID, and both forms are shown in the examples.
  • The user may specify an arbitrary crawl depth beyond the initial seed. For instance, with a crawl depth of 2, the crawler will crawl the initial seed group, their connections, and the connections of those connections. Note that crawl depth appears to increase run time exponentially: the depth-2 crawl behind the results described here and here took approximately 22 hours and encompassed 10,000 network members.
  • The crawler uses a memoization pattern to avoid repeating expensive Buzz API calls for network members with multiple inbound connections from other members (which is nearly all of them). The first sketch after this list shows how this combines with the crawl loop.
  • The crawler accounts for and gracefully handles known deficiencies in the Buzz REST JSON API (see the second sketch after this list). Specifically,
    • The crawler can process non-standard control characters embedded in JSON response documents.
    • The crawler recovers gracefully from the 503 errors occasionally returned by the API when a user follows a large number of people.
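
To make the crawl-loop and memoization bullets concrete, here is a minimal sketch of how the two ideas fit together. This is not the released code: the endpoint URL, the response key path, and the function names are all my assumptions about the v1 API.

    import json
    import urllib2

    # Assumed shape of the v1 "who does this user follow" endpoint; check the
    # Buzz API documentation before relying on it.
    FOLLOWING_URL = "https://www.googleapis.com/buzz/v1/people/%s/@groups/@following?alt=json"

    _follow_cache = {}  # memo: user id -> set of ids that user follows

    def following(user_id):
        """Return the ids a user follows, hitting the API at most once per user."""
        if user_id not in _follow_cache:
            doc = json.load(urllib2.urlopen(FOLLOWING_URL % user_id))
            entries = doc.get("data", {}).get("entry", [])  # assumed response layout
            _follow_cache[user_id] = set(e["id"] for e in entries)
        return _follow_cache[user_id]

    def crawl(seed_ids, depth):
        """Breadth-first walk of symmetric (mutual-follow) ties out to `depth`."""
        network = set(seed_ids)
        frontier = set(seed_ids)
        for _ in range(depth):
            next_frontier = set()
            for uid in frontier:
                for other in following(uid):
                    # Keep the tie only if it is reciprocated and not yet seen.
                    if other not in network and uid in following(other):
                        network.add(other)
                        next_frontier.add(other)
            frontier = next_frontier
        return network

Each member is fetched at most once no matter how many inbound ties point at it, which keeps the exponential frontier growth from being compounded by duplicate API calls.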
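The two API workarounds can both live in a single fetch helper. The sketch below shows one plausible shape for it, again with assumed names: it scrubs control characters that are never legal in JSON (everything below 0x20 except tab, newline, and carriage return) before parsing, and sleeps through transient 503s.

    import re
    import time
    import urllib2

    # Control characters other than \t, \n, and \r are illegal anywhere in a
    # JSON document, so they can be stripped wholesale before parsing.
    _BAD_CTRL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

    def fetch_json_text(url, retries=5, backoff_seconds=30):
        """GET `url`, scrub stray control characters, and retry on 503 errors."""
        for attempt in range(retries):
            try:
                raw = urllib2.urlopen(url).read()
                return _BAD_CTRL.sub("", raw)
            except urllib2.HTTPError as e:
                if e.code == 503 and attempt + 1 < retries:
                    time.sleep(backoff_seconds)  # transient server error; wait it out
                else:
                    raise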

Known Limitations

The code should be considered alpha quality. As I was writing it, I had to simultaneously:

  • Figure out the problem(s) I was actually solving. This should be apparent to anyone comparing the capabilities of the initial release with this one.
  • Learn new programming constructs in Python. Not least of these was discovering an effective mechanism for managing key-object databases (see the sketch after this list).
  • Develop an effective strategy for reducing run time to manageable levels.
  • Develop a strategy that was robust against API errors. This was critical, as crawls of 24 hours or more are sometimes required.
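
The post doesn't say which key-object mechanism won out; Python's standard shelve module is one natural candidate for keeping a dict-like cache on disk across a long run, and the sketch below assumes it. The file name and helper are illustrative only.

    import shelve

    def cached_following(user_id, fetch):
        """Return a user's follow list from an on-disk shelf, fetching on a miss.

        `fetch` is any callable mapping an id to a follow list.
        """
        shelf = shelve.open("follow_cache")
        try:
            key = str(user_id)  # shelve keys must be strings
            if key not in shelf:
                shelf[key] = fetch(user_id)
            return shelf[key]
        finally:
            shelf.close()

Because the shelf persists between processes, a crawl that dies twenty hours in can be restarted without re-fetching every member already seen, which is exactly the robustness the last bullet calls for.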

Documentation is minimal and contained in the README files found in the code download.

This post would be incomplete without thanks to the many Google employees who helped me in the Buzz Google Group.
