Escaped Thoughts

Sat, May 31, 2003

Google is Watching

I was going to set up a Google News scraper for my personal start page, so that I could get just the headlines, and just the topics that interest me (the sports section is a waste of screen space for me). I ran into a strange 403 Forbidden when accessing it from Python, though; it turns out that Google blocks requests based on their User-Agent header, to discourage exactly what I was trying to do.

It took about 30 seconds to find code to set the User-Agent in Python's urllib module... Ironically, I found it using Google itself. However, after reading about how Google bans IPs that are caught doing any kind of automated queries, I'm afraid to use my workaround. Assuming they are using some sort of semi-decent AI to look for automation, regenerating my start page every hour of every day is likely to be noticed, and I would be lynched if Google banned CWRU. So I guess that will have to wait until I have a server where I can build the pages dynamically on request, instead of at set intervals. It's irritating, because this is definitely 'fair use' (it's entirely for my own use), but I guess I understand the need for something like this.
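
For reference, here is a minimal sketch of that kind of User-Agent override, written against the Python 3 standard library (urllib.request) rather than the 2003-era urllib, so it is not the code I actually found. The URL, the User-Agent string, and the helper names build_request and fetch are all placeholders made up for illustration.

```python
import urllib.request

# Placeholder values, assumed purely for illustration.
NEWS_URL = "http://news.google.com/"
USER_AGENT = "Mozilla/5.0 (personal start page)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request whose User-Agent replaces urllib's default one."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT, timeout=10):
    """Fetch the raw bytes of a page using the overridden User-Agent."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling fetch(NEWS_URL) would send the request with the custom header; whether the server accepts it is entirely up to the server, and it does nothing about the IP-ban worry above.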

The scary thing is, it seems like there is tremendous capacity for abuse in this draconian enforcement. What's to stop me from picking a victim IP address or domain, and sending tons of automated queries with a spoofed IP address? Bam, instant Google ban, quite possibly for the whole domain. In addition to random acts of evil, this could have really nasty corporate sabotage implications:

  1. Open a local ISP
  2. Pick a competing ISP, and ensure that Google bans their entire subnet
  3. Profit! Watch the customers leave the ISP for one where they can actually perform Google searches

I'm beginning to understand why some people fear Google's power.
