Webbots, Spiders, and Screen Scrapers

Wed, May 2, 2007

ArsGeek, Book, Coding, Linux, Reviews


Title: Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL
Author: Michael Schrenk
ISBN10: 1593271204
ISBN13: 978-1593271206
Publisher: No Starch Press
Cost: $39.99
Format: Paperback, 328 pages
Published: 2007

Webbots, Spiders, and Screen Scrapers contains everything you need to start writing your own internet agents for a variety of tasks: data aggregation, image capture, link verification and a number of other applications. It provides very helpful sample code for a reader to build on, along with some real-world experience and advice.

The book is divided into four parts comprising 28 chapters, followed by three appendices and an index.

The first part, Fundamental Concepts and Techniques, is a guide to just what, exactly, these scripts are meant to accomplish and the basics behind making them successful. It gives a rationale for creating them and covers techniques for handling such things as authentication and the management of the large amounts of data they can aggregate.

Part two, Projects, covers a number of interesting real-world projects, from basic concepts through code snippets which, when put together, create a usable webbot. Projects include price-monitoring scripts, link verification, search ranking and email-reading bots.
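To give a flavor of what a link-verification bot does, here is a minimal sketch. It is written in Python using only the standard library rather than the PHP/CURL used in the book, and it is not taken from the book's code. The names LinkCollector, extract_links, http_status and find_broken_links are made up for this illustration, as is the User-Agent string.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                # Only web links can be verified over HTTP; skip mailto: and the like.
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the absolute http(s) links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = Request(url, method="HEAD",
                      headers={"User-Agent": "link-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code
    except URLError:
        return None


def find_broken_links(html, base_url, status_of=http_status):
    """Return (link, status) pairs for links that fail or answer with 4xx/5xx."""
    broken = []
    for link in extract_links(html, base_url):
        status = status_of(link)
        if status is None or status >= 400:
            broken.append((link, status))
    return broken
```

Passing the status lookup in as status_of keeps the parsing logic testable without touching the network. A real bot would also want rate limiting and a fallback for servers that reject HEAD requests.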

Part three, Advanced Technical Considerations, brings us into the world of spiders, as well as procurement and sniper bots, authentication, cookie management and scheduling your bots and spiders to run optimally (that is, to mimic a human browser as closely as possible).

Part four, Larger Considerations, addresses some of the issues that surround writing and using these various bots, spiders and scrapers: stealth, fault tolerance, the creation of websites that are friendly to webbots, killing spiders (yours or others’) and keeping yourself and your creations out of trouble.
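One common convention for staying out of trouble is to check a site's robots.txt file before fetching a page. The review does not say how the book treats this, so take the following as an illustrative sketch only, in Python rather than the book's PHP/CURL. The helper names build_rules and may_fetch, the bot name and the sample rules are all invented for the example; the parsing itself uses the standard library's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot name used only for this sketch.
BOT_NAME = "sketch-bot"


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(rules, url, user_agent=BOT_NAME):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return rules.can_fetch(user_agent, url)


# Sample rules a site owner might publish (made up for illustration).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = build_rules(SAMPLE_ROBOTS_TXT)
print(may_fetch(rules, "http://example.com/index.html"))
print(may_fetch(rules, "http://example.com/private/report.html"))
```

In a live bot the rules would be downloaded from the target site first. Parsing from a string here keeps the sketch runnable offline.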

There’s a lot of practical knowledge included in this book by way of experience, example and actual code. Knowing at least the basics of PHP and CURL will allow you to get the most out of it. There’s another practical reason for reading this book, which has less to do with creating these tools: it gives you an understanding of how they work and what they’re actually doing. As a webmaster whose site is constantly being crawled, spidered and aggregated by various bots, I found this book fascinating.

It’s important to note that a portion of this book is given over to the discussion of usability and legality. Not everything you could do with this knowledge is legally acceptable or morally right. Of course, the same holds true for any book that imparts knowledge about writing code that can manipulate other people’s data: the onus is on the individual to do the right thing. Schrenk provides a number of real-life examples of why this is important, including one embarrassing incident involving being banned from sites by the government for overzealous use of a bot. While the story is amusing (and told well), it also serves as a warning to always consider what you’re doing with your creations and how they are affecting other people.

There’s lots of practical code to be found here, which makes the book worthwhile for anyone wanting to learn more about how to construct these scripts. If you’re familiar with PHP but haven’t developed many webbots or spiders, a reading of the code will give you some new insight and certainly provide some new and useful techniques.

It’s clear that many useful things can come out of developing your own bots. Even if you or your employer haven’t seriously considered them as tools in the past, it’s time to take a look at them now. They’re relatively lightweight, yet they can be very powerful tools for monitoring the web and gathering data which would otherwise take hundreds or thousands of man-hours to collect, or which may not even have been available to you before.

Webbots, Spiders, and Screen Scrapers will give you a different view of the Internet, and we all know that looking at a tried-and-true technology in a new light often leads to incredibly useful innovations. This book is a great tool for getting your mind wrapped around the sheer amount of data that exists out there and how best to aggregate, manipulate and parse that data for your needs. Sound interesting? Then what are you waiting for?


This post was written by:

arsgeek - who has written 1989 posts on ArsGeek.

