Mining the web reading notes crawler

Written by Keith McDonnell. Last updated on Saturday, March 03, 2012.

= Questions

Is erlang a good choice for web crawling?
- yes for a scalable server
- might be too much to learn a new language & IR concepts

What is a good starter project?
- benchmark a crawler in ruby

Should I build a scalable crawler or proof of concept?
- scalable proof of concept (all sections displayed in fig 2.2 pp 21)

Are there ruby libraries for:
- non blocking IO
- DNS fetching
- Canonical URLS
- wrapper for libwww C library

Should I build a basic crawler first (see Mercurio Fernandez’s eigenclass crawler,
or check github).

= Read

Simple crawler

The central function of a crawler is to fetch many pages at the same time, in
order to overlap the delays involved in
1. Resolving the hostname in the URL to an IP address using DNS
2. Connecting a socket to the server and sending the request
3. Receiving the requested page in response

Find anatomy of a large-scale crawler diagram

Because a single page fetch may involve several seconds of network latency, it
is essential to fetch many pages (typically hundreds to thousands) at the same
time to utilize the network bandwidth available.

Many simultaneous fetches are possible only if the DNS lookup is streamlined
to be highly concurrent, possibly replicated on a few DNS servers.

Multiprocessing or multithreading provided by the operating system is not the
best way to manage the multiple fetches owing to high overheads. The best
bet is to explicitly encode the state of a fetch context in a data structure and
use asynchronous sockets, which do not block the process/thread using it, but
can be polled to check for completion of network transfers.

Care is needed to extract URLs and eliminate duplicates to reduce redundant
fetches and to avoid “spider traps”—hyperlink graphs constructed carelessly
or malevolently to keep a crawler trapped in that graph, fetching what can
potentially be an infinite set of “fake” URLs.

a canonical URL is formed by the following steps:
1. A standard string is used for the protocol (most browsers tolerate Http, which
should be converted to lowercase, for example).
2. The hostname is canonicalized as mentioned above.
3. An explicit port number is added if necessary.
4. The path is normalized and cleaned up, for example, /books/../papers
/ simplifies to /papers/

It is easiest to start building a crawler with a core whose only responsibility is to
copy bytes from network sockets to storage media: this is the Crawler class. The
Crawler’s contract with the user is expressed in these methods:

class Crawler {

Implementation of the Crawler class.
We will need two helper classes called DNS and Fetch. Crawler is started with a fixed set of DNS servers.
For each server, a DNS object is created.

Each DNS object creates a UDP socket with its assigned DNS server as the destination.
The most important data structure included in a DNS object is a list of Fetch
contexts waiting for the corresponding DNS server to respond

class DNS {


while event loop has not been stopped do

If you'd like to discuss this article, you can send me an email and/or publish an article online and link back to this page.