When starting the design/coding of a new search engine crawler (a challenging task -- read What I've learned from writing a large scale search engine), input from people who have been there and done that can be quite helpful. Of course there is the original Google paper, plus some other material I'd like to link here. Feel free to point me to more sources so we can keep this post growing as a WIP page of helpful design input.
- Implement support for newer metadata standards: sitemap.xml (official whitepaper) and/or OAI-PMH
- Decide if you want web wide crawling or focused crawling
- Consider DHTs (distributed hash tables) for crawl queue handling
- Think in chunks of data/URLs ("buckets")
- Never try to keep everything in memory at once
- Hash over a number of machines, filter sets and distribute further handling
- Consider a distributed, fault tolerant filesystem for data storage, like Hadoop's HDFS, a kind-of GoogleFS clone
- Invent a system to identify seen/visited URLs and todo-URLs
- Get familiar with the MapReduce way (illustration) of handling large data sets.
- Forget about that idea to store URLs one after another into an SQL database!
- Use a cluster of machines, also see Grub etc. below
- Cache DNS requests
- Solve the problem of crawling (thus writing to the index) while the searcher is reading/accessing the index
- Obey robots.txt, and cache a copy for each host to limit requests
- Distribute load across hosts, so you won't issue too many simultaneous requests against a single host (effectively becoming a DoS attacker)
- Implement a system to revisit and re-fetch a certain URL after a certain time
- Implement a way of identifying duplicate content, either at crawl time, at search time, or both. See perlmonks on Levenshtein, Wagner-Fischer and the like. Also weed out duplicate resources caused by query strings, session IDs etc.
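The "hash over a number of machines" point above can be sketched like this -- hash each URL's host to pick a bucket/machine, so all URLs of one host land on the same node (which also keeps politeness handling local). The cluster size and function names here are my own assumptions, not from any of the papers:

```python
import hashlib
from urllib.parse import urlsplit

NUM_MACHINES = 8  # assumed cluster size

def machine_for(url: str) -> int:
    """Route a URL to a machine by hashing its host, so one host's
    URLs always end up in the same bucket."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_MACHINES

# Usage: URLs from the same host always map to the same machine.
assert machine_for("http://example.com/a") == machine_for("http://example.com/b")
```

Because the mapping is a pure function of the host, every node can compute it independently -- no central lookup table needed.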
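For the seen/visited vs. todo URL problem, a common trick is to store fixed-size fingerprints instead of full URL strings. This is only a minimal in-memory sketch of the idea; a real crawler would use a disk-backed store or a Bloom filter to cope with billions of URLs:

```python
import hashlib

class SeenURLs:
    """Track visited URLs via fixed-size fingerprints.
    Sketch only: real crawlers need a disk-backed or probabilistic
    structure, not an unbounded in-memory set."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url: str) -> bool:
        """Return True if the URL was not seen before (and mark it seen)."""
        fp = hashlib.md5(url.encode("utf-8")).digest()
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True

todo = SeenURLs()
assert todo.add_if_new("http://example.com/") is True   # new -> crawl it
assert todo.add_if_new("http://example.com/") is False  # already seen
```

The 16-byte digest keeps memory per URL constant regardless of URL length, at the cost of a (vanishingly small) collision risk.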
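The robots.txt point ("cache a copy for each host") can be done with the standard library's `urllib.robotparser`. In this sketch the robots.txt body is passed in as a string so the example stays offline; a real crawler would fetch `http://<host>/robots.txt` on a cache miss. The agent name is made up:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

_robots_cache = {}  # host -> parsed robots.txt

def allowed(url: str, agent: str = "mybot", robots_txt: str = "") -> bool:
    """Check a URL against robots.txt, parsing each host's file only once."""
    host = urlsplit(url).netloc
    rp = _robots_cache.get(host)
    if rp is None:
        rp = RobotFileParser()
        rp.parse(robots_txt.splitlines())
        _robots_cache[host] = rp
    return rp.can_fetch(agent, url)

# Usage: a Disallow rule blocks matching paths; the parsed file is
# cached, so later URLs on the same host skip re-fetching/re-parsing.
rules = "User-agent: *\nDisallow: /private/"
assert allowed("http://example.com/private/x", robots_txt=rules) is False
assert allowed("http://example.com/index.html") is True
```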
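Weeding out duplicates caused by query strings and session IDs usually starts with URL canonicalization: lowercase the host, drop fragments and session parameters, and sort what remains, so trivial variants collapse to one key. The list of session parameter names below is an assumption you'd extend per site:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that identify a session, not a resource
# (assumed names; extend as needed).
SESSION_PARAMS = {"sessionid", "phpsessid", "sid", "jsessionid"}

def canonical(url: str) -> str:
    """Normalize a URL: lowercase host, strip fragment and session
    parameters, sort the remaining query parameters."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]
    query.sort()
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(query), ""))

# Two surface-different URLs collapse to the same canonical form:
assert canonical("HTTP://Example.com/a?b=2&PHPSESSID=xyz&a=1") == \
       canonical("http://example.com/a?a=1&b=2")
```

This only catches syntactic duplicates; content-level duplicates still need the edit-distance or shingling approaches mentioned above.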
Crawler design references (Papers)
The anatomy of a large-scale hypertextual Web search engine -- Brin, S. and Page, L. (1998).
The crawler's architecture is described in some detail, but the reference covers only an early version, based on C++ and Python. The crawler was integrated with the indexing process, because text parsing was done both for full-text indexing and for URL extraction. A URL server sends lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed back to the URL server, which checked whether each URL had been seen before; if not, it was added to the URL server's queue.
According to the Wikipedia article on Googlebot, there are at least two parallel bots, deepbot and freshbot. Deepbot, the deep crawler, tries to follow every link on the web and download as many pages as it can for the Google indexers; it completes this process about once a month. Freshbot crawls the web looking for fresh content, visiting frequently changing websites at a rate matching how often they change.
An adaptive model for optimizing performance of an incremental web crawler -- Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001).
Among other things, an analysis of how to handle obsolete and relevant URLs in a URL bucket.
Search Engines and Web Dynamics -- Risvik, K. M. and Michelsen, R. (2002).
From the makers of the FAST alltheweb.com crawler
Design and implementation of a high performance distributed web crawler. -- Shkapenyuk, V. and Suel, T. (2002).
Their "PolyBot" is a distributed crawler written in C++ and Python, composed of a "crawl manager", one or more "downloaders" and one or more "DNS resolvers". Collected URLs are added to an on-disk queue and processed later, checking for already-seen URLs in batch mode. The politeness policy considers both second- and third-level domains (e.g. www.example.com and www2.example.com are third-level domains) because different third-level domains are usually hosted by the same Web server.
Dominos: A New Web Crawler's Design -- Younès Hafri, Chabane Djeraba (2004).
Crawler design references (Projects, source-code and docs)
Nutch is a full-featured open-source search engine. It can do focused as well as web-wide crawls, relies on Hadoop for storage, and since it is open source you can peek at the code and find out about its secrets and algorithms. Searching the index is done via Lucene.
The Banana Tree, a search and spider project, is a bit dated, but Harry gives a lot of insight and sample Perl code! For example: indexing the internet, building a spider, and a primer on vector search. If you want to know more about vector search, read this and try Search::VectorSpace.
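The vector-space idea behind Search::VectorSpace can be illustrated with a minimal term-frequency cosine similarity -- a toy sketch of my own, with no stemming, stop words or IDF weighting:

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two documents represented as
    raw term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Identical documents score ~1.0, disjoint documents score 0.0:
assert math.isclose(cosine("web crawler design", "web crawler design"), 1.0)
assert cosine("web crawler", "cooking recipes") == 0.0
```

Real vector-space engines weight terms by inverse document frequency and keep the vectors in an inverted index instead of recomputing them per query.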
Wikia and Majestic12, among others, use distributed crawling techniques you could learn from. The Wikipedia article on Grub (which is now owned by Wikia) and the docs give a bit of insight. Even more interesting are Majestic12 and their very responsive message board.
There is a project related to Grub/Wikia to build an open index; see the IOI project for more.