Writing a large-scale web-crawling and web-indexing search engine from scratch is a large beast to tame and in many cases a project that is heading for desaster right from the start. As you can read in Alex's worklog for the ongoing effort to manage the Majestic-12 distributed search engine, writing a crawler alone can cost you a lot of time and hair, let alone the giant task to write all three modules: the crawler, the indexer and the search frontend... but let's start from the beginning.
Care about copyright! and some other essential things
Before you start, you will want to take some time and think about the implications and legal aspects of running a search engine. Crawling the web at high speed will put a lot of stress on networks and foreign servers. It's good to play by the rules. You should do the little extra work to make your robot a good robot and not let it behave like many of the spam-bots out there. As a good start, read the robot guidelines here.
OK, now after reading what a good robot should look like, let's think about a few more things - first of all: copyright. Crawling and thus downloading large portions of the web is in it's essence a breach of copyright. Many pages online forbid such procedures. BUT, they do allow google, yahoo and all the other search engines to crawl, download and index their sites. Why? Because these crawlers play by the rules. And the rules for a search engine are that, yes, you can download and store content but you should do it to send visitors to their sites. All storage of copyrighted content should only be done for the use of building the search index, it should be limited in terms of how long you will store this data and you should give proper indication of where and when you accessed a certain page.
The next thing to keep in mind is that your robot should follow the robots.txt rules. The robots.txt file is, as you already know from reading the "Guidelines for a well-behaved robot", a simple ruleset for machines visiting a server. The webmaster informs your agent with this simple text file about what part of a website may be indexed and, more important, what part may be not. So, by all means, follow the robot.txt rules. In perl, you can use a module like LWP::RobotUA , WWW::RobotRules or similar to effectively do so.
The last essential thing to keep in mind is: properly identify your crawler as a search engine crawler in the User Agent String! This page here has a list of the most important robots out there. You can see: everyone else does it and it's good practice, so do it as well. Be sure to identify your crawler, for example, like this:
require LWP::UserAgent; my $ua = LWP::UserAgent->new; $ua->agent('My Crawler. Visit http://www.foo.bar/robot.html for more information');
and please also include some email, url or similar on how you can be contacted so that a webmaster who notices your crawler running wild can contact you and thus help resolve the issue.
The basic design of a search engine
When thinking of how a search engine will work and how you will actually implement all that knowledge, you will soon discover that a modern search engine consists of several specialised components. Read about it in Howstuffworks or The Anatomy of a Large-Scale Hypertextual Web Search Engine by the Google Guys - whatever you do: get used to reading stuff like that. As I pointed out earlier, searching is pushing the limits of modern computing, so get used to reading scientific stuff like that.
Now, as a search engine might appear to be a simplistic construct, it is really a complex thing to master. And as each part of that complex thing is destinned to do certain very special things, you can break the complex of a search engine down into two, or as I like to see it, three major parts: the crawler, the indexer/merger and the frontnend .
The crawler, spider or web robot is your content aggregator. It scours the web and systematically downloads websites, discovers links, follows these extracted links and again downloads these newly discovered resources. While traversing the web, a crawler will exncounter all kinds of things. You must decide on the protocols the crawler supports (http only, or mms, ftp, etc as well), wheather it downloads only text sources (html, php, asf and all that other extensions) and how it recognizes these content types.
A good start to give the crawler a certain knowledge about the resource is to have it look at the actual extension, like the .html ending. But many resources have no ending at all, like the "http://www.foo.bar/" root page and the like. A more cleverly approach would be to look at the content-type the server sends and actually have the crawler itself send the
HTTP_ACCEPT header "
While speaking of headers, a good idea would be to implement support for compressed/gzipped pages. Many servers support this feature and you can download pages in a fraction of the uncompressed size/speed. But keep in mind that the perl LWP library doesn't handle gzipped pages by default. You need to have Compress::Zlib and the associated libraries installed and content needs to be accessed via the $response->decoded_content function of LWP.
The next thing is Encoding. Once there were 255 characters on one codepage, that were the 50s, 60s and 70s. A bit later, computer science discovered that there were more characters than the once defined characters of the ASCII alphabet, like Umlauts and such. The internet happend right inbetween that and the next iteration was an ugly hack that is called Character Encoding. Yes, there were 255 positions on that codepage to fill with letters, but now each Country/Language used different characters referenced by the same id on the the codepage. You can only know which codepage to use by labelling it with names like latin-1 (or iso-8859-1), SHIFT_JIS etc. A disaster. When browsers arrived it got even worse. A thing called HTML Entities was instroduced. This enabled users to use the raw US-ASCII codepage and render umlauts and special characters with the help of descriptive patterns like " "
Finally, UTF-8 saw the light of day. The idea was, as can be read here, to use two bits instead of just one, thus creating a much larger codepage with space for every imaginable character, making Codepages and Entities a thing of the past. Anyway, there a re many pages out there that could be server in plain utf8 but are still using HTML Entities on top of that, some claim to be in us-ascii while they are in reality latin-1 encoded and on and on and on.
The lesson you can learn from that is that your crawler should be able to read and recognize all that, as a last resort by guessing the used encoding. A good idea would be to convert everything to UTF-8. You can use decoded_content() for that and all the tools of the Encode module. One thing: use a newer, utf-8 ware version of perl if you'd like to do so.
After your crawler has finally successfully fetched some content (more pitfalls here: 404 errors, redirections, revisits you need to schedule for later) you need to parse the content in one way or another. Use a readily available HTML parser for that, writing a parser is a road you won't go down! Nested data structures are a thing that can make a man mad. And: Don't even think about using regular expressions/regexes for anything in that direction!
The last step is link extraction: You need to decide wheater your crawler inits the linkextract or if a dedicated process does so. How will your extor handle or detect already visited URLs: a bloom filter? a sorting algorithm with uniq? Think about that!
And how do you store all these URLs and all this content for later analysis? You can read about the limits of traditional storage below, but be sure to have a clever structure for the organisation of all this data. My advice, without giving too much away, would be to use buckets and chunks of data. How exactly you can implement this is another homework left for you, but make sure you use compression, some kind of a bucket index (maybe blend in some SQL here, but only some, see my comments below) and clever sorting/footprint reduction techniques for the URL lists may also help.
Puh, you made it throught, everything works. But what is that? It is sloooow! Now comes the interesting part, it works, but it is an expensive operation. More homework here: DNS lookups needs to be cached, utf8 operations/detection are too expensive, the parser is a resource hog... Some requests time out and all of that needs to be scalable, concurrent and running on multiple machines and in multiple instances. You see, the crawler alone can cost you hair...
>> See the post Search engine crawler references for more in-depth input on this part of a search engine.
OK, you've made it this far. You have extracted thousands of unique urls, visited them and downloaded the content wich is really only textual content. The next step is now: indexing it.
Most search engines, at least all the seriously large ones, don't use techniques to really do fulltext searches on the content. Opening all those files and running regexes on the text in them would takes ages once you surpassed the mark of a few hundred content objects. The solution to this problem is turning the whole idea upside down and using what is called an inverted index.
An inverted index is like the index in a book. Texts are analysed and your process stores each word, each new word it finds, in a kind of dictionary, remembering the position of the word on-the-way. This dictionary, upon search, is used to identify the objects or positions within your body of content that match your keyword and are used to compute the results list. On a search tat consists of multiple keywords, the searcher identifies the matching entries/objects for each of thos words and then identifies the intersectioning entries.
So, write a routine that harvests all single words from your fetched content and compiles a list of what is called postings. Then arrange your content in a content db accompanying your word index.
As simple as it sounds, as difficult it is in reality. How do you differentiate between "book" and "books", will it be the same word or will it be regarded as different concepts upon search. The quality of your index is crucial for the quality of your search engine's results. A user looking for "short films" should get similar matches when looking for "shortfilms" - at least in theory.
Writing a good indexer is as demanding as writing a good crawler. Just some keywords to get you started: you should be aware to linguistic morphology, stemming, a fixed dictionary might help with popularity ranking, synonyms and concepts. Everybody can write a script that downloads stuff from the web (ok it's a bit more demanding), but actually making sense of it is another cup of tea and still a big issue in information retrieval and research.
Actually, the functionality described above is the more traditional approach to searching. It is fast for a keywords based search, but nearly useless when you need to compute associations (like a human would do), similar objects or for answering semantic search requests. So it is up to you to invent the next best search engine. Tell me about your successes. It might be a Vector Search approach, a semantic engine, or a mixture of new techniques.
Anyway, the result of your indexer will, in most cases, be a kind of solid search index, a large construct of your dictionary and your indexed content - in one way or another. This search index will be accessed by your frontend for the search results.
Also, your index needs to be updated, re-checked (if indexed sources still exist) and merged with newly discovered content. Merging an index with a new chunck of data is an expensive operation. You will need to develop routines or systems to do it in a quick and feasable manner. Solutions might be to use multiple larger indexes, incremental merging or the like. An interesting problem to mediate about.
The web front-end (presentation layer)
Actually, developing the web frontend is the easiest part. In most cases you will have guidelines in terms of design and layout that were developed in conjunction with the overall userexperience that was chosen for the project and the overall corporate identity tha twill be in effect. Remember that you should inform the user about returned objects as much as possible while being brief and simple. And "form follows function"! Google is a good example for a text search engine that's on concise on text results. The results layout of video search engine Lumerias is also simplistic by design but has some extensions and modifications that give respect to the nature of presenting video results.
Talking mainly about design here is due to the fact that the underlying technology for the search should already be available from you writing the indexer. The searcher component is the corresponding negative (or positive, depends on how you put it) to your indexer technology. The searcher reverses the algorithm of your indexer and presents matching results based on the user's input. As simple as that.
The limits of current hardware and software
When you start to write your code, regardless of the language you do it in, you will soon reach the limits of what's possible in information retrieval running code on current hardware and the way software works today. Your piece of software will munge billions of urls, will see thousands of doublettes, errors and pitfalls and will generate Terabytes of traffic - so be prepared for a bumpy ride. And this isn't only the case when you try to do it all in perl...
Rule 1: Filesystem is always too slow
When thinking about storing stuff, your first idea might be: "let's just save it to filesystem", one data object = one file. Nice idea, but a mistake! The filesystem mgith work similar to a database but it's actually too slow when you rely on it in thousands of operations. As a rule of thumb you should try to avoid filesystem accesses as good as you can. Load chunks of data from disk into memory, process them and write results back. But do it chunck-wise and not for every object. That's the way to go!
Rule 2: SQL is too slow
Regardless of how many indexes you create, how optimized your queries are and how fast and well-scaled your MySQL or whatever server might be: SQL will always be too slow for the job! SQL might be great to do complicated queries but its absolutely useless to store URLs in it or store page content or anything that will go into the hundreds of thousands of objects. Believe me, I've been there. In-memory databases are no cure. And even committing 1,000 pooled queries at once won't give you that extra speed you'll need. Keep it simple! (repeat two times) and just use flat-files combined with a system of chunks of data. Process them one by one and you will be a big step closer to your goal of actually getting it to run longer than a few hours without out-of-memory of a frozen system.
Rule 3: Storage is always too small
After running your search engine for a few hour, days, weeks. Sooner or later you will run out of storage. Your first reflex might be to add a few drives. The common idea is to add a physical abstraction layer between your OS and your drives and virtualize your storage volumes. Nice idea, but you will see that when you keep adding drives each day, there will come a day when that mean time between failure rate (which appears to be quite low from a personal perspective) will starting to hit you on the professional level with thousands of drives spinning. So by all means, expect failure and implement measures to face them. Google uses the Google File System, a cluster of commodity hardware that doesn't try to be flawless but instead backs up itself and makes desaster recovery for single nodes a simple process. Learn from that and use the Hadoop framework for the heavy lifting. The adventurous might as well outsource the storage part to Amazon's EC2 and S3 services, but be prepared for some hefty invoices when doing so on a search service.
I hope this small quide is a good start for you, the aspiring search engine coder. I was involved in the development of the video search engine Lumerias.com and these are just a few of the issues we faced while developing that beast. This article is far from complete.
Again, if you would like to learn more about search engines, be sure to milk citeseer for all whitepapers you can get, you might as well read Anna's amusing article on how to write a search service over at ACM. The out-of-the-box search engine Lucene can be a good source of inspiration. Are you crawling the whole web, or will you run a focused crawler? Nutch can be a good point to start without writing a single line of code.