Recently I was tasked with creating a search engine for our Intranet system. The difficulty was that none of the content was in a database; it was all on web pages or in Apache directory indexes. So it was necessary to create a spider in PHP that could crawl the Intranet, pick up all the links and web content, and index it by keyword in a database.
There were also many PDFs, Word documents and Excel spreadsheets that needed to be indexed. So I figured that I could spend hours building a spider, days finding and building code to read PDF, DOC and XLS files, and then weeks or months improving the engine to add features like those we find in Google. Or I could just look for something already built!
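To give a sense of what building such a spider from scratch involves, here is a minimal sketch of just the link-extraction step in PHP. The function name, the naive relative-URL handling, and the example Intranet URL are all my own illustration, not Sphider's code:

```php
<?php
// Illustrative sketch (not Sphider's code): pull the links out of a
// fetched page so the spider knows which URLs to crawl next.
function extract_links(string $html, string $base_url): array
{
    $doc = new DOMDocument();
    // Suppress warnings from real-world malformed HTML.
    @$doc->loadHTML($html);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href === '' || str_starts_with($href, '#')) {
            continue; // skip empty and same-page fragment links
        }
        // Resolve relative links against the base URL (naive version;
        // a real spider also needs to handle ../ paths, query strings, etc.).
        if (!preg_match('#^https?://#', $href)) {
            $href = rtrim($base_url, '/') . '/' . ltrim($href, '/');
        }
        $links[] = $href;
    }
    return array_unique($links);
}

$html = '<html><body><a href="/docs/manual.pdf">Manual</a>'
      . '<a href="http://intranet.example/wiki">Wiki</a></body></html>';
print_r(extract_links($html, 'http://intranet.example'));
```

And that is before fetching pages, storing keywords, or respecting robots.txt — which is exactly why an existing tool looked attractive.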
So I looked around for an Open Source spider and stumbled across Sphider. It's Open Source and extremely easy to install, and it has the following features:
– Full text indexing.
– Can index both static and dynamic pages.
– Respects robots.txt protocol.
– Follows server side redirections.
– Allows spidering to be limited by depth (i.e. the maximum number of clicks from the starting page), by (sub)domain or by directory.
– Supports indexing of .pdf, .doc, .ppt and .xls files (using free external binaries for file conversion).
– Allows resuming paused spidering.
– Possibility to exclude common words from being indexed.
– Sophisticated administrator interface.
– Supports AND, OR and phrase searches.
– Supports excluding words (putting a '-' in front of a word omits any page containing that word from the results).
– Option to add and group sites into categories.
– Possibility to limit searching to a given category and its subcategories.
– "Did you mean" search suggestion on mistyped queries.
– Context-sensitive auto-completion of search terms (à la Google Suggest).
– Word stemming for English (searching for "run" finds "running", "runs", etc.).
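As a sketch of how the word-exclusion syntax above might work, a query like `search -java` can be split into required and excluded terms before matching pages. This parser is my own illustration of the concept, not Sphider's actual implementation:

```php
<?php
// Illustrative sketch (not Sphider's code): split a query into
// required terms and terms excluded with a leading '-'.
function parse_query(string $query): array
{
    $include = [];
    $exclude = [];
    foreach (preg_split('/\s+/', trim($query)) as $term) {
        if ($term === '') {
            continue;
        }
        if ($term[0] === '-' && strlen($term) > 1) {
            $exclude[] = substr($term, 1);
        } else {
            $include[] = $term;
        }
    }
    return ['include' => $include, 'exclude' => $exclude];
}

// A page matches when it contains every required term and
// none of the excluded terms (case-insensitive).
function page_matches(string $text, array $parsed): bool
{
    foreach ($parsed['include'] as $term) {
        if (stripos($text, $term) === false) {
            return false;
        }
    }
    foreach ($parsed['exclude'] as $term) {
        if (stripos($text, $term) !== false) {
            return false;
        }
    }
    return true;
}

$q = parse_query('search engine -java');
var_dump(page_matches('A PHP search engine for the Intranet', $q)); // true
var_dump(page_matches('A Java search engine', $q));                 // false
```

In practice the matching would run against the keyword index in the database rather than raw page text, but the include/exclude split is the same idea.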
It's an excellent tool; I highly recommend it.