A great free PHP spider and search engine

Recently I was tasked with creating a search engine for our intranet. The difficulty was that none of the content was in a database; it was all on web pages or in Apache directory listings. So I needed a spider, written in PHP, that could crawl the intranet, pick up all the links and page content, and index it by keyword in a database.
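To make "index it by keyword in a database" a little more concrete, here is a minimal sketch of the idea. It is written in Python rather than PHP purely for brevity, and it is not how Sphider does it: the table name `keyword_index`, the example URLs and the helper functions are all made up for illustration, and the pages are assumed to be already fetched and stripped of markup.

```python
import re
import sqlite3


def build_index(pages):
    """Store one (keyword, url, hits) row per distinct word on each page."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE keyword_index (keyword TEXT, url TEXT, hits INTEGER)")
    for url, text in pages.items():
        counts = {}
        # Split the page text into lowercase alphanumeric words and count them.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            counts[word] = counts.get(word, 0) + 1
        db.executemany(
            "INSERT INTO keyword_index VALUES (?, ?, ?)",
            [(word, url, n) for word, n in counts.items()],
        )
    db.commit()
    return db


def search(db, keyword):
    """Return URLs containing the keyword, most occurrences first."""
    rows = db.execute(
        "SELECT url FROM keyword_index WHERE keyword = ? ORDER BY hits DESC, url",
        (keyword.lower(),),
    )
    return [url for (url,) in rows]


# Hypothetical intranet pages (placeholder URLs and text).
pages = {
    "http://intranet.example/hr": "Holiday policy and holiday request forms",
    "http://intranet.example/it": "Password policy for staff",
}
db = build_index(pages)
print(search(db, "holiday"))
```

A real engine would also need ranking, stop words and incremental re-indexing, which is a large part of why an existing tool is attractive.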

There were also many PDFs, Word documents and Excel spreadsheets that needed to be indexed. I figured that I could spend hours building a spider, days finding and writing code to read PDF, DOC and XLS files, and then weeks or months improving the engine to add features like those found in Google. Or I could just look for something already built!

So I looked around for an open source spider and stumbled across Sphider. It's open source, extremely easy to install, and has the following features:

– Full text indexing.
– Can index both static and dynamic pages.
– Finds links in <a href=…>, <frame …>, <area …> and <meta …> tags, and can also follow links given in JavaScript as strings via window.location and window.open.
– Respects robots.txt protocol.
– Follows server side redirections.
– Allows spidering to be limited by depth (i.e. maximum number of clicks from the starting page), by (sub)domain or by directory.
– Supports indexing of PDF and DOC files (using external binaries for file conversion).
– Allows resuming paused spidering.
– Possibility to exclude common words from being indexed.
– Sophisticated administrator interface
– Supports AND, OR and phrase searches
– Supports excluding words (by putting a '-' in front of a word, any page including the word will be omitted from the results).
– Option to add and group sites into categories
– Possibility to limit searching to a given category and its subcategories.
– "Did you mean" search suggestion on mistyped queries.
– Context-sensitive auto-completion on search terms (à la Google Suggest)
– Word stemming for English (searching for "run" finds "running", "runs" etc)
– AJAX suggestion list like Google Suggest feature
– Parses and indexes .doc, .pdf, .ppt, and .xls files by using free binaries
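Two of the features above, link finding and limiting a crawl by depth, are easy to picture with a small sketch. The following is an illustration in Python, not Sphider's PHP code: the `LinkParser` class, the `crawl` function and the in-memory `site` dictionary are all hypothetical, and the sketch deliberately ignores robots.txt, <meta> redirects and JavaScript links.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect link targets from <a href>, <area href> and <frame src> tags."""

    LINK_ATTRS = {"a": "href", "area": "href", "frame": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = self.LINK_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, value))


def crawl(site, start, max_depth):
    """Breadth-first crawl of `site` (a dict of URL -> HTML), limited by depth.

    Returns a dict mapping each discovered URL to its click distance from `start`.
    """
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        # Do not expand pages at the depth limit or pages we have no HTML for.
        if depth_of[url] == max_depth or url not in site:
            continue
        parser = LinkParser(url)
        parser.feed(site[url])
        for link in parser.links:
            if link not in depth_of:
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of


# Hypothetical three-page site held in memory instead of being fetched over HTTP.
site = {
    "http://intranet.example/": '<a href="/docs/">Docs</a> <area href="map.html">',
    "http://intranet.example/docs/": '<a href="deep.html">Deep</a>',
    "http://intranet.example/docs/deep.html": '<a href="/">Home</a>',
}
print(crawl(site, "http://intranet.example/", 1))
```

With a depth limit of 1, the deepest page is never discovered; raising the limit to 2 brings it in. Using a breadth-first queue means the recorded depth is always the minimum number of clicks from the starting page.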

It's an excellent tool, and I highly recommend it.

2 Comments

  1. Excellent, exactly what I was looking for – although I don't want to create a search engine – I just need the spider part – and I'm hoping its flat file..I can now start parsing for content 😀

  2. Emmanuel Jean-François Jotterand

    Where can I find this tool? The linked page is no more available… Thank you for your response.
