How to Write a Spider/Bot With ASP

Would you like to retrieve and parse the contents of a remote web page with ASP, maybe to extract and index all the links? Maybe you're planning to build your own search engine and be the next big Google competitor ;). Well, this function shows you the first step: fetching the remote page with ASP.

You will need one or two items. To retrieve the pages, you'll be using MSXML 4.0. If you use an older version, you may run into a problem with responseText, where all special/foreign/accented characters are replaced with '?' question marks. This is due to the encoding, and MSXML 4.0 solves that.

If you are behind a proxy server and you use ServerXMLHTTP, you will get the error "Access Denied" or "The server name or address cannot be resolved". In that case you need proxycfg. Run it from the command line like this: "proxycfg -u", and it will copy your proxy settings from IE.
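
If running proxycfg on the server is not an option, ServerXMLHTTP also has a setProxy method that names the proxy for a single request. The following is only a sketch under assumptions: getPageViaProxy is a hypothetical function name, and "myproxy:8080" is a placeholder you would replace with your own proxy address.

```vbscript
'=== hypothetical helper: fetch a page through an explicitly named proxy
function getPageViaProxy(strURL, strProxy)
	dim objXML
	const SXH_PROXY_SET_PROXY = 2 '=== use the proxy server given in the next argument

	set objXML = CreateObject("MSXML2.ServerXMLHTTP.4.0")
	objXML.setProxy SXH_PROXY_SET_PROXY, strProxy, "" '=== empty bypass list
	objXML.Open "GET", strURL, False
	objXML.Send
	getPageViaProxy = objXML.responseText
	set objXML = nothing
end function

'=== "myproxy:8080" is a placeholder, not a real server
'strBody = getPageViaProxy("http://www.example.com/", "myproxy:8080")
```

Setting the proxy in code keeps the setting with the script, while proxycfg changes it for the whole machine. Which one suits you depends on how much control you have over the server.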

So here's the function to get the remote page:


'=== grab a web page, return its body as a string
function getPage(strURL)
	dim strBody, objXML

	'=== ask for MSXML 4.0 explicitly so accented characters survive in responseText
	set objXML = CreateObject("MSXML2.ServerXMLHTTP.4.0")
	objXML.Open "GET", strURL, False '=== False = synchronous, Send waits for the response
	'=== optional request headers, uncomment as needed
	'objXML.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" '=== falsify the agent
	'objXML.setRequestHeader "Content-Type", "text/html; charset=ISO-8859-1"
	'objXML.setRequestHeader "Content-Type", "text/html; charset=UTF-8"
	objXML.Send
	strBody = objXML.responseText
	set objXML = nothing

	getPage = strBody
end function

Do what you want with the contents of the page. If you want to build it into a spider, extract all the links into an array with either a regular expression or by splitting at '
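
The regular-expression route could look something like this. It is a minimal sketch, not a complete parser: getLinks is a hypothetical helper name, and the pattern only catches quoted href values inside <a> tags, so unquoted or script-generated links are missed.

```vbscript
'=== hypothetical helper: return the href values found in an HTML string as an array
function getLinks(strHTML)
	dim objRegExp, objMatches, objMatch, arrLinks(), i

	set objRegExp = New RegExp
	'=== matches <a ... href="..."> or <a ... href='...'> and captures the address
	objRegExp.Pattern = "<a\s[^>]*href\s*=\s*[""']([^""']*)[""']"
	objRegExp.IgnoreCase = True
	objRegExp.Global = True '=== find every link, not only the first

	set objMatches = objRegExp.Execute(strHTML)
	if objMatches.Count = 0 then
		getLinks = Array() '=== empty array when the page has no links
		exit function
	end if

	redim arrLinks(objMatches.Count - 1)
	i = 0
	for each objMatch in objMatches
		arrLinks(i) = objMatch.SubMatches(0) '=== first capture group = the href value
		i = i + 1
	next

	getLinks = arrLinks
end function
```

You could then loop over getLinks(getPage(strURL)) and queue each address for the next fetch. Relative links such as /page.asp would still need to be resolved against the URL of the page they came from.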


1 Comment

  1. Hi, I don't suppose you can write one for PHP language also?

    Also, once the spiders retrieve content, how would it know what sections of the search engine to put it in?

    Thanks
