Would you like to retrieve and parse the contents of a remote web page with ASP, maybe extract and index all the links? Maybe you're planning to build your own search engine and be the next big Google competitor ;). Well, this function will show you how to do it with ASP.
You will need one or two things. To retrieve the pages, you'll be using MSXML 4.0. If you use an older version, you may run into a problem with responseText, where all special/foreign/accented characters get replaced with '?' question marks. That's an encoding issue, and MSXML 4.0 solves it.
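If you are stuck on an older version, one possible workaround is to take the raw responseBody instead and decode it yourself through an ADODB.Stream. The function name bytesToText is my own, and the charset you pass in is an assumption; check what the page actually declares in its headers or meta tag.
'=== sketch of a workaround for older MSXML versions: decode the raw bytes yourself
'=== pass in objXML.responseBody and the charset the page declares (assumption)
function bytesToText(binBody, strCharset)
	dim objStream
	set objStream = Server.CreateObject("ADODB.Stream")
	objStream.Type = 1	'=== binary
	objStream.Open
	objStream.Write binBody
	objStream.Position = 0
	objStream.Type = 2	'=== switch to text once we're back at position 0
	objStream.Charset = strCharset
	bytesToText = objStream.ReadText
	objStream.Close
	set objStream = nothing
end function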
If you are behind a proxy server and you use ServerXMLHTTP, you will get the error "Access Denied" or "The server name or address cannot be resolved". You need proxycfg. Run it from the command line like this: "proxycfg -u", and it will copy your proxy settings from IE.
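For reference, the usual proxycfg invocations look something like this (the proxy name and port are placeholders, substitute your own):
proxycfg -u                  (copies the proxy settings from IE)
proxycfg -p "myproxy:8080"   (points WinHTTP at a specific proxy)
proxycfg -d                  (switches back to direct access, no proxy)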
So here's the function to get the remote page:
'=== grab a web page, return it as a string
function getPage(strURL)
	dim strBody, objXML
	set objXML = Server.CreateObject("MSXML2.ServerXMLHTTP.4.0")
	objXML.Open "GET", strURL, False	'=== synchronous GET
	'objXML.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" '=== spoof the user agent
	'objXML.setRequestHeader "Content-Type", "text/html; charset=ISO-8859-1"
	'objXML.setRequestHeader "Content-Type", "text/html; charset=UTF-8"
	objXML.Send
	strBody = objXML.responseText
	set objXML = nothing
	getPage = strBody
end function
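A quick usage sketch (the URL is just a placeholder):
dim strHTML
strHTML = getPage("http://www.example.com/")
Response.Write Server.HTMLEncode(Left(strHTML, 500))	'=== show the first 500 characters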
Do what you want with the contents of the page. If you want to build it into a spider, extract all the links into an array, either with a regular expression or by splitting the page at the '<a' tags. The regular-expression route looks something like the sketch below.
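Here's a minimal sketch of the link extraction; the getLinks name is my own, and the pattern is deliberately simple, so treat it as a starting point rather than a bulletproof parser.
'=== pull all href values out of a page into an array (simple pattern, not bulletproof)
function getLinks(strHTML)
	dim objRE, colMatches, objMatch, arrLinks(), i
	set objRE = New RegExp
	objRE.Pattern = "href\s*=\s*[""']([^""'#>]+)[""']"
	objRE.IgnoreCase = True
	objRE.Global = True
	set colMatches = objRE.Execute(strHTML)
	redim arrLinks(colMatches.Count)	'=== sized one extra so an empty result still works
	i = 0
	for each objMatch in colMatches
		arrLinks(i) = objMatch.SubMatches(0)	'=== the captured URL
		i = i + 1
	next
	getLinks = arrLinks
end function
From there you would want to resolve any relative URLs against the page's own address before feeding them back into getPage.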