I just found out that my free link checking tool is being blocked by some websites. My guess is that it's because it's sending a blank User-Agent string. I'm going to have to spoof it and say it's Firefox or something. Here's how to do that with cURL and PHP:
// spoofing Firefox 2.0
$useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
$ch = curl_init();
// set the user agent
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
// set the rest of your cURL options here
7 Comments
I can confirm this is the case when trying to scrape Google. With a user agent, the organic results are between <!-- and -->. Without one, they are not included.

Between <!-- and -->, that is. (The markers got stripped from my last comment.)

Between HTML comments!
Nice. Why not take it a step further and change your user agent to something random between each download?
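That suggestion could look something like this: keep a small pool of user-agent strings and pick one at random per request. The list of strings and the target URL here are just placeholders.

```php
<?php
// Hypothetical pool of browser user-agent strings to rotate through.
$userAgents = [
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Opera/9.10 (Windows NT 5.1; U; en)",
];

// Pick a random user agent for each download.
function randomUserAgent(array $agents) {
    return $agents[array_rand($agents)];
}

$ch = curl_init("http://example.com/");
curl_setopt($ch, CURLOPT_USERAGENT, randomUserAgent($userAgents));
// set the rest of your cURL options here, then call curl_exec() as usual
```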
Cool Tip!
I just created some applications using cURL, e.g. to log in to SSL sites and to scrape some affiliate statistics, but I didn't know that it's possible to set the user agent.
Although for pinging a site, the system("ping… command is much faster than using cURL.
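For what it's worth, shelling out to ping from PHP might look like this sketch (the hostname is a placeholder, and the -c flag assumes a Unix-like system; Windows uses -n instead):

```php
<?php
// Hypothetical reachability check: shell out to ping instead of making a cURL request.
$host = "example.com";
// -c 1: send a single packet; $status is 0 if the host replied.
exec("ping -c 1 " . escapeshellarg($host), $output, $status);
echo $status === 0 ? "$host is reachable\n" : "$host did not respond\n";
```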
Thanks, your tip came in handy, as I'm working on an app that needs to set the user agent.
I had created a scraping script in PHP that worked perfectly with file_get_contents, before it was blocked.
Do you have any tips for making this script more effective, as I still cannot scrape?
I'm confident, though, that cURL has got to be the solution I've been looking for.
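A blocked file_get_contents scraper can often be swapped for a cURL fetch that sends a browser user agent, along the lines of this sketch (fetchPage is a made-up helper name, and the user-agent string is the Firefox one from the post):

```php
<?php
// Sketch of a file_get_contents() replacement using cURL with a spoofed user agent.
function fetchPage($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT,
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects like a browser would
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
```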