Insights from Our Experts
Crawl without getting busted!
To a certain extent, crawling is easy and well understood, not to mention what we can do with the data we obtain: converting it into other formats and representations, or searching through it within seconds. Retrieving results from Google is the classic example.
So far we have only discussed the bright side of crawling. But there are situations where the data is inaccessible or dynamic in nature, or, in the worst case, where the crawler gets blocked or banned. A crawler should always be smart and focused on what it wants, and should not be distracted from that aim.
Check your speed and the load on the site
Crawler bots are fast. A bot can crawl a number of URLs in parallel within a short span of time: it requests one URL, moves on to the next, and repeats until the destination is reached. This raises the load and the network traffic on the site to a considerably high level. Too much load looks suspicious and may end in a block; the extra traffic, in turn, makes site owners smell something fishy.
Possible solutions to this problem are to reduce the request rate or to add a delay between requests. Simultaneous page accesses can be reduced to two or three pages at a time.
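The delay idea can be sketched as a small throttle that enforces a minimum gap between consecutive requests. The 2-second delay below is a hypothetical value; tune it per site.

```python
import threading
import time

class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, delay=2.0):
        self.delay = delay
        self.last = 0.0
        self.lock = threading.Lock()

    def wait(self):
        """Block until at least `delay` seconds have passed since the last call."""
        with self.lock:
            elapsed = time.monotonic() - self.last
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
            self.last = time.monotonic()

# A crawler would call throttle.wait() before each fetch; sharing one
# Throttle across two or three worker threads also caps concurrency pressure.
throttle = Throttle(delay=2.0)
```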
Avoid potentially sensitive areas
Don't ever become a spammer. Always respect robots.txt. Even Googlebot can't access some sites. Website owners use the robots.txt file to tell web robots whether the site wants to be crawled at all and, if so, which pages may be crawled and indexed. Obey its Disallow rules, along with nofollow and noindex directives in page meta tags.
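Python's standard library can check robots.txt rules for you. This is a minimal sketch: it parses an inline example file rather than fetching a real one (normally you would call set_url() and read()), and "mybot" is a hypothetical user-agent name.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse an inline robots.txt so the snippet is self-contained.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Ask before fetching: disallowed paths return False.
print(rp.can_fetch("mybot", "https://example.com/private/page"))  # → False
print(rp.can_fetch("mybot", "https://example.com/public/page"))   # → True
```

A well-behaved crawler runs every candidate URL through can_fetch() and silently drops the disallowed ones.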
Even when robots.txt allows crawling, sites may set traps that help the owner distinguish a bot from a human. As a crawler traverses each URL on a site, some responses may be redundant, and some destinations may be inaccurate or blank. If we are caching the data, these redundant or unwanted requests consume memory. Other traps include infinite loops that keep a request stuck in one place, and dummy URLs. Memory is wasted on this unwanted looping and on parsing irrelevant pages.
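Two cheap defences against loop traps are remembering which URLs have already been visited and capping how deep a chain of links the crawler will follow. A minimal sketch (MAX_DEPTH and the get_links callback are assumptions, not part of any particular library):

```python
from urllib.parse import urldefrag

MAX_DEPTH = 5  # hypothetical cap on link-chain depth; tune per site

def crawl(start_url, get_links):
    """Breadth-first crawl that avoids loops and runaway URL chains.

    `get_links(url)` is assumed to return the links found on that page.
    """
    seen = set()
    frontier = [(start_url, 0)]
    order = []
    while frontier:
        url, depth = frontier.pop(0)
        url, _ = urldefrag(url)        # drop #fragments: same page, "new" URL
        if url in seen or depth > MAX_DEPTH:
            continue                   # skip revisits and suspiciously deep chains
        seen.add(url)
        order.append(url)
        for link in get_links(url):
            frontier.append((link, depth + 1))
    return order
```

The `seen` set breaks infinite loops between pages that link to each other, and the depth cap stops calendar-style traps that generate endless dummy URLs.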
To defend against crawling, site owners use anti-scraping tools such as Botdefender.
A honeypot is a technique in which designers place links that only a crawler bot would fetch, never a normal user. Make your bot recognise these traps, or it will be blacklisted.
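One common honeypot pattern is a link hidden with CSS, which a human can never see or click. A minimal sketch of filtering such links out, assuming inline `style` attributes (real pages may hide links via stylesheets, which this does not catch):

```python
from html.parser import HTMLParser

class VisibleLinkParser(HTMLParser):
    """Collect hrefs, skipping anchors styled to be invisible."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        style = (a.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            return  # likely a honeypot: a human could never click this link
        if "href" in a:
            self.links.append(a["href"])
```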
Always be different
Robots follow the same crawling pattern, and sites with intelligent anti-crawling mechanisms can easily detect spiders by finding patterns in their actions. Humans don't perform tasks that repetitively. Incorporate some random clicks on the page, mouse movements, and other random actions that make a spider look like a human client.
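For a plain HTTP crawler (no browser to click in), the same idea reduces to randomising what a site can observe: the order of requests and the gaps between them. A minimal sketch, where the delay values are hypothetical and `fetch` is whatever function performs the request:

```python
import random
import time

def polite_visit(urls, fetch, base_delay=1.0, jitter=2.0):
    """Visit pages in shuffled order with randomised pauses,
    so neither the sequence nor the timing forms a machine-like pattern."""
    order = list(urls)
    random.shuffle(order)  # avoid a fixed, predictable crawl sequence
    for url in order:
        fetch(url)
        time.sleep(base_delay + random.uniform(0, jitter))
```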
Use a pool of different IP addresses
Initiating a large number of requests from the same IP creates suspicion and eventually gets that IP blocked. Once blocked, stop crawling from the same address. Several crawling services provide multiple IPs so that each request is made from a different IP, which creates the impression that the requests come from several systems.
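Rotating through a proxy pool can be sketched with the standard library alone. The proxy addresses below are made up; in practice they would come from a commercial rotating-proxy service.

```python
import itertools
import urllib.request

# Hypothetical proxy pool; substitute addresses from a real proxy provider.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]
proxy_pool = itertools.cycle(PROXIES)

def fetch_via_proxy(url):
    """Fetch a URL, routing each call through the next proxy in round-robin order."""
    proxy = next(proxy_pool)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    return opener.open(url)  # each call appears to come from a different IP
```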
Most sites want their data to be crawled; that is how they gain reach. But don't take more than your share: a decent crawler should know its limits.