Scrapinghub is one of the most useful platforms for web crawling, and it provides many services. It simplifies our data scraping efforts and runs consistently and reliably. Scrapinghub allows us to deploy our spiders instantly and scale them on demand, without having to manage servers, backups, or cron jobs. Everything is stored in a highly available database and is retrievable through an API.
At the beginning of my career in scraping, I used to scrape all data locally. That process is easy when only a small amount of data needs to be scraped. But for bulk amounts of data and continuous monitoring, we really need a service where we can deploy our code and run it without interruption. Scrapinghub has helped me overcome these issues: deploying code and monitoring it is now easy. In some cases the server we crawl puts up blockers, for example banning our IP or refusing continuous requests from the same IP. Here Scrapinghub played a vital role through its Crawlera service, which helped me get past most kinds of server blockers.
Scrapinghub's services simplify the effort of deploying and running our Scrapy projects; our spiders can be deployed easily from the command line. Ref. http://doc.scrapinghub.com/scrapy-cloud.html
We can schedule multiple spiders at a time in Scrapinghub, and they execute in parallel. From the platform we can easily check for errors and verify the scraped items.
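Scheduling can also be scripted through the python-scrapinghub client library. The sketch below assumes that library is installed (`pip install scrapinghub`); the API key, project ID, and spider names are placeholders, not real values.

```python
def run_spiders(project, spider_names):
    """Schedule one job per spider name; Scrapy Cloud runs the jobs in parallel."""
    return [project.jobs.run(name) for name in spider_names]

if __name__ == "__main__":
    # Imported here so the helper above stays usable without the library installed.
    from scrapinghub import ScrapinghubClient

    client = ScrapinghubClient("<API_KEY>")   # placeholder API key
    project = client.get_project(12345)       # placeholder project ID
    run_spiders(project, ["spider_one", "spider_two"])  # hypothetical spider names
```

Each call to `jobs.run()` queues an independent job, which is what lets several spiders execute at the same time.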
Crawlera is the weapon for breaking through a website's anti-crawling technology. Crawlera maintains a pool of IP addresses from different countries; if we enable Crawlera, the IP address hitting the server is different every time, drawn from that pool.
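In a Scrapy project, enabling Crawlera is a matter of a few settings, assuming the scrapy-crawlera middleware is installed (`pip install scrapy-crawlera`). A minimal `settings.py` fragment, with a placeholder API key:

```python
# settings.py -- minimal sketch of enabling Crawlera in a Scrapy project.
DOWNLOADER_MIDDLEWARES = {
    # Priority 610 is the value suggested by the scrapy-crawlera docs.
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your-api-key>'  # placeholder, not a real key
```

With these settings in place, every outgoing request from the spider is routed through Crawlera's IP pool without any change to the spider code itself.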
Another problem that can occur is getting banned. Crawlera maintains a ban-detection database listing known ban types, captchas, and status codes, so it can recognise a banned request, for example one redirected to a captcha page or answered with a non-200 response. It automatically retries such requests, preventing bans and keeping the crawl running smoothly.
Crawlera manages cookies for our requests, retaining them for up to 15 minutes. Consecutive requests may carry different cookies, because Crawlera keeps a separate cookie group per outgoing node. It works as a general-purpose proxy, so any HTTP client can use it to crawl sites without getting blocked.
With Crawlera, we can download web pages safely, without having to worry about blocks while crawling. Everything is available through a simple HTTP API: we configure it in our crawler of choice and start crawling.
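Because the HTTP API is just an authenticated proxy, any client can use it. Below is a minimal sketch using the requests library against Crawlera's documented proxy endpoint (`proxy.crawlera.com:8010`); the API key is a placeholder, and `crawlera_proxies` is a small helper of my own, not part of any library.

```python
def crawlera_proxies(api_key):
    """Build a requests-style proxies dict; the API key is the proxy username."""
    proxy = "http://{}:@proxy.crawlera.com:8010/".format(api_key)
    return {"http": proxy, "https": proxy}

if __name__ == "__main__":
    # Requires: pip install requests
    import requests

    proxies = crawlera_proxies("<API_KEY>")  # placeholder key
    response = requests.get("http://example.com", proxies=proxies)
    print(response.status_code)
```

For HTTPS targets, Crawlera terminates and re-issues the TLS connection, so the client must either trust Crawlera's CA certificate or relax certificate verification, as described in the Crawlera documentation.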
I get great help and support from the Scrapy organization around python-scrapy, which I use to build my spiders, and Scrapinghub's services make my process easy. I dedicate this blog to the Scrapinghub team, as a token of thanks for the great help, support, and guidance they have given me throughout.