
A peep into crawling


Heera Hariharan, Software Engineer

A web crawler (also known as a web spider or web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner. This process is called web crawling or spidering.

Comparing a web crawler to a real spider's web, it is fair to say that both are elegant traps. The only difference is that a web crawler collects data from sites, while the spider's web, obviously, is for getting the juice out of its prey.

Many legitimate sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.

 

A great example - Googlebot, the Google crawler


 

Google manages to find and organize information spread across billions and billions of web pages (no wonder Google is everyone's favourite search engine). To do this, the search engine deploys software typically referred to as a spider (or a crawler, or a bot). By using spiders to crawl web pages, search engines are able to identify keywords, feed them into their indexes, rank them, and create links back to the web pages where they were found.

 

Scrapy 


Scrapy is an application framework for crawling web sites and extracting structured data, which can be used for a wide range of useful applications such as data mining, information processing, or historical archival. It is open source and written in Python, and it provides the spider model that controls the crawling action.

Scrapy is controlled through command-line tools, which are used to create projects and trigger the spiders written in Python.

 

Scrapy installation

Scrapy requirements:

  • Python 2.7
  • Works on Linux, Windows, Mac OSX, BSD

There are different ways to install Scrapy:

  1. pip install Scrapy (make sure Python is installed on your system)
  2. Download and install an official release.
  3. Install with easy_install.
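
Whichever of these methods you use, a quick way to confirm the installation (assuming you went with pip) is to check the reported version from the terminal:

  pip install Scrapy
  scrapy version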

To create a crawling project, run the following in a terminal:

scrapy startproject <project_name>
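
For example, assuming a hypothetical project name of myproject, the command generates a project skeleton roughly like this (the exact files may vary slightly between Scrapy versions):

  myproject/
      scrapy.cfg          # deploy/configuration file
      myproject/          # the project's Python module
          __init__.py
          items.py        # definitions of the structured data (items) to extract
          pipelines.py    # post-processing of scraped items
          settings.py     # project settings
          spiders/        # your spider classes go here
              __init__.py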

 

Spider

Spiders are the heart of crawling. They are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).


BaseSpider is the basic crawling class. To create a spider, you must subclass scrapy.spider.BaseSpider and define the mandatory attributes that Scrapy needs to perform the crawl.
To run the spider: scrapy crawl <name>

(name is a mandatory attribute of the spider)

To get started, you could try a small example spider such as the one sketched below.
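
This is a minimal sketch using the old-style BaseSpider API mentioned above (newer Scrapy releases rename it to scrapy.Spider and use response.xpath for selection); the domain, URLs, and XPath here are just placeholders:

  from scrapy.spider import BaseSpider
  from scrapy.selector import HtmlXPathSelector

  class ExampleSpider(BaseSpider):
      # "name" is the mandatory attribute used by "scrapy crawl example"
      name = "example"
      allowed_domains = ["example.com"]
      start_urls = ["http://www.example.com/"]

      def parse(self, response):
          # parse() receives the downloaded response for each start URL
          hxs = HtmlXPathSelector(response)
          for title in hxs.select("//title/text()").extract():
              self.log("Page title: %s" % title)

Running scrapy crawl example from inside the project directory triggers this spider.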