Web crawling is the process of extracting the data you need from websites. It looks simple when everything you want is already present in the page source, but it becomes more challenging when most of the content cannot be fetched from the HTML code alone.

Static web content generally refers to web content that is fixed and not capable of action or change. A static website can only supply information that is written into the HTML source code, and this information will not change unless the change is written into the source code. When a web browser requests a static web page, the server returns the page to the browser and the user only gets whatever information is contained in that HTML. In contrast, a dynamic web page contains dynamically generated content that the server returns based on the user's request: the user asks for information, which is retrieved from a database according to the input parameters.

 

You might have noticed that some sites do not look the same as what the browser's Inspect Element shows. In such cases most of the content is loaded through AJAX calls or JavaScript. Crawling such content with Scrapy alone is difficult, since the response we get is just the raw page source. This content can be fetched in a few different ways.

    

In most cases the data comes back as JSON or XML. Open the Network tab in the browser's developer tools to see all the information about the requests and responses, and filter it down to XHR to isolate the requests made by the JavaScript code. Once you know the endpoint, you can request it directly, as in the sketch below.
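For example, a minimal Scrapy spider could hit such a JSON endpoint directly instead of the HTML page. The URL and field names below are hypothetical placeholders; use whatever endpoint the Network tab reveals for your target site.

import json

import scrapy


class ProductApiSpider(scrapy.Spider):
    # Hypothetical sketch: request the JSON endpoint found under the XHR
    # filter of the Network tab instead of the rendered HTML page.
    name = "product_api"
    start_urls = ["https://www.example.com/api/products?page=1"]  # assumed endpoint

    def parse(self, response):
        # The body is JSON, so parse it directly rather than using selectors.
        data = json.loads(response.text)
        for item in data.get("products", []):  # assumed key names
            yield {"name": item.get("name"), "price": item.get("price")}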

 

In other cases the request may use the POST method, and you can see that a single HTTP request is responsible for the response body. To get the correct response in this situation, we have to supply the required headers and form data, along with cookies if the request URL needs them. Supplying the right parameters and method returns the proper response, which may be JSON data or HTML code.
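A minimal sketch of replaying such a POST request with Scrapy's FormRequest follows; the URL, form fields, headers and cookie values are assumptions standing in for whatever the Network tab shows for the real site.

import scrapy


class SearchSpider(scrapy.Spider):
    # Hypothetical sketch: replay a POST request observed in the Network tab,
    # supplying the same form data, headers and (if required) cookies.
    name = "search"

    def start_requests(self):
        yield scrapy.FormRequest(
            url="https://www.example.com/search",        # assumed endpoint
            formdata={"query": "laptops", "page": "1"},  # assumed form fields
            headers={"X-Requested-With": "XMLHttpRequest"},
            cookies={"sessionid": "replace-if-needed"},  # only if the URL requires it
            callback=self.parse,
        )

    def parse(self, response):
        # Depending on the site, the body may be JSON data or HTML code.
        self.logger.info("Received %d bytes", len(response.body))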


For dynamic pages whose content is generated by AJAX or JavaScript, Scrapy can be combined with the web testing framework Selenium to crawl anything that is displayed in a normal web browser, as in the sketch after the points below.

 

Points to Remember:

  • You must have the Python version of Selenium RC installed and set up properly. When a URL is crawled this way, two requests are placed: one by Scrapy and one by Selenium.
  • This is quite handy when you want the whole page exactly as you would see it in a normal browser, but it may slow down the crawling process because a whole bunch of data comes back in each response.
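Below is a minimal sketch of this idea using the Selenium WebDriver Python bindings inside a Scrapy spider; the URL and CSS selectors are placeholders, and it assumes a browser driver such as geckodriver is installed.

import scrapy
from scrapy.selector import Selector
from selenium import webdriver


class RenderedPageSpider(scrapy.Spider):
    # Hypothetical sketch: Scrapy schedules the request, Selenium loads the
    # same URL in a real browser so the JavaScript content gets rendered, and
    # the rendered source is parsed with Scrapy selectors.
    name = "rendered_page"
    start_urls = ["https://www.example.com/dynamic-page"]  # assumed URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()  # requires geckodriver on PATH

    def parse(self, response):
        # This is the second request for the same URL, made by the browser.
        self.driver.get(response.url)
        rendered = Selector(text=self.driver.page_source)
        for title in rendered.css("div.item h2::text").getall():  # assumed selectors
            yield {"title": title}

    def closed(self, reason):
        self.driver.quit()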

Every scraping process is unique. There are various approaches to getting responses via Scrapy, and the above are a few of them. Scrapy makes it easier to build and scale large crawling projects by allowing developers to reuse their code, and there are even more methods that can be incorporated along with Scrapy to feel its magic.

Scrapy is the most popular web scraping framework for Python and it makes writing spiders a lot easier. With a couple of commands you can create a new spider and begin adding logic to extract data from the response.
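As an illustration, running the commands "scrapy startproject demo" and "scrapy genspider quotes quotes.toscrape.com" (a public practice site) generates a spider skeleton that can be filled in roughly like this:

import scrapy


class QuotesSpider(scrapy.Spider):
    # Skeleton produced by scrapy genspider, with a simple parse method added
    # to pull each quote and its author off the page.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

Running "scrapy crawl quotes -o quotes.json" then writes the extracted items to a file.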

 

 
