Advanced Crawling Techniques

Arun Augustine, Software Engineer

Web crawling has grown from an emerging technology into an important part of many businesses. The first crawlers were developed for a much smaller web, but today some popular sites alone have millions of pages. Crawling at this scale combines several techniques, each suited to a different goal. These techniques are described in turn below.


Selective Crawling        

Selective crawling retrieves web pages according to some criteria. A scoring function estimates how relevant the content of each page is, and the fetched URLs are sorted by that relevance score. Best-first search is then used to visit the highest-scoring pages first, so the crawl reaches the most relevant pages early.

Some examples of scoring functions and their criteria are given below.


         Length of the path from the site’s homepage to the document

         Limit total number of levels retrieved from a site

         Maximize coverage breadth.


         Assign relevance according to which pages are more important than others

         Estimate the number of backlinks.
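As a rough illustration of how a scoring function and best-first search fit together, here is a minimal Python sketch. It is not a production crawler: the `LINKS` graph, the function names, and the `popularity_score` weighting are all hypothetical, and the backlink counts are computed from the toy graph up front, whereas a real crawler could only estimate them from the pages it has seen so far.

```python
import heapq

# Hypothetical in-memory link graph standing in for a site: page -> outgoing links.
LINKS = {
    "/": ["/products", "/about"],
    "/products": ["/products/a", "/products/b", "/about"],
    "/about": ["/team"],
    "/products/a": ["/about"],
    "/products/b": [],
    "/team": ["/about"],
}


def backlink_counts(links):
    """Count how many pages link to each page (a simple popularity estimate)."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def best_first_crawl(seed, links, score, max_depth=2):
    """Visit pages in descending score order, never expanding past max_depth."""
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-score(seed, 0), seed, 0)]
    seen = {seed}
    order = []
    while frontier:
        _, url, depth = heapq.heappop(frontier)
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow links from this page
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(
                    frontier, (-score(target, depth + 1), target, depth + 1)
                )
    return order


BACKLINKS = backlink_counts(LINKS)


def popularity_score(url, depth):
    """Assumed weighting: more backlinks score higher, deeper pages slightly lower."""
    return BACKLINKS.get(url, 0) - 0.1 * depth


crawl_order = best_first_crawl("/", LINKS, popularity_score, max_depth=2)
print(crawl_order)
```

The `max_depth` argument plays the role of the depth criteria (path length from the homepage, limit on levels retrieved), while `popularity_score` plays the role of the backlink-based criteria. Swapping in a different `score` function changes the crawl order without touching the search loop.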


Focused Crawling

Focused crawling fetches pages that fall within a certain topic. The crawler downloads pages that are related to each other, collects documents that are specific and relevant to the given topic, and classifies the crawled pages into categories. For each page, it estimates how relevant the page is to the topic and decides how to proceed from there.

A focused crawler is also known as a topic crawler because of the way it works. It is economical in terms of hardware and network resources, and it can reduce the amount of network traffic and downloads. Within its topic, the search coverage of a focused web crawler is relatively large.
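One simple way to estimate how relevant a page is to a topic is to compare word counts. The sketch below uses cosine similarity between bag-of-words vectors; this is only one possible relevance measure (real focused crawlers often use trained classifiers), and the function names, sample texts, and the `threshold` value are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-cased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance(page_text, topic_text):
    """Relevance of a page to a topic description, between 0.0 and 1.0."""
    return cosine_similarity(term_vector(page_text), term_vector(topic_text))


def should_follow_links(page_text, topic_text, threshold=0.2):
    """Assumed policy: only expand pages whose relevance reaches the threshold."""
    return relevance(page_text, topic_text) >= threshold


# Hypothetical topic description and page texts.
TOPIC = "web crawler search engine index pages"
ON_TOPIC = "A web crawler downloads pages so a search engine can index them"
OFF_TOPIC = "Simmer the tomato sauce and season with basil"

print(relevance(ON_TOPIC, TOPIC), should_follow_links(ON_TOPIC, TOPIC))
print(relevance(OFF_TOPIC, TOPIC), should_follow_links(OFF_TOPIC, TOPIC))
```

A crawler built this way would skip the links on pages that fall below the threshold, which is where the savings in traffic and downloads come from.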


Distributed Crawling

Distributed crawling partitions the crawling task across multiple nodes, much like other distributed computing techniques. Because the nodes are geographically distributed, a central server manages their communication and synchronization. It basically uses the PageRank algorithm to improve efficiency and search quality. A distributed web crawler is robust against system crashes and other events, it is more scalable and memory efficient, and it increases overall download speed and reliability.
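How the task is partitioned is a design choice the text above leaves open. One common option, shown here purely as an assumption, is to hash each URL's host name so that every page of a site is handled by the same node. The function names and example URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def assign_node(url, num_nodes):
    """Map a URL to a node index by hashing its host name.

    hashlib is used instead of the built-in hash() so that every process
    and every machine computes the same assignment.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition_urls(urls, num_nodes):
    """Group URLs into one work list per node."""
    work = {node: [] for node in range(num_nodes)}
    for url in urls:
        work[assign_node(url, num_nodes)].append(url)
    return work


# Hypothetical seed URLs.
SEEDS = [
    "https://example.com/",
    "https://example.com/news",
    "https://example.org/docs",
    "https://example.net/blog",
]

for node, node_urls in partition_urls(SEEDS, 3).items():
    print(node, node_urls)
```

Keeping a whole host on one node makes it easier to apply per-site politeness limits in one place. The trade-off is that a simple modulo scheme reshuffles most hosts when the number of nodes changes.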


Incremental Crawling

Incremental crawling is the traditional process of prioritizing and revisiting URLs in order to refresh a collection, periodically replacing old documents with newly downloaded ones. The incremental crawler refreshes the existing collection of pages by revisiting them frequently, based on an estimate of how often each page changes.

It also replaces less important pages with new and more important ones. This resolves the problem of content consistency. The benefit of an incremental crawler is that only valuable data is provided to the user, so network bandwidth is saved and data enrichment is achieved.
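The idea of revisiting pages according to how often they change can be sketched as follows. This is a simplified illustration, not a definitive scheduler: the change-rate estimate is the naive "changes seen divided by days observed", which tends to undercount pages that change more than once between visits, and the function names, history data, and the 1 to 30 day bounds are assumptions.

```python
import heapq


def estimated_change_rate(changes_seen, days_observed):
    """Naive estimate of changes per day from past visits."""
    if days_observed <= 0:
        raise ValueError("days_observed must be positive")
    return changes_seen / days_observed


def revisit_interval(rate, min_days=1.0, max_days=30.0):
    """Revisit roughly once per expected change, clamped to assumed bounds."""
    if rate <= 0:
        return max_days
    return min(max_days, max(min_days, 1.0 / rate))


def build_schedule(history, today=0.0):
    """Build a min-heap of (due_day, url) so the most urgent page pops first.

    history maps url -> (changes_seen, days_observed).
    """
    schedule = []
    for url, (changes, days) in history.items():
        interval = revisit_interval(estimated_change_rate(changes, days))
        heapq.heappush(schedule, (today + interval, url))
    return schedule


# Hypothetical visit history.
HISTORY = {
    "/news": (30, 30),   # changed on every daily visit
    "/blog": (4, 28),    # changed about once a week
    "/about": (1, 60),   # almost never changes
}

schedule = build_schedule(HISTORY)
while schedule:
    due_day, url = heapq.heappop(schedule)
    print(f"revisit {url} on day {due_day:.1f}")
```

Pages that change often are revisited sooner, and the upper bound keeps rarely changing pages from being forgotten entirely.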


Parallel Crawling

Parallel crawling runs multiple crawling processes in parallel. These processes are called C-procs, and they can run on a network of workstations. Parallel crawlers depend on page freshness and page selection. A parallel crawler can run on a local network or be distributed across geographically distant locations. Parallelizing the crawling system is vital for downloading documents in a reasonable amount of time.
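The sketch below shows the general shape of a parallel crawl on a single machine. Worker threads stand in for C-procs here, which is a simplification, and the `FAKE_WEB` dictionary replaces real HTTP requests so the example runs offline. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the web: url -> page body.
FAKE_WEB = {
    "https://example.com/a": "page a",
    "https://example.com/b": "page b",
    "https://example.com/c": "page c",
    "https://example.com/d": "page d",
    "https://example.com/e": "page e",
}


def fetch(url):
    """Pretend download; a real crawler would issue an HTTP request here."""
    return FAKE_WEB.get(url, "")


def partition(urls, num_procs):
    """Static round-robin assignment so no URL is given to two workers."""
    return [urls[i::num_procs] for i in range(num_procs)]


def crawl_partition(urls):
    """Work done by one worker: download every URL in its own partition."""
    return {url: fetch(url) for url in urls}


def parallel_crawl(urls, num_procs=3):
    """Run one worker per partition and merge their results."""
    results = {}
    with ThreadPoolExecutor(max_workers=num_procs) as pool:
        for partial in pool.map(crawl_partition, partition(urls, num_procs)):
            results.update(partial)
    return results


pages = parallel_crawl(list(FAKE_WEB))
print(len(pages), "pages downloaded")
```

Assigning each worker a disjoint partition up front avoids duplicate downloads without any coordination between workers. Workers that discover new links would need some way to hand off URLs outside their partition, which this sketch does not cover.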


Web Dynamics

Web dynamics is the rate of change of information on the Web. Search engines mainly use it to decide how to update their index. The index entry for a certain document, indexed at time t₀, is said to be β-current at time t if the document has not changed in the time interval between t₀ and t − β. Basically, β is a ‘grace period’: if we pretend that the user query was made β time units ago rather than now, then the information in the search engine would be up to date. A search engine for a given collection of documents is said to be (α, β)-current if the probability that a document is β-current is at least α.

According to this definition, we can ask interesting questions like ‘how many documents per day should a search engine refresh in order to guarantee it will remain (0.9, 1 week)-current?’ Answering this question requires that we develop a probabilistic model of Web dynamics.
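To make the question concrete, here is a sketch under one simple model. The model is an assumption, not something stated above: every document changes according to a Poisson process with the same rate λ, each document is re-indexed every T time units, and queries arrive at a uniformly random point in that cycle. Under these assumptions, the probability that a document is β-current is 1 when T ≤ β, and otherwise

    P = [β + (1 − e^(−λ(T − β))) / λ] / T

The code searches for the longest refresh period T that still keeps P at or above α. The change rate and collection size in the example are hypothetical.

```python
import math


def prob_beta_current(change_rate, refresh_period, beta):
    """P(document is beta-current) under the assumed Poisson change model."""
    if refresh_period <= beta or change_rate <= 0:
        return 1.0  # entries are never older than the grace period, or never change
    lam, T = change_rate, refresh_period
    return (beta + (1.0 - math.exp(-lam * (T - beta))) / lam) / T


def max_refresh_period(change_rate, alpha, beta, upper=1e6):
    """Largest T with P >= alpha, found by bisection.

    Bisection is valid because P decreases as T grows beyond beta.
    """
    if prob_beta_current(change_rate, upper, beta) >= alpha:
        return upper
    lo, hi = beta, upper
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if prob_beta_current(change_rate, mid, beta) >= alpha:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical inputs: pages change on average every 50 days,
# and the collection holds one million documents.
CHANGE_RATE = 1.0 / 50.0   # changes per day (assumed)
NUM_DOCUMENTS = 1_000_000  # collection size (assumed)
ALPHA, BETA = 0.9, 7.0     # (0.9, 1 week)-current

period = max_refresh_period(CHANGE_RATE, ALPHA, BETA)
print(f"refresh every {period:.1f} days")
print(f"about {NUM_DOCUMENTS / period:.0f} documents per day")
```

Real collections mix pages with very different change rates, so a single λ is only a first approximation. A more realistic answer would average P over a distribution of change rates.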