Scraping data from websites is no longer as complicated as it used to be, and scraped data is used for many purposes. Many tools are available for the job. In some extreme cases, we need to extract text from an image itself. Of the tools available on the internet, Scrapy combined with PyTesseract is one of the best combinations to work with.

Using Scrapy, we can fetch images from the internet and pass them as input to PyTesseract. PyTesseract returns the text content of each image as a string, which we can then use directly.

 

Usage of Scrapy:

Scrapy is an open source web crawling framework, designed for web scraping. It crawls websites and extracts structured data from their pages. We can write a spider for getting images from the website.

If we need to fetch data from websites whose pages are rendered with JavaScript templates or loaded dynamically, we can use other packages alongside Scrapy, such as:

  • ScrapyJS (since renamed scrapy-splash) -  A library that integrates Scrapy with JavaScript rendering using Splash.
  • Selenium -  A WebDriver that drives a real browser.
  • Beautiful Soup -  A Python library for pulling data out of HTML and XML files. It parses markup but does not execute JavaScript itself.

 

Usage of PyTesseract:

PyTesseract is an optical character recognition (OCR) module for Python. It is a wrapper around the Tesseract OCR engine, which must be installed separately. It takes an image or an image file as input and outputs a string.

# Install pytesseract
pip install pytesseract

Pillow is a Python imaging library. We can use Pillow to open an image and pass it to pytesseract.

# Install Pillow
pip install Pillow

 

Example code to extract text from an image:

from PIL import Image
import pytesseract

# Open the image at the given path (image_file_name must be defined beforehand).
image = Image.open(image_file_name)
# Convert the image to a string.
text_data = pytesseract.image_to_string(image)
# Write the string data to a text file (text_file_path must be defined beforehand).
with open(text_file_path, "w") as text_file:
    text_file.write(text_data)

 

Example text fetched from image:

Scrapy

- Fast, simple and extensible Web scraping

framework in Python

Currently compatible only with Python 27

In-progress Python 3 support

Maintained by Scrapi nghub

BSD License

MW
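
Output like the sample above can carry empty lines and stray whitespace. If that gets in the way, a small clean-up step before saving is one option. The sketch below uses only the standard library; clean_ocr_text is a hypothetical helper name, not part of pytesseract.

```python
def clean_ocr_text(raw_text):
    """Return the OCR output with blank lines and edge whitespace removed."""
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in raw_text.splitlines())
    # Keep only the lines that still contain text.
    return "\n".join(line for line in lines if line)


sample = "Scrapy\n\n- Fast, simple and extensible Web scraping\n\nframework in Python\n"
print(clean_ocr_text(sample))
```

This only tidies the layout. It does not correct misrecognised characters such as the "Python 27" and "Scrapi nghub" seen in the sample.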
