SCRAPY - THE DATA COLLECTOR

Vaishnavi V
Jul 30, 2020

Data analysis needs a lot of data, but collecting it manually and typing it into Excel every time is difficult.

Scrapy is a free and open-source web crawling framework written in Python. As the name suggests, it is used for scraping data. It can also be used to extract data using APIs or as a general-purpose web crawler.

A Scrapy project is built around “spiders”, which are self-contained crawlers that are given a set of instructions. Scrapy also provides a web-crawling shell that developers can use to test their assumptions about a site’s behaviour.

Installing Scrapy on Ubuntu

If you’re already familiar with installation of Python packages, you can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

Creating projects

The first step in creating your project is:

scrapy startproject myproject [project_dir]

That will create a Scrapy project under the project_dir directory. If project_dir is not specified, it will be the same as myproject.

Controlling projects

Use the scrapy tool from inside your project to control and manage it. Let’s create a new spider to crawl a site:

scrapy genspider mywebsite mywebsite.com

Advantage of Scrapy over Beautiful Soup

Scrapy is a tool specifically created for downloading, cleaning and saving data from the web, and it will help you end-to-end, whereas Beautiful Soup is a smaller package which will only help you get information out of webpages.

Is Scrapy faster than Selenium?

Scrapy only downloads the URLs that the spider requests and does not render the page, whereas Selenium controls a real browser that loads all the JS, CSS and image files needed to render it. That is why Selenium is much slower than Scrapy when crawling.

Using Scrapy, I collected data from the Indeed website. I suggest using Scrapy whenever you want to collect a large amount of data from a website for analysis.

Scrape yourself daily to learn new tools!!!
