SCRAPY - THE DATA COLLECTOR
We need a lot of data for analysis, but collecting it manually and typing it into Excel every time is difficult.
Scrapy is a free and open-source web crawling framework written in Python. As the name suggests, it is used for scraping data. It can extract data using APIs or serve as a general-purpose web crawler.
Scrapy's project architecture is built around "spiders", which are self-contained crawlers that are given a set of instructions. It also provides a web-crawling shell which developers can use to test their assumptions about a site's behaviour.
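For example, the shell can be opened against quotes.toscrape.com, the demo site used in the official Scrapy tutorial (chosen here only as an illustration), to try out selectors interactively before writing them into a spider:

scrapy shell "https://quotes.toscrape.com"
>>> response.status                                        # HTTP status of the fetched page
>>> response.css("title::text").get()                      # extract the page title
>>> response.css("div.quote span.text::text").getall()     # all quote texts on the page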
Installation of Scrapy in Ubuntu
If you’re already familiar with installation of Python packages, you can install Scrapy and its dependencies from PyPI with:
pip install Scrapy
Creating projects
The first step is to create your project:
scrapy startproject myproject [project_dir]
That will create a Scrapy project under the project_dir directory. If project_dir wasn't specified, project_dir will be the same as myproject.
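The generated project follows Scrapy's standard layout, roughly:

myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # folder where your spiders live
            __init__.py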
Controlling projects
Use the scrapy tool from inside your projects to control and manage them. Let's create a new spider to crawl:
scrapy genspider mywebsite mywebsite.com
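genspider generates a skeleton spider named mywebsite inside the spiders/ folder. Below is a sketch of how that skeleton is typically filled in; the CSS selectors and the field names (title, link) are placeholders you would replace with ones matching the actual site:

import scrapy


class MywebsiteSpider(scrapy.Spider):
    name = "mywebsite"
    allowed_domains = ["mywebsite.com"]
    start_urls = ["https://mywebsite.com/"]

    def parse(self, response):
        # Each yielded dict becomes one scraped item.
        for row in response.css("div.listing"):          # placeholder selector
            yield {
                "title": row.css("h2::text").get(),
                "link": row.css("a::attr(href)").get(),
            }
        # Follow pagination, if the site has a "next" link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run it with scrapy crawl mywebsite -o items.json (or -o items.csv) to save the scraped items straight to a file.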
Advantage of Scrapy over Beautiful Soup
Scrapy is a tool created specifically for downloading, cleaning and saving data from the web, and it helps you end-to-end, whereas BeautifulSoup is a smaller package which only helps you get information out of webpages.
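To make the contrast concrete, here is roughly what the BeautifulSoup side looks like: you have to pair it with something like requests to download pages, and crawling, retries and exporting are all left to you (the URL and selector below are just placeholders):

import requests
from bs4 import BeautifulSoup

# BeautifulSoup only parses HTML that you have already downloaded yourself.
html = requests.get("https://example.com/page").text
soup = BeautifulSoup(html, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.select("h2")]
# Following links, throttling, retries and saving to JSON/CSV are all on you.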
Is Scrapy faster than Selenium?
Scrapy only visits the URLs we tell it to, whereas Selenium controls a real browser that also loads every JS, CSS and image file needed to render the page. That is why Selenium is much slower than Scrapy when crawling.
Using Scrapy, I collected data from the Indeed website. I also suggest using Scrapy whenever you want to collect a large amount of data from a website for analysis.