Python is known for its versatility. One thing it is particularly well known for is its ability to retrieve data over the Web using the requests package (see the requests documentation for more information).
The requests package enables users to perform basic HTTP requests using methods such as GET and POST. These methods can be used for sending and receiving data over the Web and for accessing various resources (e.g. JSON, CSV and HTML files).
requests also supports adding parameters and header information to a request, which makes it possible to simulate the behaviour of a web browser such as Chrome or Firefox. In most cases, requests is used to retrieve the raw HTML document generated by a web server which, in turn, can be parsed to extract or "scrape" the relevant information. The combination of requesting and parsing an HTML document from a web server produces a web scraper.
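As a rough sketch of how parameters and headers are attached to a request, the snippet below prepares a GET request without sending it, so the final URL and headers can be inspected. The URL, query parameters and User-Agent string are hypothetical values chosen for illustration only.

```python
import requests

# Hypothetical endpoint, query parameters and headers (illustration only).
URL = "https://example.com/search"
PARAMS = {"q": "web scraping", "page": 1}
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def build_request(url, params, headers):
    """Prepare a GET request without sending it, so it can be inspected."""
    return requests.Request("GET", url, params=params, headers=headers).prepare()


def fetch_html(url, params, headers, timeout=10):
    """Send the GET request and return the raw HTML document as text."""
    response = requests.get(url, params=params, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


# Inspect what would be sent; nothing goes over the network here.
prepared = build_request(URL, PARAMS, HEADERS)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Calling fetch_html(URL, PARAMS, HEADERS) would actually send the request and return the HTML for parsing.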
While the requests package can do many things, it can't parse HTML documents. For this reason, a separate HTML parser is needed. This blog post focuses on four of the most widely-used and well-known Python packages for parsing HTML documents which are essential to your web scraping toolbox in 2022.
Ideal for simple tasks
BeautifulSoup is the go-to package for many Python web scrapers. It's easy to install and get going with, and it integrates well with requests. The BeautifulSoup website says that "it commonly saves programmers hours or days of work", which is certainly the case when it comes to web scraping. It also supports several different parsers (including Python's built-in html.parser and lxml).
BeautifulSoup can be installed by running the following command:
$ pip install beautifulsoup4
Once installed, BeautifulSoup can be used to start scraping. In this case, the following code is used to extract all hyperlinks (anchor tags) from a website.
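import requests
from bs4 import BeautifulSoup

URL = 'https://dataground.io'

# Download the raw HTML document and parse it
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html.parser')

# Keep only anchor tags that carry an href attribute (avoids a KeyError)
links = soup.find_all('a', href=True)
links = [l['href'] for l in links]
print(links)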
Ideal for large websites
Unlike BeautifulSoup, Scrapy is an entire web scraping framework designed for scraping data at scale as opposed to simple tasks like the one shown above. Scrapy provides a lot more in terms of functionality by comparison.
The Scrapy framework allows you to scrape data through the use of "web spiders": small scripts designed to collect data and follow hyperlinks as and when they are discovered on the page. This is ideal for websites and blogs which use pagination.
In addition to this, Scrapy web spiders can be deployed to Zyte Scrapy Cloud, a service which allows you to upload and run your web scrapers in the cloud, meaning that you can get on with other tasks while your websites are being scraped.
Ideal for tabular data
As seen in a separate post, pandas can be used to retrieve CSV, HTML and JSON files from a remote source. Although pandas isn't strictly considered a web scraper, it does have web scraping capabilities in the form of the read_html() function.
The read_html() function allows you to scrape data stored in an HTML table on a website and returns each table it finds as a DataFrame. This is ideal if you're working with tabulated data as it saves you the trouble of formatting the data later on.
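The following sketch shows read_html() turning an HTML table into a DataFrame. To keep it self-contained, it parses an inline HTML string with made-up products and prices rather than a live page, and it assumes a parser backend such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Made-up table contents, used for illustration only.
HTML = """
<table>
  <thead><tr><th>Product</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>Keyboard</td><td>49.99</td></tr>
    <tr><td>Mouse</td><td>19.5</td></tr>
  </tbody>
</table>
"""

# read_html() returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(HTML))
df = tables[0]

print(df)
print(df.dtypes)  # the Price column is parsed as a numeric type
```

read_html() also accepts a URL in place of the StringIO object, in which case pandas downloads the page before extracting its tables.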
Ideal for dynamic websites
Up to this point, HTTP requests have been made without a web browser, meaning that the raw HTML document is collected without being rendered and any content generated by JavaScript is missed. This is where Selenium comes in.
Selenium is designed to control a web browser by interacting with it via a driver, using Python as the operator.
As a result of this approach, Python code can be written to retrieve data, using XPath expressions, as it is displayed in the web browser. This makes it possible to scrape data from dynamically produced content.
In summary, this blog post covers four essential Python packages for scraping data off the modern web. By combining the tools mentioned in this blog post, you can build your web scraper around what you would like to achieve.
For relatively simple and easy tasks, I would highly recommend using BeautifulSoup as it is easy to configure with very few lines of code. If you are scraping data across multiple pages on a website (e.g. a blog or an online shop), Scrapy would be your best option as it scales better and is well-suited for complex tasks and large websites. If you are scraping tabular data, use the