Essential Python Packages for the Web Scraping Toolbox in 2022

Photo by Artturi Jalli / Unsplash

Python is known for its versatility. One thing it is particularly well known for is retrieving data over the Web using the requests package.

The requests package enables users to perform basic HTTP requests using methods such as GET, POST, etc. These methods can be used for sending and receiving data over the Web and for accessing various resources (e.g. JSON, CSV, HTML files).

Furthermore, requests supports query parameters and custom headers, which can be used to mimic the requests sent by a web browser such as Chrome or Firefox. In most cases, requests is used to retrieve the raw HTML document generated by a web server, which can then be parsed to extract or "scrape" the relevant information. Requesting an HTML document from a web server and then parsing it is what makes a web scraper.
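As a minimal sketch of adding parameters and headers, the snippet below builds a GET request and inspects it without sending anything over the network. The URL, the query parameters and the User-Agent string are made-up values for illustration only.

```python
import requests

# Hypothetical endpoint and values, used only for illustration.
URL = "https://example.com/search"
params = {"q": "python", "page": 2}
headers = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}

# Prepare the request without sending it, to see what would go over the wire.
prepared = requests.Request("GET", URL, params=params, headers=headers).prepare()

print(prepared.url)                    # https://example.com/search?q=python&page=2
print(prepared.headers["User-Agent"])  # Mozilla/5.0 (compatible; example-scraper/0.1)

# To actually send it, pass the same arguments to requests.get():
# response = requests.get(URL, params=params, headers=headers, timeout=10)
```

Preparing the request first is just a convenient way to check the final URL and headers; in day-to-day scraping you would normally call requests.get() directly.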

While the requests package can do many things, it can't parse HTML documents. For this reason, a separate HTML parser is needed. This blog post covers four of the most widely used Python packages for extracting data from HTML documents, all of which are essential to your web scraping toolbox in 2022.

BeautifulSoup

Ideal for simple tasks

BeautifulSoup is the go-to package for many Python web scrapers. It is easy to install and get started with, and it integrates well with requests. The BeautifulSoup website says that "it commonly saves programmers hours or days of work", which is certainly the case when it comes to web scraping. It also supports several underlying parsers, including Python's built-in html.parser and lxml.

BeautifulSoup can be installed by running the following command:

$ pip install beautifulsoup4

Once installed, BeautifulSoup can be used to start scraping. In this case, the following code extracts all hyperlinks (anchor tags) from a website.

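import requests
from bs4 import BeautifulSoup

URL = 'https://dataground.io'

# Download the raw HTML document and stop early on an HTTP error
r = requests.get(URL, timeout=10)
r.raise_for_status()

# Parse the document with Python's built-in HTML parser
soup = BeautifulSoup(r.text, 'html.parser')

# Collect the href of every anchor tag, skipping anchors that have none
links = [a['href'] for a in soup.find_all('a', href=True)]

print(links)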
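
Links found on a page are often relative (for example /about), so a common next step is to turn them into absolute URLs. The sketch below does this with urljoin from the standard library. It parses a small inline HTML string rather than a live site, and the base URL, the markup and the extract_links helper are all invented for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and markup, used only for illustration.
BASE_URL = "https://example.com/blog/"
HTML = """
<a href="/about">About</a>
<a href="post-1.html">Post 1</a>
<a href="https://www.python.org/">Python</a>
<a name="top">Anchor without an href</a>
"""


def extract_links(html, base_url):
    """Return every hyperlink in html as an absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only anchors that actually carry an href attribute.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


absolute_links = extract_links(HTML, BASE_URL)
print(absolute_links)
```

With a real page, r.text and the URL that was requested would take the place of HTML and BASE_URL.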


Scrapy


Ideal for large websites

Unlike BeautifulSoup, Scrapy is an entire web scraping framework designed for scraping data at scale, as opposed to simple tasks like the one shown above. On top of parsing, it takes care of scheduling requests, following links and exporting the scraped data.

The Scrapy framework allows you to scrape data through the use of "web spiders": small scripts designed to collect data and follow hyperlinks as they are discovered on the page. This is ideal for websites and blogs that use pagination.

In addition to this, Scrapy web spiders can be deployed to Zyte Scrapy Cloud, a service which allows you to upload and run your web scrapers in the cloud. This means you can get on with other tasks while your websites are being scraped.

Pandas

Ideal for tabular data

As seen in a separate post, pandas can be used to retrieve CSV, HTML and JSON files from a remote source. Although pandas isn't strictly a web scraper, it does have web scraping capabilities in the form of the read_html() function.

The read_html() function allows you to scrape data stored in HTML tables on a website, returning each table it finds as a DataFrame. This is ideal if you're working with tabulated data, as it saves you the trouble of formatting the data later on.
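
The following sketch shows read_html() on a small inline table so that it runs without a network connection. The table contents are made up for illustration, and the example assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table markup, used only for illustration.
HTML = """
<table>
  <thead>
    <tr><th>Package</th><th>Ideal for</th></tr>
  </thead>
  <tbody>
    <tr><td>BeautifulSoup</td><td>Simple tasks</td></tr>
    <tr><td>Scrapy</td><td>Large websites</td></tr>
  </tbody>
</table>
"""

# read_html() returns a list with one DataFrame per table found in the document.
tables = pd.read_html(StringIO(HTML))
df = tables[0]

print(df)
```

read_html() also accepts a URL in place of the StringIO object, in which case pandas downloads the page itself.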

Selenium


Ideal for dynamic websites

One of the problems with traditional web scraping is that, due to the way modern web applications are designed, many of the components rendered on the page are produced dynamically by JavaScript executed at runtime.

This can be a challenge, because such data cannot be scraped without a full web browser to render the HTML and execute the JavaScript running in the background. In this case, requests falls short: the request itself succeeds, but it only returns the HTML sent by the server, without any of the content that JavaScript would add afterwards.
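
The problem can be illustrated without a live site. In the sketch below, the inline HTML stands in for what a server might send for a JavaScript-driven page (the markup is invented for illustration). Because nothing executes the script, the container that the script is meant to fill is still empty when the document is parsed.

```python
from bs4 import BeautifulSoup

# Hypothetical response body for a page whose content is filled in by JavaScript.
RAW_HTML = """
<html><body>
  <div id="app"></div>
  <script>
    document.getElementById("app").innerHTML = "<p>Loaded by JavaScript</p>";
  </script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")
container = soup.find(id="app")

# The script was never executed, so the container holds no text and no <p> tag.
print(repr(container.get_text(strip=True)))  # ''
print(container.find("p"))                   # None
```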

This is where Selenium comes in.

Up to this point, HTTP requests have been made without a browser, meaning that the raw HTML document is collected without being rendered. Selenium instead controls a real web browser through a driver, with Python acting as the operator.

As a result of this approach, Python code can be written to retrieve data as it is displayed in the web browser, locating elements with XPath expressions or CSS selectors. This makes it possible to scrape dynamically produced content.

Conclusions

In summary, this blog post covers four essential Python packages for scraping data off the modern web. By combining the tools mentioned here, you can build a web scraper suited to what you would like to achieve.

For relatively simple tasks, I would highly recommend BeautifulSoup, as it is easy to set up and needs very few lines of code. If you are scraping data across multiple pages on a website (e.g. a blog or an online shop), Scrapy would be your best option, as it scales better and is well suited to complex tasks and large websites. If you are scraping tabular data, use the read_html() function built into pandas. And finally, for dynamic websites dependent on JavaScript, use Selenium.