How to Scrape and Extract Hyperlink Networks with BeautifulSoup and NetworkX


If you use the internet often enough (which, if you're reading this blog post, I imagine you do), you will soon realise that hyperlinks are everywhere you go. They are a fundamental component of the World Wide Web. They allow us to navigate between web pages with a series of simple clicks.

If you've been on this blog before, you'll know that hyperlinks and web pages make great networks. The general idea is that the web pages are nodes and the hyperlinks connecting them form the edges.

Here is what a simple hyperlink network may look like. To make things easier, we will only crawl pages on the same website. That means only considering a link if it points to the same domain. Otherwise, we might as well just crawl the entire web ;)

Hyperlinks determine how a user navigates around a website: clicking one takes the user to a specific piece of information. The positioning of hyperlinks is important, as it determines how easily a user can find the material they need. A highly connected hyperlink network therefore suggests that information can be reached in very few clicks.

For example, the use of tags (simple keywords which describe a post) allows users to access relevant blog posts under a specific topic. The same can be said for blog posts that link to other relevant blog posts. This is particularly important for SEO (search engine optimisation) as it allows you to construct a website in such a way that information is easily accessible.

Scraping with Python

Now that we know why hyperlink networks are important, we can begin to extract the data we need using a simple Python script. The basic idea is this...

  1. Create an empty graph
  2. Visit the homepage
  3. Find all HTML a tags and select those that point within the same website
  4. Create an edge between the current page and each selected link
  5. Visit the next link
  6. Repeat...

To get things going, we need to install a few packages that will help us make the HTTP requests and parse the HTML documents. If you haven't done so already, you need to install the following packages:

  • networkx: for modelling the networks (if you don't know this by now ;) )
  • pandas: for importing and exporting data frames
  • requests: for connecting to the website over HTTP
  • BeautifulSoup: for parsing raw HTML documents so we can extract links with ease

$ pip install networkx pandas requests beautifulsoup4

Now that these are installed, we can begin to put some code together. To start, we need to import our packages...

import requests
import networkx as nx
import pandas as pd
from bs4 import BeautifulSoup

We will also need to initialise an empty directed graph. That's quite easy to do...

G = nx.DiGraph()

Now that the packages have been loaded in, we can start collecting links from our website. To do this, we need to collect all a tags on the starting page, i.e. the homepage of our "root domain". This is achieved using BeautifulSoup with the following code...

domain = 'connectingfigures.com'
url = f'https://{domain}/'

req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
links = soup.find_all('a')

Now that we have extracted all of the links from the website, we need to filter the links for the following issues:

  • Check that the link contains an 'href' attribute: In order to extract hyperlinks, the a tag needs a valid 'href', as this attribute stores the URL of the linked page.
  • Remove '#' from the link: The # is often used to jump to a certain part of the page (e.g. the article section). In our case, we strip it and everything after it, keeping only the main link.
  • Check that the link is internal: To ensure that we don't crawl the entire web, we need to make sure that we only collect links that are local to our website of interest.
  • Ensure that the link is not the same as the current page: To keep things simple, we exclude links which point to themselves.

Using these rules leaves us with the following code...

# Find all 'a' tags
links = soup.find_all('a')

# Check for 'href'
links = [ln.get('href') for ln in links if ln.get('href')]

# Remove page jumps
links = [ln.split('#')[0] for ln in links]

# Keep internal sites
links = [ln for ln in links if domain in ln]

# Not the same as the current page
links = [ln for ln in links if ln != url]

# Remove duplicates
links = set(links)
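One caveat with the `domain in ln` check: relative links such as /about/ don't contain the domain, so they get dropped, while an external URL that merely mentions the domain somewhere in its query string slips through. Here is a minimal sketch of a stricter filter using only the standard library's urllib.parse. The helper name `normalise_link` and the example.com URLs are made up for illustration, and the exact host comparison assumes the site doesn't mix "www." and bare-domain links.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalise_link(page_url, href, domain):
    """Return an absolute internal URL without its fragment, or None."""
    # Resolve relative links such as '/about/' against the current page
    absolute = urljoin(page_url, href)
    # Drop the '#fragment' part, if there is one
    absolute, _fragment = urldefrag(absolute)
    parsed = urlparse(absolute)
    # Keep only http(s) links whose host is exactly the target domain
    if parsed.scheme not in ('http', 'https') or parsed.netloc != domain:
        return None
    return absolute


# Hypothetical page and hrefs, purely for illustration
page = 'https://example.com/blog/'
hrefs = [
    '/about/',
    'post-1/#comments',
    'https://other.org/?ref=example.com',
    'mailto:hi@example.com',
]
internal = {normalise_link(page, h, 'example.com') for h in hrefs} - {None}
print(internal)
```

Only the first two hrefs survive: the external site and the mailto link are both rejected, and the #comments jump is stripped.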

Using these links we can create an edge between the current link and the new links. This can be done with a simple for loop...

for link in links:
    G.add_edge(url, link)

Now that we've got a set of new links, we need to introduce a queue-like system to process links one by one until we've processed all the hyperlinks on the site. This can be done using the built-in Python list data structure. We'll create one called queue to contain a list of links that are yet to be visited and another list called processed to keep track of which pages we have already visited. This avoids duplicating work that we have already done.

processed = []

# Start with the 'root'
queue = [url]

Without going into too much of the theory behind this, we are essentially performing what is known as a breadth-first search across all links on the site. This means that we visit pages in the order in which we discover them. When we find a new link we add it to the back of the queue, and when we have finished exploring a page we add it to the processed list. All this is wrapped in a while loop such that we keep going through the queue until there is nothing left.
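To see the breadth-first idea on its own, here is a minimal sketch that runs without any network access. The `site` dictionary and the `crawl` function are hypothetical stand-ins for a real website and the scraping code. It also swaps the two lists for a `collections.deque` and a set, which is one common way to keep the queue pops and the "have we seen this?" checks cheap.

```python
from collections import deque

# Hypothetical site: each page maps to the internal links found on it
site = {
    '/': ['/about', '/blog'],
    '/about': ['/'],
    '/blog': ['/blog/post-1', '/about'],
    '/blog/post-1': ['/blog'],
}


def crawl(start, get_links):
    """Breadth-first crawl; returns pages in visit order and the edges found."""
    queue = deque([start])
    seen = {start}  # pages that are queued or already processed
    order, edges = [], []
    while queue:
        page = queue.popleft()  # take the oldest page first
        order.append(page)
        for link in get_links(page):
            edges.append((page, link))
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order, edges


order, edges = crawl('/', lambda page: site.get(page, []))
print(order)
```

The pages come out level by level: the homepage, then everything it links to, then the pages one click further away.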

Putting this all together gives us...

import requests
import networkx as nx
import pandas as pd
from bs4 import BeautifulSoup

domain = 'dataground.io'
url = f'https://{domain}/'

processed = []   # pages already crawled
queue = [url]    # pages still to visit, starting with the root
G = nx.DiGraph()

while queue:
    # Take the next page from the front of the queue
    l = queue.pop(0)
    req = requests.get(l)
    soup = BeautifulSoup(req.content, 'html.parser')

    # Extract, clean and filter the links on this page
    links = soup.find_all('a')
    links = [ln.get('href') for ln in links if ln.get('href')]
    links = [ln.split('#')[0] for ln in links]
    links = [ln for ln in links if domain in ln]
    links = [ln for ln in links if ln != l]
    links = set(links)

    # Queue links that are neither waiting nor already processed
    to_add = [ln for ln in links if ln not in queue]
    to_add = [ln for ln in to_add if ln not in processed]
    queue.extend(to_add)

    # Add an edge from the current page to each link
    for link in links:
        print((l, link))
        G.add_edge(l, link)

    processed.append(l)

EDIT 1: Now that we've got the graph, we can convert its edge list to a pandas data frame and save it as a CSV file...

df = nx.to_pandas_edgelist(G)
df.to_csv('hyperlinks.csv', index=False)

Demonstration

So, to prove that this blog post was worth publishing, I thought I would try it out on this site. Bear in mind that this site is still relatively new at the time of writing, so there is not much content compared with other blogs.

Note: This is not the most efficient algorithm, so if you pick a large website it could take a while to complete.

Nodes are coloured according to page type: tags are in blue, categories in yellow and blog posts in green. The nodes are sized according to PageRank to highlight the most important pages.
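If you're wondering what a PageRank score actually captures, here is a rough power-iteration sketch in plain Python. It is only an illustration under my own assumptions (the `pagerank` helper, the damping factor of 0.85 and the tiny edge list are made up for this example) and not necessarily how the figure was produced. In practice networkx ships its own `pagerank` function that you can call on the graph directly.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Start with the same score for every page
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by pages with no outgoing links is shared by everyone
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each page passes its score on, split evenly across its links
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Hypothetical edge list, shaped like the (page, link) pairs the crawler prints
edges = [('home', 'a'), ('home', 'b'), ('a', 'home'), ('b', 'home'), ('b', 'a')]
scores = pagerank(edges)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Pages that receive links from other well-linked pages float to the top, which is why the homepage wins in this toy example.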

Final Thoughts

The technique we used in this blog post provides an interesting way of finding valuable pages on a website. This is particularly important for those invested in SEO as it provides a way of understanding how your website or blog is structured internally. Also, by cross-referencing relevant blog posts on your site, you can help promote important posts by directing readers to all material they need to know.

It's fair to say that the code used in this post is not perfect, as there is a fair amount of work that can be done to optimise it. For example, multiple threads could be used to extract hyperlinks concurrently. It also might be worth implementing it in a compiled language like Go, which is generally faster and more efficient with data handling and memory management.
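As a rough idea of what the threaded version could look like, here is a sketch using the standard library's `concurrent.futures`. The `FAKE_SITE` dictionary and `fetch_links` function are hypothetical stand-ins: in a real crawler, `fetch_links` would wrap the `requests.get` call and the link filtering from earlier. The sketch fetches one "level" of the breadth-first search at a time.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real website
FAKE_SITE = {
    '/': ['/about', '/blog'],
    '/about': ['/'],
    '/blog': ['/post-1'],
}


def fetch_links(page):
    """Stand-in for downloading a page and returning (page, internal links)."""
    return page, FAKE_SITE.get(page, [])


def crawl_level(pages, max_workers=4):
    """Fetch a batch of pages concurrently; returns {page: links}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch_links, pages))


results = crawl_level(['/about', '/blog'])
print(results)
```

Because most of the time in a crawler is spent waiting on HTTP responses, threads can overlap that waiting even in Python. Just remember to be polite and keep the number of workers small when pointing this at someone else's site.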