Building Hashtag Co-occurrence Networks from Tweets

Hashtags let users annotate their tweets with labels that are likely to be picked up by other users. They are also a convenient way to discover communities of users based on the hashtags they have in common.


In our case, we study the hashtags surrounding the recent Tokyo Olympics using a bipartite network. This models a user's relationship with particular hashtags: users on one side and hashtags on the other. The general rule is that an edge means 'User A uses Hashtag B'. To get things going, we collect tweets that feature #Tokyo2020.

A simple example of this can be found below, with users marked with '@' and hashtags with '#'.
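As a minimal sketch, this bipartite structure can be built directly in networkx. The usernames and hashtags here are made up for illustration:

```python
import networkx as nx

# Toy bipartite graph: '@' nodes are users, '#' nodes are hashtags (made-up data)
B = nx.Graph()
B.add_nodes_from(['@alice', '@bob'], bipartite=0)           # user partition
B.add_nodes_from(['#athletics', '#swimming'], bipartite=1)  # hashtag partition

# An edge means "User A uses Hashtag B"
B.add_edges_from([
    ('@alice', '#athletics'),
    ('@alice', '#swimming'),
    ('@bob', '#swimming'),
])

print(nx.bipartite.is_bipartite(B))  # True
```

The `bipartite` node attribute is optional for what we do later, but it is the convention networkx uses to record which partition a node belongs to.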

Why a bipartite network?

By using a bipartite network, we can project the graph so that hashtags are linked together based on mutual users, and vice versa. In other words, a projection is a representation of a bipartite graph in which one set of nodes is linked together based on mutual connections with the nodes of the other set. The example below shows the projected versions of the original network.

Hashtag co-occurrence networks are useful for modelling the similarity between hashtags in terms of how they are used alongside one another. Similar techniques can be used to find similar users based on their choice of hashtags in tweets.

This technique also allows us to discover communities of users who tweet about similar topics.
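One way to surface such communities, sketched here on a made-up user projection, is modularity-based community detection, which networkx provides via `greedy_modularity_communities` (this is one of several possible choices, not the only approach):

```python
import networkx as nx
from networkx.algorithms import community

# Toy user projection (made-up data): weights = number of shared hashtags
G_users = nx.Graph()
G_users.add_weighted_edges_from([
    ('@alice', '@bob', 3),
    ('@bob', '@carol', 2),
    ('@dave', '@erin', 4),
])

# Greedy modularity maximisation groups densely connected users together
comms = community.greedy_modularity_communities(G_users, weight='weight')
for c in comms:
    print(sorted(c))
```

Here the algorithm separates the two groups of users that share hashtags only among themselves.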

Network Construction

Much like before, our scraping tool of choice is TWINT, as it makes the job of collecting data much easier. See below for other posts that use TWINT.

To get things going, we can start to load in the scraped data collected from TWINT. For simplicity, only English tweets are included.

import pandas as pd

# Load the scraped tweets and keep only those marked as English
df = pd.read_csv('tokyo-tweets.csv')
df = df[df['language'] == 'en']

Next, we go through each tweet and collect the username of its author as well as the hashtags it features. We also remove certain hashtags that may cause conflicts; in our case, we remove the original hashtag to avoid overcrowding the network.

For each hashtag mentioned in a tweet, we form an undirected edge between the user and the hashtag. We also record the two node sets as we go, since the projection step later needs to know which nodes are users and which are hashtags.

import ast

import networkx as nx
from networkx.algorithms import bipartite

# The seed hashtag would connect to almost every user, so we drop it
to_remove = ['tokyo2020']

G = nx.Graph()
usernames = set()
hashtags = set()

for _, row in df.iterrows():
    # Get username
    username = f"@{row['username']}"
    usernames.add(username)

    # Process and format hashtags (the column holds a stringified list)
    hts = ast.literal_eval(row['hashtags'])
    hts = [f'#{h.lower()}' for h in hts if h.lower() not in to_remove]
    hashtags.update(hts)

    # Create an edge between the user and each hashtag
    for h in hts:
        G.add_edge(username, h)

Now that we've generated the network, we can export the data into an appropriate format. In our case, we'll use pandas to export the edge list as a CSV file.

df_edges = nx.to_pandas_edgelist(G)
df_edges.to_csv('tokyo-tweets_hashtags.csv', index=False)

Using networkx, we can make use of the weighted_projected_graph function, which returns an undirected projection of the original bipartite network onto a given node set, with edge weights counting the number of shared neighbours. We do this for both hashtags ...

G_hashtags = bipartite.weighted_projected_graph(G, hashtags)
df_hashtags = nx.to_pandas_edgelist(G_hashtags)
df_hashtags.to_csv('tokyo-tweets_hashtags_projection.csv', index=False)

and users ...

G_users = bipartite.weighted_projected_graph(G, usernames)
df_users = nx.to_pandas_edgelist(G_users)
df_users.to_csv('tokyo-tweets_users_projection.csv', index=False)

Depending on how large your graph is, the projection step may take a while to process: its cost grows with the number of node pairs that share neighbours, so dense networks are the slow case. On a machine with plenty of memory and processing power, it shouldn't take too long.