In the previous blog post, we learned how to use PRAW to scrape and process data from Reddit. We finished off by looking at how to collect top-level comments from a post submission and briefly mentioned how to collect replies. However, due to the complexity of modelling nested replies, I thought it would be best to create this blog post to better explain how it all works and how we can generate reply networks.
What is a reply network?
First of all, the most important question: what exactly is a reply network? A simple answer is that a reply network is a graph showing which users replied to which other users. In practical terms, this can be represented by a simple directed graph where the direction of each arrow indicates who replied to whom. Simple, right?
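To make this concrete, here is a minimal sketch using networkx, with made-up usernames: each `(replier, target)` pair becomes a directed edge.

```python
import networkx as nx

# Hypothetical reply pairs: (replier, user being replied to)
replies = [("alice", "bob"), ("carol", "bob"), ("bob", "alice")]

# A directed edge u -> v means "u replied to v"
G = nx.DiGraph()
G.add_edges_from(replies)

# bob received two replies, so his in-degree is 2
print(G.in_degree("bob"))  # → 2
```

The in-degree of a node immediately tells us how many replies that user received.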
Why use reply networks?
If you've used Reddit before, or indeed any other forum, you know that comments and replies produce a hierarchical structure where the depth indicates the length of a conversation between a group of users. Due to this tree-like structure, it can become quite hard to follow the exchanges between users as the conversation builds over time. The effect compounds: the number of nodes could in theory grow exponentially over time, which makes the thread increasingly hard to process.
By using a reply network, we are essentially collapsing the discussion down into a simple network representation where a node represents a user and a directed edge represents a reply from one user to another. We can also attach features to each edge, such as when the reply was made (a timestamp) and how many times a user replied to a certain individual. Reply networks allow us to understand who talks to whom the most and to find meaningful connections. They can also be used to understand how conversations are formed.
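As a small illustration (with invented usernames and timestamps), a networkx MultiDiGraph can hold one edge per reply, each carrying its own attributes, so the reply count between two users is simply the number of parallel edges:

```python
import networkx as nx

# Hypothetical exchanges: alice replied to bob twice, bob once to alice
G = nx.MultiDiGraph()
G.add_edge("alice", "bob", created=1700000000)
G.add_edge("alice", "bob", created=1700000300)
G.add_edge("bob", "alice", created=1700000600)

# Parallel edges are preserved, so this counts alice's replies to bob
print(G.number_of_edges("alice", "bob"))  # → 2
```

A plain DiGraph would collapse the two alice→bob replies into one edge; the multigraph keeps each reply as its own edge with its own timestamp.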
Now that we know the theory behind the concept of reply networks, we can start scraping our data using PRAW. First of all, we need to import the packages we need and initialise the API.
```python
import praw
import networkx as nx

CLIENT_ID = "[YOUR ID KEY HERE]"
CLIENT_SECRET = "[YOUR SECRET KEY HERE]"
USER_AGENT = "[YOUR USERNAME HERE]"

reddit = praw.Reddit(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    user_agent=USER_AGENT,
)
```
To get things going, we need a starting point. In my case, I'm going to start things off by getting the first 'hot' post from the r/worldnews subreddit and initialising a MultiDiGraph. Note: we are creating a multigraph purely to factor in the possibility of multiple edges between the same pair of users.
```python
G = nx.MultiDiGraph()

subreddit = reddit.subreddit('worldnews')
for submission in subreddit.hot(limit=1):
    ...
```
This gives us the submission variable we need to access the comments.
Depending on the popularity of the post, many people may decide to leave comments and reply to others. For this reason, we need to expand the comments using the replace_more method to ensure that we can access all of the top-level comments.
Once we have done that, we can add them to the queue: a list of comments we still need to process to collect replies.
```python
submission.comments.replace_more(limit=None)
comment_queue = submission.comments[:]
```
Once we have populated the queue with the initial top-level comments, we need to go through them one by one and search for replies. This can be achieved with a simple while loop. For each comment in the queue, we start by checking that both the user and the parent user (the user receiving the reply) exist - PRAW returns None for the author of deleted accounts.
```python
while comment_queue:
    comment = comment_queue.pop(0)
    c1 = comment
    c2 = comment.parent()
    if not c1.author:
        continue
    if not c2.author:
        continue
    ...
```
If they both exist, we can proceed to get the full usernames and create a connecting edge between the two. We can also add additional information to the edge, such as the date it was created, the karma score and the ID. We then repeat this process, queuing up each comment's replies as we go, until all the comments have been processed.
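Putting those pieces together, the whole traversal might look like the sketch below. `build_reply_network` is a hypothetical helper name; it works on any objects exposing `author`, `parent()`, `replies`, `created_utc`, `score` and `id`, which PRAW Comment objects do.

```python
import networkx as nx

def build_reply_network(top_level_comments):
    """Breadth-first walk over a comment tree, adding one edge per reply."""
    G = nx.MultiDiGraph()
    comment_queue = list(top_level_comments)
    while comment_queue:
        comment = comment_queue.pop(0)
        parent = comment.parent()
        # Skip replies where either account has been deleted (author is None)
        if comment.author and parent.author:
            G.add_edge(
                str(comment.author),   # the replier
                str(parent.author),    # the user being replied to
                created=comment.created_utc,
                score=comment.score,
                id=comment.id,
            )
        # Queue this comment's replies so the whole tree gets visited
        comment_queue.extend(comment.replies)
    return G
```

With PRAW, you would call it as `G = build_reply_network(submission.comments)` after `replace_more` has been run.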
Please note that this may take some time depending on how large the discussion thread is. After all, this isn't exactly the most efficient algorithm and it only serves as a working proof of concept for this blog post.
As mentioned earlier, this is only a working proof of concept, and there are many things that could be done to make it more efficient - such as processing comments concurrently. There is definitely more we could add to our algorithm, but I'll leave that as something for you to do for now.
Overall, in this blog post we went through the process of enumerating all the comments and replies of a submission to generate a reply network.