In this two-part series, we cover two ways to hydrate tweets for data analysis. I thought it was best to split this into two parts covering the easy way and the slightly harder way of achieving the same outcome. This will help us learn the theory behind how this all works. In this post, we will start with the easy way by using Hydrator - a simple desktop GUI app that hydrates tweets for you.
Before we get started, it's worth asking what exactly it means to hydrate tweets.
What does hydrate mean?
This is a good question. Hydrating a tweet refers to the process of recovering the original tweet data from the Twitter API using only the tweet ID. Thankfully, this is easy enough to do, as the API has a lookup feature for doing exactly this.
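To make the idea concrete, here is a minimal sketch of how such a lookup request could be assembled against Twitter's v1.1 `statuses/lookup` endpoint. The bearer token is a placeholder - you would need to generate a real one from a Twitter developer account - and the tweet IDs shown are made up for illustration.

```python
# Sketch: building a hydration (lookup) request from a batch of tweet IDs.
LOOKUP_URL = "https://api.twitter.com/1.1/statuses/lookup.json"

def build_lookup_request(tweet_ids, bearer_token):
    """Build the query parameters and auth header for one lookup call.

    The endpoint accepts a comma-separated list of IDs in the `id`
    parameter, which is how up to 100 tweets are fetched per request.
    """
    params = {"id": ",".join(str(t) for t in tweet_ids)}
    headers = {"Authorization": f"Bearer {bearer_token}"}
    return params, headers

# Example (placeholder IDs and token):
params, headers = build_lookup_request(
    [1278747501642657792, 1278747501642657793], "YOUR_BEARER_TOKEN"
)
```

Sending `params` and `headers` with an HTTP GET to `LOOKUP_URL` (e.g. via the `requests` library) would return the full JSON for each tweet - that JSON is the "hydrated" form.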
Why do we hydrate tweets?
If you've ever worked with Twitter data before, you'll soon realise that the API gives you a ton of information - take a look here for yourself. Often, if someone wants to share a set of tweets with you, rather than sharing the entire data set, they will only provide a list of tweet IDs. There are several reasons why people may want to do this.
Firstly, there may be regulations in place (e.g. ethical constraints) that prevent the full contents of the data from being released. This may be because it includes information that the owners do not wish to share publicly (e.g. graphic content, user privacy, etc.).
Secondly, it decreases the file size by reducing a tweet to its most essential component, the ID. This makes it easier to distribute and uses less bandwidth to send.
Finally, it's easier to create and share a list of IDs than the full tweets. As a result, you don't have to have gigabytes' worth of tweets sitting on your desktop that you need to share with others.
I'm sure there are many other reasons why but these are the ones I thought of at this point in time.
So, how do I hydrate tweets?
Before we learn how to hydrate tweets, it's important to have an idea of how the Twitter API works. As we said earlier, when you hydrate a tweet you are essentially retrieving the original tweet using its ID. This is done with a simple lookup call to the API.
While it may appear fairly straightforward, it does come with a catch.
Firstly, to hydrate tweets, we need access to a Twitter account. If you already have one, that's great! You're ready to move on. If not, you'll need to go ahead and create one.
Secondly, the API only allows us to make a limited number of requests in a given time window. Fortunately, as of this writing, they are fairly generous with the number of requests you can make, but it does mean you have to factor in some sort of delay once you reach your limit.
Currently, with the lookup API, you can make up to 900 requests per 15-minute window and for every request, we can retrieve up to 100 tweets. So doing the maths, this means that you can get 900 * 100 = 90,000 tweets every 15 minutes! That's quite a lot! Although, some data sets I've worked with before can contain millions of tweets 😳.
For this reason, it is important to take this 15-minute window into account such that we don't go over our limit.
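The batching and rate-limit logic described above can be sketched as follows. This is not how the Hydrator app is implemented internally - it's just an illustration, under the assumption of a caller-supplied `fetch_batch` function that performs one lookup call and returns the hydrated tweets.

```python
import time

REQUESTS_PER_WINDOW = 900   # lookup calls allowed per 15-minute window
WINDOW_SECONDS = 15 * 60    # length of the rate-limit window
BATCH_SIZE = 100            # tweet IDs per lookup call

def chunk_ids(tweet_ids, size=BATCH_SIZE):
    """Split a list of tweet IDs into batches of at most `size`."""
    return [tweet_ids[i:i + size] for i in range(0, len(tweet_ids), size)]

def hydrate_all(tweet_ids, fetch_batch):
    """Hydrate every batch, pausing once the window's request budget is spent.

    `fetch_batch` is assumed to take a list of up to 100 IDs and return
    the corresponding hydrated tweets.
    """
    results = []
    for n, batch in enumerate(chunk_ids(tweet_ids), start=1):
        results.extend(fetch_batch(batch))
        if n % REQUESTS_PER_WINDOW == 0:
            time.sleep(WINDOW_SECONDS)  # wait for the window to reset
    return results
```

With 900 requests of 100 IDs each, a full window processes 90,000 tweets before the sleep kicks in - which matches the maths above.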
To get things going, go ahead and download the Hydrator App from https://github.com/DocNow/hydrator
Once installed, you'll be presented with a rather friendly GUI where you will be prompted to login with your Twitter account. You'll then be prompted to add a new dataset. All you need to do is provide a list of tweet IDs (a CSV file will work fine) and the title for the set. I've called mine "COVID-19 Tweets".
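For reference, the IDs file is about as simple as a file format gets - to my knowledge, Hydrator just expects one tweet ID per line, something like this (placeholder IDs):

```text
1278747501642657792
1278747501642657793
1278747501642657794
```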
As soon as you have added the IDs file and title, you can safely select the "Add Dataset" Button.
Once that is done, you'll be informed as to how many tweets need to be processed. To get things going, just press "Start" and Hydrator will ask you where you would like to save the JSON file which will contain the raw "hydrated" tweets.
After this, Hydrator will take care of everything else for you - it will even manage the rate limitations enforced by Twitter on your behalf. How nice of them :)
And that's it!
Over time, your tweet IDs will be replaced with the original tweets, although it is important to factor in that this may take some time depending on how large the dataset is. As I said before, some sets can contain literally millions of tweets. We also need to consider the 15-minute window enforced by Twitter, as mentioned above.