How to Scrape a Twitter Timeline

Photo by Marten Bjork / Unsplash

In a separate blog post, we covered the basics of streaming tweets from Twitter using Python. Streaming is ideal if you want to collect tweets in real time as they are created, but it does not help if you want to scrape tweets made in the past.

In this post, we cover the basics of scraping tweets from a user's timeline (that is, tweets created by a single user) using tweepy - a Python wrapper for interacting with the Twitter API.

Getting Started

Before starting, you need to generate your API keys from Twitter and have tweepy downloaded and installed. Generating Twitter API keys is fairly straightforward and the tutorial can be found here. As for installing tweepy...

pip install tweepy

Setting up the API

Once installed, you will need to import tweepy and fill in the API keys where appropriate. Note: pandas is also imported to help with exporting and modelling the data.

# Load in packages
import tweepy
import pandas as pd

# Set API keys
auth = tweepy.OAuthHandler('[TWITTER-APP-KEY]', '[TWITTER-APP-SECRET]')
auth.set_access_token('[TWITTER-OAUTH-TOKEN]', '[TWITTER-OAUTH-TOKEN-SECRET]')

api = tweepy.API(auth, wait_on_rate_limit=True)

...

As we initialise the API, it is important to set the wait_on_rate_limit parameter to True, as this ensures that we don't run into errors if we go over the API's rate limit. Instead, if we have reached the limit, the program will wait until the limit resets and then carry on.
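To give a feel for what "waiting" means here, below is a minimal sketch of the idea, not tweepy's actual implementation. It assumes the reset time arrives as a Unix timestamp (Twitter's v1.1 API reports one in the x-rate-limit-reset response header). The helper name seconds_until_reset and the one-second buffer are made up for illustration.

```python
import time


def seconds_until_reset(reset_epoch, now=None, buffer=1):
    """Return how many seconds to sleep before the rate-limit window resets.

    reset_epoch: Unix timestamp at which the limit resets (assumed input).
    now:         current Unix time; defaults to time.time().
    buffer:      extra seconds added as a safety margin (arbitrary choice).
    """
    if now is None:
        now = time.time()
    # Never return a negative wait if the window has already reset
    return max(0, int(reset_epoch - now) + buffer)


# Window resets 30 seconds from "now": wait 30s plus the 1s buffer
wait_future = seconds_until_reset(1030, now=1000)

# Window reset 10 seconds ago: no need to wait
wait_past = seconds_until_reset(990, now=1000)
```

With wait_on_rate_limit=True you do not need to write anything like this yourself; tweepy takes care of the pause for you.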

Collecting Tweets

Once the API has been configured, we can start scraping tweets from a user's timeline. Because the API returns at most 200 tweets per request, we need to introduce an infinite loop and process tweets in chunks.

Thankfully, we can keep track of where we are by using the max_id parameter, which only retrieves tweets with an ID at or below the one given. By setting it to one less than the ID of the oldest tweet in the previous chunk, each request picks up where the last one left off without fetching the same tweet twice. This lets us walk back through the timeline as far as the API allows (this endpoint only exposes a user's most recent 3,200 tweets) and gives us as much coverage as possible.

For each iteration, all the tweets (stored as JSON objects) in the chunk are appended to a single list. Once we have reached the end of the timeline, the API returns an empty result, which breaks us out of the infinite loop thanks to the if len(ts) == 0 check.

...

# Scrape the timeline
username = "[USERNAME]"

tweets = []

last_id = None

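while True:
    try:
        # Fetch up to 200 tweets at or below last_id (None starts from the newest)
        ts = api.user_timeline(screen_name=username, count=200, max_id=last_id)
    except (tweepy.errors.Unauthorized, tweepy.errors.NotFound):
        # Protected or missing account: stop and keep whatever was collected
        break

    # An empty chunk means we have reached the end of the timeline
    if len(ts) == 0:
        break

    # Keep the raw JSON of every tweet in this chunk
    tweets.extend([t._json for t in ts])

    # max_id is inclusive, so step just below the oldest tweet in this chunk
    last_id = ts[-1].id - 1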
    
...
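If you want to convince yourself that the max_id bookkeeping works without touching the API, here is a small offline sketch. The fetch_page function is a made-up stand-in for api.user_timeline that serves tweets from an in-memory list, newest first, and the fake timeline of 450 tweets is invented purely for illustration.

```python
def fetch_page(timeline, count, max_id=None):
    """Hypothetical stand-in for api.user_timeline.

    Returns up to `count` tweets, newest first, whose id is <= max_id
    (or from the top of the timeline when max_id is None).
    """
    eligible = [t for t in timeline if max_id is None or t["id"] <= max_id]
    return eligible[:count]


def collect_all(timeline, count=200):
    """Walk back through the timeline in chunks, as in the loop above."""
    collected = []
    last_id = None
    while True:
        page = fetch_page(timeline, count, max_id=last_id)
        if len(page) == 0:
            break  # nothing older is left
        collected.extend(page)
        # max_id is inclusive, so step just below the oldest tweet seen
        last_id = page[-1]["id"] - 1
    return collected


# Fake timeline: 450 tweets with ids 450 (newest) down to 1 (oldest)
fake_timeline = [{"id": i} for i in range(450, 0, -1)]
result = collect_all(fake_timeline)
ids = [t["id"] for t in result]
```

The fake timeline is consumed in chunks of 200, 200 and 50, and the fourth request comes back empty, which ends the loop. Dropping the "- 1" would make every chunk after the first start with a tweet that has already been collected.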

Exporting Tweets

Once we have finished scraping all the tweets, we can use the json_normalize function of pandas to convert our list of JSON objects into a tabular dataframe, which can then be exported to a CSV file using to_csv.

...

df = pd.json_normalize(tweets)
df.to_csv(f'{username}.csv')
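As a rough illustration of what json_normalize does with nested tweet objects, here is a tiny sketch using made-up data. The two sample tweets are invented and far smaller than real tweet JSON, but they show how a nested field such as the user object is flattened into dot-separated column names.

```python
import pandas as pd

# Invented, heavily trimmed stand-ins for real tweet JSON
sample_tweets = [
    {"id": 1, "text": "first", "user": {"screen_name": "example", "followers_count": 10}},
    {"id": 2, "text": "second", "user": {"screen_name": "example", "followers_count": 10}},
]

# One row per tweet; nested keys become columns such as "user.screen_name"
sample_df = pd.json_normalize(sample_tweets)
columns = list(sample_df.columns)
```

Real tweets carry many more nested fields, so expect the exported CSV to be fairly wide.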

Done!

Final Comments

In this short blog post, we covered the basics of scraping a user's timeline on Twitter. Thanks to Python and tweepy, tweets can be scraped quickly and with very little effort. I'm sure that the code featured in this post could be reworked or improved to better suit your needs. For example, as opposed to saving tweets to a simple CSV file, why not save them to a database?
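As one possible starting point for the database idea, here is a minimal sketch using SQLite from Python's standard library. The table layout and the save_tweets helper are assumptions made for illustration; the id, created_at and text keys are standard fields of a v1.1 tweet object, and the sample tweets are invented.

```python
import json
import sqlite3


def save_tweets(conn, tweets):
    """Store tweets in a hypothetical `tweets` table, skipping duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id INTEGER PRIMARY KEY, created_at TEXT, text TEXT, raw TEXT)"
    )
    rows = [
        # Keep a few handy columns plus the full JSON for later use
        (t["id"], t.get("created_at"), t.get("text"), json.dumps(t))
        for t in tweets
    ]
    # INSERT OR IGNORE leaves rows with an existing id untouched
    conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# In-memory database for demonstration; pass a file path to persist the data
conn = sqlite3.connect(":memory:")

# Invented sample data standing in for the scraped `tweets` list
sample = [
    {"id": 101, "created_at": "Mon Jan 01 00:00:00 +0000 2024", "text": "hello"},
    {"id": 102, "created_at": "Tue Jan 02 00:00:00 +0000 2024", "text": "world"},
]

save_tweets(conn, sample)
save_tweets(conn, sample)  # running the scrape again does not duplicate rows

row_count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

Using the tweet ID as the primary key means the scraper can be re-run against the same user without piling up duplicate rows.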