Searching Twitter from Tweepy is a lot less troublesome than working with lists. Let’s explore the search API of Twitter and the V2 client of Tweepy.
This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.
3 things you must know about Twitter search
Twitter allows you to search recent tweets only (posted in the last 7 days). When you need to go back further, you need an academic research account and Twitter does not give them out easily. Expect a much longer approval process than for your developer account – and a much higher decline rate.
The search API is built to handle a huge number of requests and it is heavily optimized to reduce the amount of data it needs to transfer. Therefore, by default you only get a minimalistic set of data about a tweet back.
For most use cases you need additional data. Luckily, we can tell the search API exactly what data we want. This flexibility comes at a price: you need to learn which options you can use. Otherwise, you may not be able to do something useful with the search results.
The V2 client
For search, the V2 API client works very well. We only need the bearer token to search Twitter:
import tweepy
import os
from dotenv import load_dotenv

load_dotenv()

bearer_token = os.getenv('bearer-token')

client = tweepy.Client(bearer_token)
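If the environment variable is missing, `os.getenv` quietly returns `None`, and the problem may only surface later when a request fails. A minimal sketch of a fail-fast check, assuming a hypothetical helper `require_env` (it is not part of Tweepy or python-dotenv):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Hypothetical error message; adapt it to your own setup
        raise RuntimeError(f"Environment variable '{name}' is not set")
    return value
```

With this helper, `bearer_token = require_env('bearer-token')` could take the place of the plain `os.getenv` call.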
The rate limit for the search endpoint is between 500’000 and 1’000’000 requests per month. If you reach that limit, you had better stop your application; waiting for the reset may be a colossal waste of energy.
Search Twitter (basic)
The basic search without any additional fields works with this code snippet:
# Search recent tweets (last 7 days)
response = client.search_recent_tweets("#Python Friday 112", max_results=15)

tweets = response.data
for tweet in tweets:
    print('-' * 50)
    print(f"{tweet.id} ({tweet.created_at}) - {tweet.author_id}:\n{tweet.text}\n")
When I ran this code, I got these results back:
--------------------------------------------------
1499834381367746561 (None) - None:
RT @j_graber: #Python Friday #112: How to Use #Tweepy in #Flask https://t.co/fNNOXPB9GI

--------------------------------------------------
1499834217802387456 (None) - None:
RT @j_graber: #Python Friday #112: How to Use #Tweepy in #Flask https://t.co/fNNOXPB9GI

--------------------------------------------------
1499834195820036098 (None) - None:
RT @j_graber: #Python Friday #112: How to Use #Tweepy in #Flask https://t.co/fNNOXPB9GI

--------------------------------------------------
1499822223338655756 (None) - None:
#Python Friday #112: How to Use #Tweepy in #Flask https://t.co/fNNOXPB9GI
The basic result set has the Tweet id and the text. Everything else from author to when the Tweet was posted is empty.
Tailor the result
In the endpoint documentation you find all the Enum values you can use to tailor the representation of the search results. There is a good explanation on how annotations work and how you can use fields and expand them.
Here are a few combinations I find most useful:
Parameter | Field | Description |
tweet_fields | author_id | The numerical ID of the tweet author |
tweet_fields | created_at | When the Tweet was posted |
tweet_fields | public_metrics | Counters for retweets, likes, quotes and replies |
tweet_fields | attachments | References to media files (images, videos) |
user_fields | username | The screen name (@…) |
user_fields | name | The name of the user |
user_fields | profile_image_url | The URL of the profile picture |
media_fields | public_metrics | Metrics on the attachment |
media_fields | url | URL of the media |
media_fields | height | Height of the media |
media_fields | width | Width of the media |
media_fields | alt_text | The alternative text |
expansions | author_id | Must be set to get user objects |
expansions | attachments.media_keys | Must be set to get media fields |
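The combinations in the table can be kept as Python lists and joined into the comma-separated strings that the search call takes. This is a minimal sketch: the helper name `build_search_kwargs` is hypothetical, and only the parameter names and field values come from the table above.

```python
def build_search_kwargs() -> dict:
    """Build keyword arguments for a tailored recent-tweets search."""
    fields = {
        "expansions": ["author_id", "attachments.media_keys"],
        "tweet_fields": ["created_at", "public_metrics", "attachments"],
        "user_fields": ["username", "name", "profile_image_url"],
        "media_fields": ["public_metrics", "url", "height", "width", "alt_text"],
    }
    # Each parameter is passed as one comma-separated string
    return {name: ",".join(values) for name, values in fields.items()}


kwargs = build_search_kwargs()
print(kwargs["expansions"])
```

The resulting dictionary could then be handed over as `client.search_recent_tweets("#VisitOslo", **kwargs)`.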
Structure of the search result
It is important to know that the extra data you request will not be part of the Tweet object. Instead, users and media objects will be in the includes dictionary of the response:
Response(
    data=[
        <Tweet id=... text='...'>,
        <Tweet id=... text='...'>,
        <Tweet id=... text='...'>,
        <Tweet id=... text='...'>,
        <Tweet id=... text='...'>
    ],
    includes={
        'users': [
            <User id=1... name=... username=...>,
            <User id=2... name=... username=...>,
            <User id=3... name=... username=...>,
            <User id=4... name=... username=...>,
        ],
        'media': [
            <Media media_key=3_1500146... type=photo>,
            <Media media_key=3_1500069... type=photo>
        ]},
    errors=[],
    meta={
        'newest_id': '...',
        'oldest_id': '...',
        'result_count': 10,
        'next_token': '...'
    }
)
Be aware that entries of the includes dictionary may be missing or empty: if your search result does not contain any media objects, do not count on a media list being there.
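One way to guard against missing entries is to read the includes with `dict.get` and a default. This is a minimal sketch: `included` is a hypothetical helper, and the `SimpleNamespace` object merely stands in for a real Tweepy response so the example runs without a network call.

```python
from types import SimpleNamespace


def included(response, key: str) -> list:
    """Return response.includes[key], or an empty list when it is absent."""
    includes = response.includes or {}
    return includes.get(key) or []


# Stand-in for a response whose tweets have authors but no media
fake_response = SimpleNamespace(includes={"users": ["user-1", "user-2"]})

print(included(fake_response, "users"))
print(included(fake_response, "media"))
```

Loops over `included(response, 'media')` then simply do nothing when no media came back.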
Search Twitter (expanded)
If we put everything from above together, we can search for the hashtag #VisitOslo and try finding some pictures:
response = client.search_recent_tweets(
    "#VisitOslo",
    max_results=10,
    expansions="author_id,attachments.media_keys",
    tweet_fields="created_at,public_metrics,attachments",
    user_fields="username,name,profile_image_url",
    media_fields="public_metrics,url,height,width,alt_text")

# process users (the key is read with .get() in case the search found nothing)
users = {}
for user in response.includes.get('users', []):
    users[user.id] = f"{user.name} (@{user.username}) [{user.profile_image_url}]"
    print(users[user.id])

# process media attachments (no 'media' entry if no tweet has media)
media = {}
for item in response.includes.get('media', []):
    media[item.media_key] = f"{item.url} - {item.height}x{item.width} - Alt: {item.alt_text}"
    print(media[item.media_key])

# response.data is None when the search found nothing
tweets = response.data or []

# The expanded tweet offers a lot more data
for tweet in tweets:
    print('-' * 50)
    print(f"{tweet.id} ({tweet.created_at}) - {users[tweet.author_id]}:\n {tweet.text} \n")
    metric = tweet.public_metrics
    print(f"retweets: {metric['retweet_count']} | likes: {metric['like_count']}")
    if tweet.attachments is not None:
        # attachments without media (e.g. polls) have no 'media_keys' entry
        for media_key in tweet.attachments.get('media_keys', []):
            print(f"Media attachment: {media[media_key]}")
When I ran the search, I got an interesting set of results (cut to the two important tweets):
#ThePhotoHour (@ThePhotoHour) [https://pbs.twimg.com/profile_images/969225604351578113/LS5Yis-q_normal.jpg]
Ragnhild Aarseth (@Riamolde) [https://pbs.twimg.com/profile_images/1344351622/RAA_normal.jpg]
https://pbs.twimg.com/media/FNGXbMvX0AYi0uq.jpg - 960x720 - Alt: None
--------------------------------------------------
1500159252580777989 (2022-03-05 17:20:12+00:00) - #ThePhotoHour (@ThePhotoHour) [https://pbs.twimg.com/profile_images/969225604351578113/LS5Yis-q_normal.jpg]:
RT @Riamolde: Winter street in #Oslo #norway #visitnorway #winterwonderland #winterinnorway #visitoslo #StormHour #ThePhotoHour https://t.c…

retweets: 4 | likes: 0
--------------------------------------------------
1500146647099219969 (2022-03-05 16:30:06+00:00) - Ragnhild Aarseth (@Riamolde) [https://pbs.twimg.com/profile_images/1344351622/RAA_normal.jpg]:
Winter street in #Oslo #norway #visitnorway #winterwonderland #winterinnorway #visitoslo #StormHour #ThePhotoHour https://t.co/F27Ao2Yh3j

retweets: 4 | likes: 21
Media attachment: https://pbs.twimg.com/media/FNGXbMvX0AYi0uq.jpg - 960x720 - Alt: None
The includes part of the response contains two authors and one media file. If you take a closer look at the tweets, you see that the first tweet is a retweet of the second, older tweet. However, only the second tweet contains a media attachment. That is because I did not ask the search to expand the retweeted tweets, so I did not get that data back.
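To follow a retweet back to the original tweet, the search would need the `referenced_tweets.id` expansion; the referenced tweets should then show up under `includes['tweets']`. The lookup could be sketched as follows. `find_original` is a hypothetical helper, the objects are stand-ins built with `SimpleNamespace`, and whether the media of the original tweet comes back as well is something to verify against the API documentation.

```python
from types import SimpleNamespace


def find_original(tweet, included_tweets):
    """Return the tweet that `tweet` retweets, or None if it is not a retweet."""
    by_id = {t.id: t for t in included_tweets}
    for ref in tweet.referenced_tweets or []:
        if ref.type == "retweeted":
            return by_id.get(ref.id)
    return None


# Stand-in objects that mimic only the attributes used above
original = SimpleNamespace(id=2, text="Winter street in #Oslo", referenced_tweets=None)
retweet = SimpleNamespace(
    id=1,
    text="RT @Riamolde: Winter street in #Oslo",
    referenced_tweets=[SimpleNamespace(id=2, type="retweeted")],
)

print(find_original(retweet, [original]).text)
print(find_original(original, [original]))
```

With a real response, the second argument would be the list of tweets found in the includes, if any.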
Next
This post shows how we can tailor the search results to exactly match our requirements. Unfortunately, that means we have to learn a whole lot to get what we need. Next week we try to stream search results in real-time as those tweets get posted.
Great work! Thanks.
What if we want to get exactly the same result as above (tweets and associated user data) but for more than 100 tweets? I know pagination and flattening can be used, but this doesn’t include user data. How can we include user data, especially since search_recent_tweets doesn’t work with next token?