GetOldTweets3
Too Many Requests
I wonder if there is a way to break the download into pieces and pause between pieces to avoid the "Too Many Requests" error. I am getting tweets for one heavily used word, and I want to break the download into batches of 10,000 tweets and pause in between batches.
I am sadly having the same issue. It used to work great but not anymore. Any solution from the more experienced coders?
I'm having the same issue. Is there a way to make the code sleep in order to pace the number of requests?
Same issue here too; I couldn't find a feature that allows the scraping process to sleep.
Would using a proxy do the trick? If so, how? I tried setting a proxy in PyCharm and in the general settings, but no luck.
I've been playing around with setMaxTweets and also tried using time.sleep, but I'm still unable to collect a dataset bigger than about 10,000.
Same here. I am not sure what happened. The problem is that the query I am trying to run definitely matches more than 10,000 tweets in a 24-hour period. Can you search by the hour? That would be annoying, but it would still allow scraping the data for a whole day.
It's not even working with 10k now :/ I'm guessing Twitter's team has put in countermeasures.
I tried to alter the source code with time.sleep but got a different error:
An error occured during an HTTP request: HTTP Error 503: Service Temporarily Unavailable
Since the library mimics Twitter's advanced search, the smallest time unit is a day; you can see the available options by checking out Twitter's advanced search page.
For ranges longer than a day, it is easy to loop through the days and sleep in between. I couldn't find a documented download rate limit for Twitter's server, but when I had the script sleep for 16 minutes between two days, it seemed to recover from the Too Many Requests error.
My problem is when I download one day of tweets for a common word, so I don't think it has much to do with the Advanced Search functionality anymore, which is a good thing. Similar to the max tweets cap, there should be a way to cap the downloads while keeping the search alive (maybe that's the issue in your case, @lethalbeans?), with a placeholder for resuming the download after the sleep.
@elkalhor thanks, I've managed to get 20k so far; I'm aiming for at least 100k.
Yeah, that would be good. The current code uses try/except blocks and ends the script abruptly if the JSON request is not fulfilled. I'm currently trying to implement ratelimit:
https://github.com/tomasbasham/ratelimit
I would propose changing the script's current exit behavior to a unified exception. That way the user can decide to catch the exception and retry the request later with whatever method they like (a rough sketch of the idea follows this comment). Using sys.exit() without any error code seems like the worst way to handle it.
Besides that, the approach from @lethalbeans seems like a really good idea to me. Any progress on that?
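For illustration, a minimal sketch of the unified-exception idea proposed above; GetOldTweetsError and fetch_json are hypothetical names, not part of the library:

import urllib.request
import urllib.error

class GetOldTweetsError(Exception):
    """Hypothetical unified exception raised instead of calling sys.exit()."""

def fetch_json(url):
    try:
        return urllib.request.urlopen(url).read()
    except urllib.error.HTTPError as e:
        # Raise instead of exiting, so callers can decide how to recover.
        raise GetOldTweetsError("HTTP request failed with status %d" % e.code) from e

# A caller can then catch the exception and retry on its own schedule:
# try:
#     page = fetch_json(url)
# except GetOldTweetsError:
#     time.sleep(900)  # back off for 15 minutes, then retry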
@ekalhor, this should do the trick
https://github.com/Mottl/GetOldTweets3/issues/3#issuecomment-527642499
@lethalbeans did you get ratelimit to work?
Has anybody found a solution for this? I get the same error after 15,000 to 20,000 tweets.
I used ratelimit to solve the 429 problem, but I eventually got another error:
An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)>
To use ratelimit:
from ratelimit import limits, sleep_and_retry
ONE_MINUTE = 60
@sleep_and_retry
@limits(calls=30, period=ONE_MINUTE)
def callAPI...
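For reference, a complete version of that decorator pattern might look like the sketch below; callAPI is the placeholder name from the snippet above, and its body (a plain urllib request) is an assumption for illustration, not the library's actual code:

import urllib.request
from ratelimit import limits, sleep_and_retry

ONE_MINUTE = 60

@sleep_and_retry                      # sleep until the rate window frees up
@limits(calls=30, period=ONE_MINUTE)  # allow at most 30 calls per 60 seconds
def callAPI(url):
    # Any function that performs the HTTP request can be decorated this way.
    return urllib.request.urlopen(url).read()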
I tried it and still got the "too many requests" error. Could you please share your complete code? (I mean the callAPI function.)
I'm still working so haven't made a PR, but here's the code: https://github.com/libbyh/GetOldTweets-python/blob/py3/GetOldTweets3/manager/TweetManager.py
Kinda sloppy because it includes the ratelimit approach (lines 277-279) and a catch-and-sleep for the 429 (lines 351-355).
Has anyone solved the certificate expired error, or know what's causing it? I can't tell if it's a rate-limit-related issue or something else. I'm also getting the certificate expired error, but it has happened after 700 tweets, 4,100 tweets, 0 tweets...
An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1045)>
Hi @libbyh @meixingdg,
I have been able to solve the SSL certificate error with the following two lines at the beginning of the TweetManager.py file (note that this disables TLS certificate verification entirely, so use it with care):
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
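If disabling verification is a concern, here is an alternative sketch that keeps verification but points urllib at an up-to-date CA bundle; this assumes the third-party certifi package is installed, which the library itself does not use:

import ssl
import certifi

# Build the default HTTPS context from certifi's current CA bundle,
# which works around an expired system root certificate.
ssl._create_default_https_context = (
    lambda *args, **kwargs: ssl.create_default_context(cafile=certifi.where())
)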
I'm trying to use ratelimit with the modifications you described here, but so far ratelimit does not seem to work on my side.
Lines 277-279 and lines 351-355 are not related, right? Then are you sure ratelimit is working on your side? With only lines 351-355 and the time.sleep set to 15 minutes, I think the 429 issue is solved.
EDIT: Here is a new error I sometimes get when downloading tweets: An error occured during an HTTP request: <urlopen error [Errno -3] Temporary failure in name resolution>
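For reference, a minimal sketch of that catch-and-sleep approach; the surrounding request code is an assumption for illustration, and only the 429 handling mirrors what is described above:

import time
import urllib.request
import urllib.error

def fetch_with_backoff(url, wait_seconds=900):
    # Retry forever, sleeping 15 minutes whenever the server answers 429.
    while True:
        try:
            return urllib.request.urlopen(url).read()
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise
            time.sleep(wait_seconds)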
@meixingdg Did you figure it out? I get the 429 error after 10,000 tweets. I have tried retry and sleep, but it didn't work.
Did anyone figure out a good solution? I have been trying to get tweets for a popular hashtag, say #coronavirus, and ran into the same issue. I have been limiting the search to each day, yet on some days there are so many tweets that I still run into this issue. I used to get the 429 error; now it has switched to a 503 error. As someone stated, each time I retry, the total number of tweets gathered before the error is even smaller. I went as small as 50 tweets, put in a sleep for a minute, then began gathering again, and it still fails after about 2,000 tweets have been gathered. On the other extreme, I tried to get 10,000, then put it to sleep for 65 minutes. Still not much luck. Kinda stuck. Any thoughts or solutions?
import time
from datetime import datetime, timedelta

import GetOldTweets3 as got
import pandas as pd

def DownloadTweets(SinceDate, UntilDate, Query):
    '''
    Downloads all tweets matching Query from a certain month in three
    sessions, sleeping in between, in order to avoid sending too many requests.
    Date format = 'yyyy-mm-dd'.
    Query = string.
    '''
    since = datetime.strptime(SinceDate, '%Y-%m-%d')
    until = datetime.strptime(UntilDate, '%Y-%m-%d')
    tenth = since + timedelta(days=10)
    twentieth = since + timedelta(days=20)

    print('starting first download')
    first = got.manager.TweetCriteria().setQuerySearch(Query).setSince(since.strftime('%Y-%m-%d')).setUntil(tenth.strftime('%Y-%m-%d'))
    firstdownload = got.manager.TweetManager.getTweets(first)
    firstlist = [[tweet.date, tweet.text] for tweet in firstdownload]
    df_1 = pd.DataFrame.from_records(firstlist, columns=["date", "tweet"])
    #df_1.to_csv("%s_1.csv" % SinceDate)
    time.sleep(600)

    print('starting second download')
    second = got.manager.TweetCriteria().setQuerySearch(Query).setSince(tenth.strftime('%Y-%m-%d')).setUntil(twentieth.strftime('%Y-%m-%d'))
    seconddownload = got.manager.TweetManager.getTweets(second)
    secondlist = [[tweet.date, tweet.text] for tweet in seconddownload]
    df_2 = pd.DataFrame.from_records(secondlist, columns=["date", "tweet"])
    #df_2.to_csv("%s_2.csv" % SinceDate)
    time.sleep(600)

    print('starting third download')
    third = got.manager.TweetCriteria().setQuerySearch(Query).setSince(twentieth.strftime('%Y-%m-%d')).setUntil(until.strftime('%Y-%m-%d'))
    thirddownload = got.manager.TweetManager.getTweets(third)
    thirdlist = [[tweet.date, tweet.text] for tweet in thirddownload]
    df_3 = pd.DataFrame.from_records(thirdlist, columns=["date", "tweet"])
    #df_3.to_csv("%s_3.csv" % SinceDate)

    df = pd.concat([df_1, df_2, df_3])
    df.to_csv("%s.csv" % SinceDate)
    return df
I wrote this function in order to download all tweets matching a certain query from one month in three sections, sleeping in between requests. This allows me to download 20,000+ tweets in under an hour. The function takes SinceDate and adds 10 and 20 days, but you can change the intervals as needed for your own projects. Hope it's helpful!
#------
#Example:
#DownloadTweets('2019-01-01', '2019-01-31', 'klimaat')
Hi, has anybody found a solution to this yet? I am trying to download the tweets for a hot topic for a single day; however, any request above 10,000 tweets shows me the 429 error... @Clairedevries does your function solve this issue, and can it download hundreds of thousands of tweets for a single day?
Yes. I downloaded 312,000 tweets in a single day of running the function above; I ran it 12 times, once for every month in 2014. You might not be able to use this exact function, since it was made specifically for my project, but it shows how you can run a function and have it sleep in between in order to avoid errors. It might not work if there are more than 100,000 tweets in a single day, though.
I used the same DownloadTweets(SinceDate, UntilDate, Query) function as well. In case there are too many tweets and you get an error message, break the dates down into smaller intervals and save the data to CSV files in small increments.
The issue I find is that the geo information is not available. I am aware that not all tweets have geo information, but when downloading data with the Twitter API we get a small percentage of tweets that do have the geo info.
I noticed there is a buffer option in the library. By using it, I could update a .csv file for every 10 tweets returned by the library. Even in the cases where I got some error, the number of tweets saved was satisfactory for me. Basically, what I did was:
def partial_results(tweets):
    # receiveBuffer is called with a list of tweets, not a single tweet
    for tweet in tweets:
        print(tweet.text)

tweets = got.manager.TweetManager.getTweets(tweetCriteria, bufferLength=10, receiveBuffer=partial_results)
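Here is a sketch of the CSV-updating variant described in that comment; the query in tweetCriteria and the csv handling are illustrative assumptions, not the commenter's actual code:

import csv
import GetOldTweets3 as got

def append_to_csv(tweets):
    # Called with each batch of 10 tweets; appending immediately means
    # partial results survive a mid-download error.
    with open('tweets.csv', 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for tweet in tweets:
            writer.writerow([tweet.date, tweet.text])

tweetCriteria = got.manager.TweetCriteria().setQuerySearch('example query').setMaxTweets(100)
got.manager.TweetManager.getTweets(tweetCriteria, bufferLength=10, receiveBuffer=append_to_csv)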
Hey guys, playing around with time.sleep.
Does anyone know how to find out exactly how long I need to wait before retrying?
Yes, you can get it!
But the library stops the program when it gets an error, so what you can do is download the source code and change it yourself. If the response has a 429 status code, you can read the wait time (in seconds) from the Retry-After response header. Another way to handle it is to start with some arbitrary value and increase it if you get the error again.
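A minimal sketch of reading that header, assuming the request goes through urllib as it does in the library; the function name and single-retry policy here are illustrative:

import time
import urllib.request
import urllib.error

def fetch_respecting_retry_after(url, default_wait=60):
    try:
        return urllib.request.urlopen(url).read()
    except urllib.error.HTTPError as e:
        if e.code != 429:
            raise
        # The Retry-After header, when present, gives the wait in seconds.
        wait = int(e.headers.get('Retry-After', default_wait))
        time.sleep(wait)
        return urllib.request.urlopen(url).read()  # one retry after backing off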
Ah okay, let me dig into that. One last thing, mate: for the Retry-After response header, which file in the code should I be looking at?
Appreciate it and thank you!
Thank you @Clairedevries! I adopted and altered your code. The function below now waits after each day for a specified amount of sleep time; 15 minutes of sleep should be on the safe side given the API rate limits. Like yours, it does not work for too many tweets (say >100k) in a single day.
import GetOldTweets3 as got
import time
from datetime import datetime, timedelta

def DownloadTweets(SinceDate, UntilDate, query, sleep=900, maxtweet=0):
    #create a list of day numbers
    since = datetime.strptime(SinceDate, '%Y-%m-%d')
    days = list(range(0, (datetime.strptime(UntilDate, '%Y-%m-%d') - since).days + 1))
    tweets = []
    for day in days:
        init = got.manager.TweetCriteria().setQuerySearch(query).setSince((since + timedelta(days=day)).strftime('%Y-%m-%d')).setUntil((since + timedelta(days=day + 1)).strftime('%Y-%m-%d')).setMaxTweets(maxtweet)
        get = got.manager.TweetManager.getTweets(init)
        tweets.append([[tweet.id, tweet.date, tweet.text] for tweet in get])
        print("day", day + 1, "of", len(days), "completed")
        print("sleeping for", sleep, "seconds")
        time.sleep(sleep)
    #flatten list
    tweets = [tweet for sublist in tweets for tweet in sublist]
    return tweets
#%%
since = "2020-02-27"
until = "2020-03-01"
tweets = DownloadTweets(since, until, query='trump', maxtweet=10, sleep=10)