
Too Many Requests

Open ekalhor opened this issue 5 years ago • 32 comments

I wonder if there is a way to break the download into pieces and pause between pieces to avoid the "Too Many Requests" error? I am getting tweets for one heavily used word, and I want to break the job into batches of 10,000 tweets and pause between batches.

ekalhor avatar Feb 02 '20 04:02 ekalhor

I am sadly having the same issue. It used to work great but not anymore. Any solution from the more experienced coders?

Jetstarkiller avatar Feb 03 '20 19:02 Jetstarkiller

I'm having the same issue. Is there a way to make the code sleep in order to pace the number of requests?

klaralindahl avatar Feb 03 '20 19:02 klaralindahl

Same issue here too, I couldn't find a feature that allows the scraping process to sleep.

JerGag avatar Feb 04 '20 07:02 JerGag

Would using a proxy do the trick? If so, how? I tried to set a proxy in PyCharm and in the general settings, but no luck.

Jetstarkiller avatar Feb 05 '20 14:02 Jetstarkiller

I've been playing around with .setMaxTweets and also tried using time.sleep, but I'm still unable to collate a dataset bigger than about 10,000.

brndnsy avatar Feb 05 '20 14:02 brndnsy

> I've been playing around with .setMaxTweets and also tried using time.sleep, but I'm still unable to collate a dataset bigger than about 10,000.

The same. I am not sure what happened. The problem is that the query I am trying to run definitely matches more than 10,000 tweets in a 24-hour period. Can you search by the hour? That would be annoying, but it would still allow scraping the data for a whole day.

Jetstarkiller avatar Feb 05 '20 14:02 Jetstarkiller

> I've been playing around with .setMaxTweets and also tried using time.sleep, but I'm still unable to collate a dataset bigger than about 10,000.

> The same. I am not sure what happened. The problem is that the query I am trying to run definitely matches more than 10,000 tweets in a 24-hour period. Can you search by the hour? That would be annoying, but it would still allow scraping the data for a whole day.

It's not even working with 10k now :/ I'm guessing Twitter's team has put in countermeasures.

brndnsy avatar Feb 05 '20 14:02 brndnsy

I tried to alter the source code with time.sleep calls but got a different error:

An error occured during an HTTP request: HTTP Error 503: Service Temporarily Unavailable

brndnsy avatar Feb 05 '20 17:02 brndnsy

> I've been playing around with .setMaxTweets and also tried using time.sleep, but I'm still unable to collate a dataset bigger than about 10,000.

> The same. I am not sure what happened. The problem is that the query I am trying to run definitely matches more than 10,000 tweets in a 24-hour period. Can you search by the hour? That would be annoying, but it would still allow scraping the data for a whole day.

Since GetOldTweets3 mimics Twitter's advanced search, the smallest time unit is a day -- you can see the available options on Twitter's advanced search page.

For ranges longer than a day, it is easy to loop through the days and have the script sleep in between. I couldn't find a documented download rate limit for Twitter's server, but when I had it sleep for 16 minutes between two days, it seemed to recover from the Too Many Requests error.

My problem is when I download one day of tweets for a common word. I think it doesn't have much to do with the advanced search functionality anymore, which is a good thing. Similar to max tweets, there should be a way to cap the downloads while keeping the search alive (maybe that's the issue in your case, @lethalbeans?) and a placeholder for resuming the download after sleeping.

ekalhor avatar Feb 05 '20 19:02 ekalhor
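For anyone who wants to try the loop-through-days approach described above, here is a minimal sketch; `tweets_by_day` is a hypothetical helper, the 16-minute pause follows the report in the previous comment, and the criteria calls match the GetOldTweets3 usage shown later in this thread:

import time
from datetime import date, timedelta

import GetOldTweets3 as got

def tweets_by_day(query, start, end, pause=960):
    # start and end are datetime.date objects; end is exclusive.
    # pause=960 s (16 min) is the sleep one commenter reports was
    # enough to recover from the Too Many Requests error.
    results = []
    day = start
    while day < end:
        criteria = (got.manager.TweetCriteria()
                    .setQuerySearch(query)
                    .setSince(day.strftime('%Y-%m-%d'))
                    .setUntil((day + timedelta(days=1)).strftime('%Y-%m-%d')))
        results.extend(got.manager.TweetManager.getTweets(criteria))
        day += timedelta(days=1)
        if day < end:
            time.sleep(pause)
    return results

# e.g. tweets_by_day('coronavirus', date(2020, 2, 1), date(2020, 2, 4))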

@ekalhor thanks, I've managed to get 20k so far; aiming for 100k at least.

Yeah, that would be good. The current code uses try/except blocks and ends abruptly if the JSON request is not fulfilled. I'm currently trying to implement ratelimit:

https://github.com/tomasbasham/ratelimit

brndnsy avatar Feb 11 '20 00:02 brndnsy

I would propose changing the current exiting of the script to a unified exception. That way the user can decide to catch the exception and use any method to retry the request later. Using sys.exit() without any error code seems like the worst way to handle it.

Besides that, the approach from @lethalbeans seems like a really good idea to me. Do you have progress on that?

sebimarkgraf avatar Feb 16 '20 18:02 sebimarkgraf
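For illustration, the proposal might look something like the sketch below; TooManyRequestsError and get_with_retry are hypothetical names, and this only works against a patched copy of the library that raises instead of exiting:

import time

import GetOldTweets3 as got

class TooManyRequestsError(Exception):
    # hypothetical exception a patched TweetManager could raise
    # in place of printing the error and calling sys.exit()
    pass

def get_with_retry(criteria, wait=900):
    # with the patch in place, the caller picks its own retry policy
    try:
        return got.manager.TweetManager.getTweets(criteria)
    except TooManyRequestsError:
        time.sleep(wait)  # back off, then try once more
        return got.manager.TweetManager.getTweets(criteria)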

@ekalhor, this should do the trick

https://github.com/Mottl/GetOldTweets3/issues/3#issuecomment-527642499

bcornet1 avatar Mar 05 '20 09:03 bcornet1

@lethalbeans did you get ratelimit to work?

libbyh avatar Mar 06 '20 19:03 libbyh

Has anybody found a solution for this? I get the same error once I've collected 15,000 to 20,000 tweets.

mohamadre3a avatar Mar 09 '20 13:03 mohamadre3a

I used ratelimit to solve the 429 problem, but I eventually got another error:

An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)>

To use rate limit:

from ratelimit import limits, sleep_and_retry
ONE_MINUTE = 60

@sleep_and_retry
@limits(calls=30, period=ONE_MINUTE)
def callAPI...

libbyh avatar Mar 10 '20 12:03 libbyh
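Filled out, the decorated function might look like the sketch below; call_search_page and its url parameter are hypothetical stand-ins for the request that TweetManager performs internally:

import urllib.request

from ratelimit import limits, sleep_and_retry

ONE_MINUTE = 60

@sleep_and_retry                      # sleep until a slot frees up instead of raising
@limits(calls=30, period=ONE_MINUTE)  # allow at most 30 calls per minute
def call_search_page(url):
    # hypothetical stand-in for the HTTP request inside TweetManager
    with urllib.request.urlopen(url) as response:
        return response.read()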

I tried it and still got the "too many requests" error. Could you please share your complete code? (I mean the callAPI function.)

mohamadre3a avatar Mar 10 '20 13:03 mohamadre3a

I'm still working on it so haven't made a PR, but here's the code: https://github.com/libbyh/GetOldTweets-python/blob/py3/GetOldTweets3/manager/TweetManager.py

It's kinda sloppy because it includes both the ratelimit approach (lines 277-279) and a catch-and-sleep for the 429 (lines 351-355).

libbyh avatar Mar 10 '20 14:03 libbyh

> I used ratelimit to solve the 429 problem, but I eventually got another error:
>
> An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)>
>
> To use rate limit:
>
> from ratelimit import limits, sleep_and_retry
> ONE_MINUTE = 60
>
> @sleep_and_retry
> @limits(calls=30, period=ONE_MINUTE)
> def callAPI...

Has anyone solved the certificate-expired error or figured out what's causing it? I can't tell whether it's a rate-limit issue or something else. I'm also getting the certificate-expired error, but it has happened after 700 tweets, 4,100 tweets, 0 tweets...

An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1045)>

meixingdg avatar Mar 15 '20 17:03 meixingdg
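One workaround that keeps certificate verification enabled, assuming the failure comes from a stale CA bundle on the local machine, is to point OpenSSL at certifi's up-to-date bundle (pip install certifi); whether this helps depends on why verification is failing:

import os

import certifi

# OpenSSL consults SSL_CERT_FILE when building its default verify paths,
# so this must run before the first HTTPS request is made
os.environ['SSL_CERT_FILE'] = certifi.where()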

Hi @libbyh @meixingdg, I was able to work around the SSL certificate error with the following two lines at the beginning of the TweetManager.py file:

import ssl
# disables certificate verification for all default HTTPS requests
ssl._create_default_https_context = ssl._create_unverified_context

> I used ratelimit to solve the 429 problem, but I eventually got another error:
>
> An error occured during an HTTP request: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)>
>
> To use rate limit:
>
> from ratelimit import limits, sleep_and_retry
> ONE_MINUTE = 60
>
> @sleep_and_retry
> @limits(calls=30, period=ONE_MINUTE)
> def callAPI...

I'm trying to use ratelimit with the modifications you described here, but so far it does not seem to work on my side.

> I'm still working on it so haven't made a PR, but here's the code: https://github.com/libbyh/GetOldTweets-python/blob/py3/GetOldTweets3/manager/TweetManager.py
>
> It's kinda sloppy because it includes both the ratelimit approach (lines 277-279) and a catch-and-sleep for the 429 (lines 351-355).

Lines 277-279 and lines 351-355 are not related, right? Then are you sure ratelimit is working on your side? With only lines 351-355 and the time.sleep set to 15 minutes, I think the 429 issue is solved.

EDIT: Here is a new error I now sometimes get when downloading tweets: An error occured during an HTTP request: <urlopen error [Errno -3] Temporary failure in name resolution>

MichaelKarpe avatar Mar 23 '20 20:03 MichaelKarpe
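For reference, a minimal version of the catch-and-sleep idea discussed in the last few comments, assuming you are patching the request inside TweetManager or wrapping it yourself; open_with_backoff is a hypothetical helper, and the 900-second wait matches the 15-minute pause reported to work above:

import time
import urllib.request
from urllib.error import HTTPError

def open_with_backoff(request, wait=900, max_tries=3):
    # retry only on 429; re-raise anything else, or a 429 on the last try
    for attempt in range(max_tries):
        try:
            return urllib.request.urlopen(request)
        except HTTPError as err:
            if err.code != 429 or attempt == max_tries - 1:
                raise
            time.sleep(wait)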

@meixingdg did you figure it out? I get the 429 error after 10,000 tweets. I have tried retry-and-sleep, but it didn't work.

preetham-salehundam avatar Mar 26 '20 01:03 preetham-salehundam

Did anyone figure out a good solution? I have been trying to get tweets for a popular hashtag, say #coronavirus, and ran into the same issue. I have been limiting the search to a single day, yet on some days there are so many tweets that I still hit this error. I used to get the 429 error; now it has switched to a 503. As someone stated, each time I retry, the number of tweets collected before the error is even smaller. I went as small as 50 tweets, put in a sleep for a minute, then began gathering again, and it still fails after about 2,000 tweets have been gathered. At the other extreme, I tried getting 10,000 and then sleeping for 65 minutes. Still not much luck. Kinda stuck. Any thoughts or solutions?

spavank avatar May 29 '20 01:05 spavank

import time
from datetime import datetime, timedelta

import pandas as pd
import GetOldTweets3 as got

def DownloadTweets(SinceDate, UntilDate, Query):
    '''
    Downloads all tweets from a certain month in three sessions in order to avoid sending too many requests. 
    Date format = 'yyyy-mm-dd'. 
    Query=string.
    '''
    since = datetime.strptime(SinceDate, '%Y-%m-%d')
    until= datetime.strptime(UntilDate, '%Y-%m-%d')
    tenth = since + timedelta(days = 10)
    twentieth = since + timedelta(days=20)
    
    print ('starting first download')
    first = got.manager.TweetCriteria().setQuerySearch(Query).setSince(since.strftime('%Y-%m-%d')).setUntil(tenth.strftime('%Y-%m-%d'))
    firstdownload = got.manager.TweetManager.getTweets(first)
    firstlist=[[tweet.date, tweet.text] for tweet in firstdownload]
    
    df_1 = pd.DataFrame.from_records(firstlist, columns = ["date", "tweet"])
    #df_1.to_csv("%s_1.csv" % SinceDate)
    
    time.sleep(600)
    
    print ('starting second download')
    second = got.manager.TweetCriteria().setQuerySearch(Query).setSince(tenth.strftime('%Y-%m-%d')).setUntil(twentieth.strftime('%Y-%m-%d'))
    seconddownload = got.manager.TweetManager.getTweets(second)
    secondlist=[[tweet.date, tweet.text] for tweet in seconddownload]
    
    df_2 = pd.DataFrame.from_records(secondlist, columns = ["date", "tweet"])
    #df_2.to_csv("%s_2.csv" % SinceDate)
    
    time.sleep(600)
    
    print ('starting third download')
    third = got.manager.TweetCriteria().setQuerySearch(Query).setSince(twentieth.strftime('%Y-%m-%d')).setUntil(until.strftime('%Y-%m-%d'))
    thirddownload = got.manager.TweetManager.getTweets(third)
    thirdlist=[[tweet.date, tweet.text] for tweet in thirddownload]
    
    df_3 = pd.DataFrame.from_records(thirdlist, columns = ["date", "tweet"])
    #df_3.to_csv("%s_3.csv" % SinceDate)
    
    df=pd.concat([df_1,df_2,df_3])
    df.to_csv("%s.csv" % SinceDate)
  
    return df

I wrote this function to download all tweets matching a certain query from one month in three sections, sleeping in between requests. This allows me to download 20,000+ tweets in under an hour. The function takes SinceDate and adds 10 and 20 days to split the month, but you can change the intervals as needed for your own projects. Hope it's helpful!

Example: DownloadTweets('2019-01-01', '2019-01-31', 'klimaat')

Clairedevries avatar Jun 05 '20 15:06 Clairedevries

Hi, has anybody found a solution to this yet? I am trying to download tweets for a hot topic on a single day, but any request for more than 10,000 tweets gives me the 429 error... @Clairedevries does your function solve this issue, and can it download hundreds of thousands of tweets for a single day?

ArjunAcharya0311 avatar Jun 22 '20 15:06 ArjunAcharya0311

Yes. I downloaded 312,000 tweets in a single day by running the function above; I ran it 12 times, once for every month of 2014. You might not be able to use this exact function, since it was written specifically for my project, but it shows how you can run a function and have it sleep in between requests to avoid errors. It might not work if there are more than 100,000 tweets in a single day, though.

Clairedevries avatar Jun 22 '20 15:06 Clairedevries

I used the same DownloadTweets(SinceDate, UntilDate, Query) function as well. In case there are too many tweets and you get an error message, break the dates down into smaller intervals and save the data to CSV files in small increments.

The issue I find is that the geo information is not available. I am aware that not all tweets have geo information, but when downloading data with the Twitter API we do get a small % of tweets carrying the geo code info.

badaouisaad avatar Jul 14 '20 17:07 badaouisaad

I noticed there is a buffer option in the library. Using it, I could update a .csv file for every 10 tweets returned. Even in the cases where I got an error, the number of tweets saved was satisfactory for me. Basically what I did was:

def partial_results(tweets):
    # receiveBuffer is called with a list of up to bufferLength tweets
    for tweet in tweets:
        print(tweet.text)

tweets = got.manager.TweetManager.getTweets(tweetCriteria, bufferLength=10, receiveBuffer=partial_results)

cefasgarciapereira avatar Jul 23 '20 02:07 cefasgarciapereira
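Building on that, the callback can append each buffered batch to a .csv so partial progress survives an error; a sketch, with the file name and columns chosen arbitrarily:

import csv

import GetOldTweets3 as got

def save_batch(tweets):
    # getTweets calls this with a list of up to bufferLength tweets
    with open('tweets_partial.csv', 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for tweet in tweets:
            writer.writerow([tweet.id, tweet.date, tweet.text])

criteria = got.manager.TweetCriteria().setQuerySearch('example query').setMaxTweets(1000)
tweets = got.manager.TweetManager.getTweets(criteria, bufferLength=10, receiveBuffer=save_batch)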

Hey guys, playing around with time.sleep.

Does anyone know how to find out exactly how long I need to wait to retry?

darrenlimweiyang avatar Jul 26 '20 09:07 darrenlimweiyang

> Hey guys, playing around with time.sleep.
>
> Does anyone know how to find out exactly how long I need to wait to retry?

Yes, you can get it!

But the library stops the program when it hits an error, so what you can do is download the source code and change it yourself. If the response comes back with a 429 code, you can read the wait time (in seconds) from the "Retry-After" response header. Another way to handle it is to start with some arbitrary value and increase it if you get the error again.

cefasgarciapereira avatar Jul 26 '20 17:07 cefasgarciapereira
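Concretely, the except branch might read the header like this; the surrounding fetch function is a sketch of what a patched TweetManager could do, not the library's current behavior:

import time
import urllib.request
from urllib.error import HTTPError

def fetch(url):
    try:
        return urllib.request.urlopen(url)
    except HTTPError as err:
        if err.code == 429:
            # Twitter may say how long to back off; guess 60 s if the header is absent
            wait = int(err.headers.get('Retry-After', 60))
            time.sleep(wait)
            return urllib.request.urlopen(url)  # one retry after waiting
        raise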

> Hey guys, playing around with time.sleep. Does anyone know how to find out exactly how long I need to wait to retry?
>
> Yes, you can get it!
>
> But the library stops the program when it hits an error, so what you can do is download the source code and change it yourself. If the response comes back with a 429 code, you can read the wait time (in seconds) from the "Retry-After" response header. Another way to handle it is to start with some arbitrary value and increase it if you get the error again.

Ah okay, let me dig into that. One last thing, mate: for the Retry-After response header, which file in the code should I be looking at?

Appreciate it and thank you!

darrenlimweiyang avatar Jul 27 '20 03:07 darrenlimweiyang

> I wrote this function to download all tweets matching a certain query from one month in three sections, sleeping in between requests. This allows me to download 20,000+ tweets in under an hour. The function takes SinceDate and adds 10 and 20 days to split the month, but you can change the intervals as needed for your own projects. Hope it's helpful!

Thank you @Clairedevries! I adopted and altered your code. The function below waits after each day for a specified amount of sleep time; 15 minutes of sleep should be on the safe side given the API rate limits. This too does not work when there are too many tweets (say >100k) in a single day.

import GetOldTweets3 as got
import time
from datetime import datetime, timedelta

def DownloadTweets(SinceDate, UntilDate, query, sleep=900, maxtweet=0):
    # maxtweet=0 means no per-day cap in GetOldTweets3
    since = datetime.strptime(SinceDate, '%Y-%m-%d')
    until = datetime.strptime(UntilDate, '%Y-%m-%d')
    days = range((until - since).days + 1)
    tweets = []

    for day in days:
        init = (got.manager.TweetCriteria()
                .setQuerySearch(query)
                .setSince((since + timedelta(days=day)).strftime('%Y-%m-%d'))
                .setUntil((since + timedelta(days=day + 1)).strftime('%Y-%m-%d'))
                .setMaxTweets(maxtweet))
        get = got.manager.TweetManager.getTweets(init)
        tweets.append([[tweet.id, tweet.date, tweet.text] for tweet in get])
        print("day", day + 1, "of", len(days), "completed")
        print("sleeping for", sleep, "seconds")
        time.sleep(sleep)
    # flatten the per-day lists into one list of [id, date, text] records
    tweets = [tweet for sublist in tweets for tweet in sublist]
    return tweets

#%%
since = "2020-02-27"
until = "2020-03-01"

tweets = DownloadTweets(since, until, query='trump', maxtweet=10, sleep=10)

tredmill avatar Jul 27 '20 15:07 tredmill