GetOldTweets3

HTTP Error, Gives 404 but the URL is working

Open sagefuentes opened this issue 3 years ago • 144 comments

Hi, I had a script running over the past weeks, and earlier today it stopped working. I keep receiving HTTPError 404, but the link provided in the error still brings me to a valid page. The code is below (all the variables mentioned are defined, and debugging shows that the error happens specifically in the Manager):

    tweetCriteria = got.manager.TweetCriteria().setQuerySearch(term)\
        .setMaxTweets(max_count)\
        .setSince(begin_timeframe)\
        .setUntil(end_timeframe)
    scraped_tweets = got.manager.TweetManager.getTweets(tweetCriteria)

The error message for this is the standard 404 error "An error occured during an HTTP request: HTTP Error 404: Not Found Try to open in browser:" followed by the valid link

As I have changed nothing about the folder, I suspect something has happened with my configuration, but I am wondering whether others are experiencing this too.

sagefuentes avatar Sep 18 '20 01:09 sagefuentes

Hello @sagefuentes, I'm dealing with the exact same issue. I have also been downloading tweets for the past weeks, and it suddenly stopped working, giving me a 404 error with a valid link.

I've no idea what might be the cause...

alberto-valdes avatar Sep 18 '20 01:09 alberto-valdes

Same here. I suddenly ran into this problem today, but everything worked fine yesterday.

caiyishu avatar Sep 18 '20 02:09 caiyishu

I am dealing with the same issue here. It is new today and appears to be caused by some change or bug on Twitter's server side. Running the command with debug=True shows that the URL used to get tweets is no longer available. Looking for a solution now.

taoyudong avatar Sep 18 '20 03:09 taoyudong

Also started having the same issue today.

mwaters166 avatar Sep 18 '20 04:09 mwaters166

I'm having the same issue as well! Does anyone have a solution for it?

MithilaGuha avatar Sep 18 '20 05:09 MithilaGuha

Yes, I am having the same issue. I guess everyone is having it.

baraths92 avatar Sep 18 '20 05:09 baraths92

I'm not sure if it is related to this issue, but some of the user_agents seem to be out of date

    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:62.0) Gecko/20100101 Firefox/62.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
    ]

https://github.com/Mottl/GetOldTweets3/blob/54a8e73d56953c7f664e073e121fa1413de4fbff/GetOldTweets3/manager/TweetManager.py#L13

alastairrushworth avatar Sep 18 '20 06:09 alastairrushworth
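(Illustration only, not code from the thread.) For anyone who wants to test whether the User-Agent matters, here is a minimal sketch of attaching an explicit User-Agent to a request with the standard library. The two strings in USER_AGENTS and the helper name build_request are assumptions made for the example; they are not part of GetOldTweets3.

```python
import random
import urllib.request

# Hypothetical examples; replace them with strings copied from a current browser.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36",
]


def build_request(url, user_agent=None):
    """Return a urllib Request that carries an explicit User-Agent header."""
    ua = user_agent or random.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": ua})


# Building the request does not send anything over the network.
req = build_request("https://twitter.com/search?q=from%3Atwitter")
print(req.get_header("User-agent"))  # urllib stores the key as "User-agent"
```

Later comments in this thread report that a newer User-Agent alone does not fix the 404, so treat this only as a way to rule the header out.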

same

stevedwards avatar Sep 18 '20 07:09 stevedwards

Seems to be a "bigger" problem? Other scrapers are having problems too: https://github.com/twintproject/twint/issues/915#issue-704034135

Sebastokratos42 avatar Sep 18 '20 08:09 Sebastokratos42

Here is the output with debug enabled. It shows the actual URL being called, and it seems that Twitter has removed the /i/search/timeline endpoint. :(

    https://twitter.com/i/search/timeline?vertical=news&q=from%3AREDACTED&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
    Host: twitter.com
    User-Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
    Accept: application/json, text/javascript, */*; q=0.01
    Accept-Language: en-US,en;q=0.5
    X-Requested-With: XMLHttpRequest
    Referer: https://twitter.com/i/search/timeline?vertical=news&q=from%3AREDACTED&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
    Connection: keep-alive
    An error occured during an HTTP request: HTTP Error 404: Not Found
    Try to open in browser: https://twitter.com/search?q=%20from%3AREDACTED&src=typd

Daviey avatar Sep 18 '20 08:09 Daviey
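(Illustration only, not code from the thread.) To see exactly what the failing request asks for, the URL from the debug output can be split into its path and query parameters with the standard library. This is a small sketch that parses the string offline; it sends no request.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the debug output above.
debug_url = (
    "https://twitter.com/i/search/timeline?vertical=news&q=from%3AREDACTED"
    "&src=typd&&include_available_features=1&include_entities=1"
    "&max_position=&reset_error_state=false"
)

parts = urlsplit(debug_url)
# keep_blank_values=True keeps the empty max_position parameter visible.
params = parse_qs(parts.query, keep_blank_values=True)

print(parts.path)
for name, values in params.items():
    print(name, "=", values)
```

The path printed here (/i/search/timeline) differs from the /search path in the "Try to open in browser" link, which would explain why the link in the error message opens fine while the request itself fails.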

Same problem, damn

danielo93 avatar Sep 18 '20 09:09 danielo93

I'm not sure if it is related to this issue, but some of the user_agents seem to be out of date

I forked the repository and created a branch that allows a user-specified UA. Using samples from my current browser doesn't fix the problem.

I notice that the search and referrer URL shown in the --debug output (https://twitter.com/i/search/timeline) returns a 404 error:

    $ GetOldTweets3 --username twitter --debug
    /home/inactivist/.local/bin/GetOldTweets3 --username twitter --debug
    GetOldTweets3 0.0.11
    Downloading tweets...
    https://twitter.com/i/search/timeline?f=tweets&vertical=news&q=from%3Atwitter&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
    Host: twitter.com
    User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0
    Accept: application/json, text/javascript, */*; q=0.01
    Accept-Language: en-US,en;q=0.5
    X-Requested-With: XMLHttpRequest
    Referer: https://twitter.com/i/search/timeline?f=tweets&vertical=news&q=from%3Atwitter&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
    Connection: keep-alive
    An error occured during an HTTP request: HTTP Error 404: Not Found
    Try to open in browser: https://twitter.com/search?q=%20from%3Atwitter&src=typd
    $ curl -I https://twitter.com/i/search/timeline
    HTTP/2 404
    [snip]

EDIT: The URL used for the internal search and the one shown in the exception message aren't the same...

inactivist avatar Sep 18 '20 11:09 inactivist
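(Illustration only, not code from the thread.) The curl -I check above can be reproduced from Python, which makes it easy to compare the internal endpoint with the URL shown in the error message. This is a minimal sketch using only the standard library. The function name head_status and its opener parameter are assumptions made for the example; the opener exists so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def head_status(url, opener=urllib.request.urlopen):
    """Send a HEAD request and return the HTTP status code."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with opener(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still what we want.
        return err.code
```

Calling head_status on https://twitter.com/i/search/timeline and on the "Try to open in browser" URL should show whether the two really behave differently. What the servers return today may differ from what was reported in this thread.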

I tried replacing https://twitter.com/i/search/timeline with https://twitter.com/search?. The 404 error is gone, but now there is a 400 Bad Request error.

baraths92 avatar Sep 18 '20 11:09 baraths92

Unfortunately, I have the same problem. I hope we find a solution as soon as possible.

herdemo avatar Sep 18 '20 11:09 herdemo

I tried replacing https://twitter.com/i/search/timeline with https://twitter.com/search?. The 404 error is gone, but now there is a 400 Bad Request error.

Switching to mobile.twitter.com/search and using a modern User-Agent header seems to get us past the 400 bad request error, but then we get Error parsing JSON...

inactivist avatar Sep 18 '20 11:09 inactivist

Same thing for me. I get an error 404 but the URL is working.

rennanharo avatar Sep 18 '20 12:09 rennanharo

I have same issue

shelu16 avatar Sep 19 '20 02:09 shelu16

I am experiencing the same issue. Any plan to fix the issue?

maldil avatar Sep 19 '20 05:09 maldil

same issue, somebody help.

alifzl avatar Sep 19 '20 05:09 alifzl

Same issue. The same code was working a day ago; now it's giving a 404 error with a valid link.

chinmuxmaximus avatar Sep 19 '20 06:09 chinmuxmaximus

Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html

GabrielEspeschit avatar Sep 19 '20 14:09 GabrielEspeschit

I am having the same issue. It was more robust than Tweepy. I hope we find a solution as soon as possible.

rsafa avatar Sep 19 '20 15:09 rsafa

Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html

Unfortunately, the Twitter API does not fully meet our needs, because we need full-history search without any limitations. You can search only 5000 tweets a month with the Twitter API.
I hope GetOldTweets starts working again as soon as possible; otherwise I cannot complete my master's thesis.

herdemo avatar Sep 19 '20 18:09 herdemo

I have same issue. Need some help here

Fiyinfoluwa6 avatar Sep 20 '20 12:09 Fiyinfoluwa6

Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html

Unfortunately, the Twitter API does not fully meet our needs, because we need full-history search without any limitations. You can search only 5000 tweets a month with the Twitter API. I hope GetOldTweets starts working again as soon as possible; otherwise I cannot complete my master's thesis.

I see! I'm fairly new to scraping, but I'm working on an end-of-course thesis about sentiment analysis and could really use some newer tweets to help me out.

I've been tinkering with GOT3's code a bit and got it to read the HTML of the search timeline; however, it's mostly unformatted. Like I said, I have little experience with scraping, so I'm really struggling to format it correctly. I will note my changes here for reference, in case someone with more experience wants to pick them up:

  • updated user_agents (updated with the ones used by TWINT);

  • updated endpoint (/search?)

  • some updates to the URL structure:

      url = "https://twitter.com/search?"

      url += ("q=%%20%s&src=typd%s"
              "&include_available_features=1&include_entities=1&max_position=%s"
              "&reset_error_state=false")

      if not tweetCriteria.topTweets:
          url += "&f=live"

Edit: I forgot to mention this. Sometimes the application gives me a 400: Bad Request; when I run it again, it outputs the HTML as described above.

GabrielEspeschit avatar Sep 20 '20 15:09 GabrielEspeschit
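(Illustration only, not code from the thread.) The URL changes in the comment above rely on hand-written percent-escapes inside a format string. As an alternative, here is a minimal sketch that lets urllib.parse.urlencode do the escaping. The helper name build_search_url is hypothetical, and the parameter set is reduced to the ones that are self-explanatory (q, src, f); whether Twitter accepts this exact set is an assumption that has not been verified.

```python
from urllib.parse import urlencode


def build_search_url(query, live=True, base="https://twitter.com/search"):
    """Build a search-page URL; urlencode handles the percent-escaping."""
    params = {"q": query, "src": "typd"}
    if live:
        # Mirrors the "&f=live" suffix from the comment above.
        params["f"] = "live"
    return base + "?" + urlencode(params)


print(build_search_url("from:twitter"))
# prints https://twitter.com/search?q=from%3Atwitter&src=typd&f=live
```

This only builds the string. It does not address the 400 Bad Request responses mentioned in the edit above.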

Same issue here. Does anyone have any idea how to solve this?

brianleegit avatar Sep 21 '20 07:09 brianleegit

Same issue here. It stopped working three days ago.

afghanpasha avatar Sep 21 '20 09:09 afghanpasha

@GabrielEspeschit The HTML problem should be easy to solve with some BeautifulSoup manipulation. However, I cannot get the BeautifulSoup functions to work, because the script keeps failing with an error: Error parsing JSON: <?xml version="1.0" encoding="utf-8"?> As a result, I cannot continue using the obtained HTML.

Any idea on how to avoid this error?

JacoboGuijar avatar Sep 21 '20 09:09 JacoboGuijar
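(Illustration only, not code from the thread.) The "Error parsing JSON" message followed by an XML declaration suggests that the response body is markup, while the code still tries to decode it as JSON. Here is a minimal sketch of checking the body before choosing a parser. The function name parse_body is hypothetical, and where such a check would go inside GetOldTweets3 is left open.

```python
import json


def parse_body(body):
    """Return ("json", data) if body is valid JSON, else ("markup", body)."""
    try:
        return "json", json.loads(body)
    except json.JSONDecodeError:
        # HTML or XML (for example a body starting with "<?xml") lands here
        # instead of aborting, so it can be handed to an HTML parser.
        return "markup", body


kind, payload = parse_body('<?xml version="1.0" encoding="utf-8"?><html></html>')
print(kind)  # prints markup
```

With a check like this, the markup branch could pass the text on to BeautifulSoup rather than stopping at the JSON error.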

same issue here, I think this is because twitter has removed the endpoint https://twitter.com/i/search/timeline?

007sumitsingh avatar Sep 21 '20 11:09 007sumitsingh

Same issue with me and some of my other colleagues. I noticed other scrapers are also running into this problem like Twint https://github.com/twintproject/twint/issues/918

It seems to follow along the same logic mentioned above that it's likely an update on Twitter's part that is making these scraping libraries not work.

MartinBeckUT avatar Sep 21 '20 20:09 MartinBeckUT