
KeyError: 'items_html' when scraping

Open erb13020 opened this issue 5 years ago • 19 comments

When I am scraping for tweets for a given day, twitterscraper stops scraping tweets for that day and returns the following error.

ERROR: An unknown error occurred! Returning tweets gathered so far.
Traceback (most recent call last):
  File "/home/erb13020/PycharmProjects/untitled/venv/lib/python3.8/site-packages/twitterscraper/query.py", line 173, in query_tweets_once_generator
    new_tweets, new_pos = query_single_page(query, lang, pos)
  File "/home/erb13020/PycharmProjects/untitled/venv/lib/python3.8/site-packages/twitterscraper/query.py", line 100, in query_single_page
    html = json_resp['items_html'] or ''
KeyError: 'items_html'

Sometimes it will gather up to 20,000 tweets for a certain query on a certain day. Sometimes it will stop at around 20 tweets. Here is my full output for scraping all tweets about 'tesla' on March 1, 2020.

INFO: {'User-Agent': 'Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko', 'X-Requested-With': 'XMLHttpRequest'}
Scraping tweets for 1/3/2020
INFO: queries: ['tesla since:2020-03-01 until:2020-03-02']
INFO: Querying tesla since:2020-03-01 until:2020-03-02
INFO: Scraping tweets from https://twitter.com/search?f=tweets&vertical=default&q=tesla%20since%3A2020-03-01%20until%3A2020-03-02&l=
INFO: Using proxy 119.2.54.204:31322
INFO: Scraping tweets from https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=TWEET-1234265965061386240-1234267078539898880&q=tesla%20since%3A2020-03-01%20until%3A2020-03-02&l=
INFO: Using proxy 128.199.214.87:3128
ERROR: An unknown error occurred! Returning tweets gathered so far.
Traceback (most recent call last):
  File "/home/erb13020/PycharmProjects/untitled/venv/lib/python3.8/site-packages/twitterscraper/query.py", line 173, in query_tweets_once_generator
    new_tweets, new_pos = query_single_page(query, lang, pos)
  File "/home/erb13020/PycharmProjects/untitled/venv/lib/python3.8/site-packages/twitterscraper/query.py", line 100, in query_single_page
    html = json_resp['items_html'] or ''
KeyError: 'items_html'
INFO: Got 18 tweets for tesla%20since%3A2020-03-01%20until%3A2020-03-02.
INFO: Got 18 tweets (18 new).
Scraped 13 tweets for 1/3/2020

Here is my code

HEADERS_LIST = [
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; x64; fr; rv:1.9.2.13) Gecko/20101203 Firebird/3.6.13',
    'Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201',
    'Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16',
    'Mozilla/5.0 (Windows NT 5.2; RW; rv:7.0a1) Gecko/20091211 SeaMonkey/9.23a1pre'
]

twitterscraper.query.HEADER = {'User-Agent': random.choice(HEADERS_LIST), 'X-Requested-With': 'XMLHttpRequest'}

def scrape(d, m, y, query):

    begin_date = dt.date(y, m, d)
    end_date = begin_date + dt.timedelta(days=1)

    tweets = query_tweets(query, begindate=begin_date, enddate=end_date)

    df = pd.DataFrame(t.__dict__ for t in tweets)

    print('Scraped ' + str(len(df)) + ' tweets for ' + str(d) + '/' + str(m) + '/' + str(y))

    return df

I have tried looking around and played with different poolsizes but I'm not really sure what the issue is or where to start with fixing it. I am currently using version 1.5.0. Thank you!

erb13020 avatar Jul 26 '20 20:07 erb13020

I am also facing the same issue. Any updates on how to fix this?

Shagun-25 avatar Jul 27 '20 07:07 Shagun-25

I am getting this error for anything more than 10k tweets

Edit: Now showing up for even a lower number of tweets

ashgreat avatar Jul 27 '20 22:07 ashgreat

There is a try / except around that block of code, but it is specific to ValueError. I'll add KeyError as well so it does not break like this. You can expect a new version on PyPI later today. Thanks for bringing this up.
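A minimal sketch of what that change amounts to (a simplification for illustration, not the actual code in `query.py`):

```python
def parse_items_html(json_resp):
    # Previously only ValueError was caught here; catching KeyError as well
    # means a response without 'items_html' no longer aborts the whole scrape.
    try:
        return json_resp['items_html'] or ''
    except (ValueError, KeyError):
        return ''
```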

taspinar avatar Jul 28 '20 06:07 taspinar

@taspinar Thank you so much! I'll let you know if it works.

erb13020 avatar Jul 28 '20 08:07 erb13020

It should be fixed in version 1.6.1 available on pypi.

taspinar avatar Jul 28 '20 11:07 taspinar

Thank you @taspinar, it works and scrapes many more tweets than before.

erb13020 avatar Jul 28 '20 21:07 erb13020

It should be fixed in version 1.6.1 available on pypi.

I uninstalled the previous version and installed the new one. I am using CLI and it still throws up the same error.

ashgreat avatar Jul 29 '20 00:07 ashgreat

After testing, I'm getting a similar but distinct error. It's scraping a lot more tweets than before, but I wanted to share what I've been getting with version 1.6.1.

Traceback (most recent call last):
  File "/home/erb13020/PycharmProjects/untitled/venv/lib/python3.8/site-packages/twitterscraper/query.py", line 107, in query_single_page
    html = json_resp['items_html'] or ''
KeyError: 'items_html'

It doesn't cut off scraping, but this repeats a lot when I run the scraper.

erb13020 avatar Jul 29 '20 08:07 erb13020

In my case, I searched for the airline @SouthwestAir between 2020-01-01 and 2020-03-31 and requested 50,000 tweets. It kept running even while displaying this error, but the number of tweets returned was fewer than 200.

ashgreat avatar Jul 29 '20 11:07 ashgreat

@ashgreat Same for me, except I was able to get a lot more tweets about Tesla - somewhere in the tens of thousands. It's much better than version 1.5.0.

erb13020 avatar Jul 29 '20 19:07 erb13020

I am having the same problem, this is the error that comes up

Failed to parse JSON while requesting "https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=TWEET-1188605683345903616-1188606295571619846&q=bush%20since%3A2019-10-27%20until%3A2019-10-28&l=english"
Traceback (most recent call last):
  File "c:\users\chiar\appdata\local\programs\python\python38-32\lib\site-packages\twitterscraper\query.py", line 107, in query_single_page
    html = json_resp['items_html'] or ''
KeyError: 'items_html'
INFO:twitterscraper:Got 205 tweets (19 new).

I was trying to get 10,000 tweets but I get this error every time. I can scrape 100 tweets but no more than that. I am currently on version 1.6.1 and still seeing this issue.

160Bobo avatar Aug 04 '20 09:08 160Bobo

I am getting the same error. I printed json_resp and it is returning a message: {'message': 'Sorry, you are rate limited.'}. Maybe they caught wind of this trick and it no longer works?

Also, if anyone could explain poolsize, and how to pull the user information, that would be great! As of now I am building the JSON by doing the following:

import datetime as dt
import json
from twitterscraper import query_tweets

for username in ['@Nike', '@UnderArmour', '@Adidas']:
    json_lst = []
    for tweet in query_tweets(username, begindate=dt.date(2017, 1, 1),
                              enddate=dt.date.today(), poolsize=20,
                              lang='', use_proxy=True):
        # datetime is not JSON serializable, so stringify the timestamp first
        tweet.__dict__['timestamp'] = str(tweet.__dict__['timestamp'])
        # append the dict rather than the Tweet object so json.dump can serialize it
        json_lst.append(tweet.__dict__)
    with open(f'{username}.json', 'w') as f:  # 'w' avoids appending invalid JSON
        json.dump(json_lst, f, indent=4)
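Given that the failing responses look like {'message': 'Sorry, you are rate limited.'}, a quick defensive check before reading items_html could surface this explicitly (a sketch; the key names are taken from the output above):

```python
def rate_limit_message(json_resp):
    # A rate-limited response carries a 'message' key instead of 'items_html'.
    if 'items_html' not in json_resp and 'message' in json_resp:
        return json_resp['message']
    return None
```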

ThomasADuffy avatar Aug 05 '20 20:08 ThomasADuffy


Same here

JagritiJ avatar Aug 08 '20 15:08 JagritiJ

Looks like @ThomasADuffy identified the issue with #333

{'message': 'Sorry, you are rate limited.'} --- This is the json_resp.

clbarrell avatar Aug 12 '20 05:08 clbarrell

Is there any method to fix it?

HeroadZ avatar Aug 27 '20 14:08 HeroadZ

As of now, not sure. Adding a sleep timer, possibly a couple of seconds, might help. Twitter is probably flagging the IP address because it's paging through results too fast.
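For example, a sleep between page requests could be bolted on by wrapping the fetching callable (a sketch; the delay values are arbitrary):

```python
import random
import time

def throttled(fetch, min_delay=2.0, max_delay=5.0):
    # Wrap a page-fetching callable so each call is preceded by a random pause,
    # making the request pattern look less like rapid automated paging.
    def wrapper(*args, **kwargs):
        time.sleep(random.uniform(min_delay, max_delay))
        return fetch(*args, **kwargs)
    return wrapper
```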

ThomasADuffy avatar Aug 27 '20 14:08 ThomasADuffy

@ThomasADuffy Thank you for your advice. But with a sleep timer we end up fetching data over multiple requests. Could the results be redundant, i.e. could we get the same tweets more than once?
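If duplicates do show up, they can be filtered by tweet ID afterwards (a sketch; 'tweet_id' here is an assumption - use whatever unique ID field your Tweet objects actually expose):

```python
def dedupe_tweets(tweet_dicts, id_key='tweet_id'):
    # Keep only the first occurrence of each tweet ID, preserving order.
    seen = set()
    unique = []
    for t in tweet_dicts:
        if t[id_key] not in seen:
            seen.add(t[id_key])
            unique.append(t)
    return unique
```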

HeroadZ avatar Aug 27 '20 14:08 HeroadZ

The sleep timer doesn't help if I want to scrape more than 10k tweets in one query. If I split the limit into smaller chunks like 1k and use a sleep timer, the 1k of data is always the same. (And the limit parameter doesn't seem to work here; the length of the data varies.)

So does anyone have an idea about query_single_page? We could split the limit across a number of pages and scrape in a for-loop. I have no idea how the "pos" parameter works.
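Based only on the signature visible in the tracebacks above (new_tweets, new_pos = query_single_page(query, lang, pos)), a paging loop might look like the following sketch; treating pos as an opaque pagination cursor, with None requesting the first page, is an assumption:

```python
def scrape_paged(query_single_page, query, lang='', limit=1000):
    # 'pos' is treated as an opaque pagination cursor: None fetches the first
    # page, and each call returns the cursor for the next page.
    tweets, pos = [], None
    while len(tweets) < limit:
        new_tweets, pos = query_single_page(query, lang, pos)
        tweets.extend(new_tweets)
        if not new_tweets or pos is None:
            break
    return tweets[:limit]
```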

HeroadZ avatar Aug 28 '20 03:08 HeroadZ

After multiple tries and updating query.py, I am also stuck with the items_html error. Maybe this is because of the new API from Twitter. Waiting for a fix.

abhilashpanda04 avatar Sep 24 '20 04:09 abhilashpanda04