
Fetching status blocks for Tweets2011 - hits twitter api limit

Open nigel-v-thomas opened this issue 12 years ago • 8 comments

I have been trying to get this tool to download the blocks from the Tweets2011 collection, but the current implementation hits the Twitter API limit every time.

The Twitter read API is limited to 180 authenticated requests per rate-limit window (see http://twitter4j.org/en/api-support.html), and to 150 requests per hour when unauthenticated.
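To make the limit concrete, below is a minimal sketch (not code from this repo) of a fixed-window throttle a crawler could consult before each fetch. The class name, the injected clock, and the 15-minute (900,000 ms) window matching Twitter's v1.1 rate limiting are all illustrative assumptions.

```java
/** Fixed-window rate limiter: allows up to `limit` calls per window.
 *  Illustrative sketch only; not part of twitter-tools. */
public class WindowThrottle {
    private final int limit;
    private final long windowMillis;
    private long windowStart;
    private int used;

    public WindowThrottle(int limit, long windowMillis, long now) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.windowStart = now;
    }

    /**
     * Returns 0 if a request may go out now, otherwise the number of
     * milliseconds to wait until the window resets. Passing the clock
     * in as a parameter keeps the logic deterministic and testable.
     */
    public long millisUntilPermitted(long now) {
        if (now - windowStart >= windowMillis) {  // window rolled over
            windowStart = now;
            used = 0;
        }
        if (used < limit) {                       // budget remains
            used++;
            return 0;
        }
        return windowMillis - (now - windowStart); // budget spent: wait
    }
}
```

A crawler would call `millisUntilPermitted(System.currentTimeMillis())` before each request and `Thread.sleep` on a nonzero result, instead of blindly hitting the limit.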

I have tried

  • creating authenticated requests against the Twitter 1.1 API (since the older API is deprecated, and may be removed from March 2013 onwards)
  • parsing the content out of the web pages directly (a brittle solution!); however, this doesn't work with protected accounts or missing pages

Given the number of requests this solution generates, I am not sure how to rebuild the Tweets2011 corpus.
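A back-of-envelope calculation shows why the request count is prohibitive. Assuming the Tweets2011 collection holds roughly 16 million tweet ids (an approximate figure, not stated in this thread) and the v1.1 limit of 180 requests per 15-minute window, a single token fetching one status per request needs years:

```java
public class CrawlEstimate {
    /** Days needed to fetch `corpusSize` statuses one per request,
     *  with `perWindow` requests allowed per 15-minute window.
     *  Illustrative arithmetic only. */
    static long daysToCrawl(long corpusSize, int perWindow) {
        long perDay = perWindow * 4L * 24L;        // 4 windows/hour * 24 hours
        return (corpusSize + perDay - 1) / perDay; // ceiling division
    }

    public static void main(String[] args) {
        // ~16M ids at 180 requests per window => about 926 days
        System.out.println(daysToCrawl(16_000_000L, 180) + " days");
    }
}
```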

nigel-v-thomas avatar Feb 02 '13 12:02 nigel-v-thomas

This is a longstanding issue that the API-based crawler can't fix. The answer is to use the HTML-based crawlers, which need updating to Twitter's current HTML layout.

isoboroff avatar Feb 21 '13 18:02 isoboroff

Thank you for the feedback.

I have tried the HTML-based crawler and found that it does not work for users who have closed their accounts or made their tweets private since the original data was gathered.

I am resigned to believing that fetching status blocks, whether from the API or directly from the HTML, will no longer work, due to the API limits and to expired or private accounts and tweets.

nigel-v-thomas avatar Feb 21 '13 18:02 nigel-v-thomas

We are working on updating the HTML crawler. Stay tuned, or alternatively patches accepted ;-)

isoboroff avatar Feb 21 '13 21:02 isoboroff

I have submitted pull request https://github.com/lintool/twitter-tools/pull/12, which uses the API to fetch. I have not committed the code that fetches from HTML; it is a simple hack and suffers from the problems described earlier, i.e. expired or private tweets not being available. I can commit the code if needed.

nigel-v-thomas avatar Feb 21 '13 22:02 nigel-v-thomas

Could you commit the HTML scraping code somewhere for reference?

From what I understand, there's no way around the fact that private tweets and deleted tweets aren't available.

andrewyates avatar Feb 26 '13 16:02 andrewyates

@andrewyates @isoboroff FYI, below is the quick hack to scrape HTML content https://github.com/nigel-v-thomas/twitter-tools/commit/5689ff79705f63fc84cdaa40c239ed85c06825a2

nigel-v-thomas avatar Feb 26 '13 21:02 nigel-v-thomas

Hi Nigel,

I was trying the link https://github.com/nigel-v-thomas/twitter-tools and still getting the error "Unable to parse text from this, possible change in format" when using HTML scraping. Am I missing something?

shakirak avatar May 23 '13 04:05 shakirak

Hi Shakirak, it is quite likely the HTML markup has changed since I updated the code. As I said, this solution is far from ideal; it would be much better to use the standard API. Depending on what your end goal is:

  • if it is to reconstruct the 2011 corpus, then I am not sure this solution will help; I was not able to rebuild it.
  • if it is to work around the API limit, then one way to proceed is to tweak the code that parses the HTML markup in this file, see lines 65 to 67: https://github.com/nigel-v-thomas/twitter-tools/blob/5689ff79705f63fc84cdaa40c239ed85c06825a2/src/main/java/cc/twittertools/corpus/data/Status.java Another option is to use the Streaming API to work around the rate limit.
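To illustrate why that parsing code is so fragile, here is a minimal regex-based extraction sketch. The `tweet-text` class name and the markup shape are guesses for illustration only; real Twitter pages differ and change often, which is exactly why any scraper keyed to the markup breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlStatusExtract {
    // Pattern keyed to an assumed markup shape; Twitter's actual HTML
    // has changed repeatedly, so this is inherently brittle.
    private static final Pattern TWEET_TEXT =
        Pattern.compile("<p class=\"tweet-text\">(.*?)</p>", Pattern.DOTALL);

    static String extractText(String html) {
        Matcher m = TWEET_TEXT.matcher(html);
        if (!m.find()) {
            // Mirrors the failure mode above: the markup no longer matches.
            throw new IllegalStateException(
                "Unable to parse text from this, possible change in format");
        }
        return m.group(1).trim();
    }
}
```

Whenever the page layout shifts, the pattern stops matching and the scraper fails for every page, so the regex (or whatever markup-specific logic replaces it) has to be re-tuned by hand.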

nigel-v-thomas avatar May 23 '13 12:05 nigel-v-thomas