
Fetching status blocks for Tweets2011 - hits twitter api limit

Open nigel-v-thomas opened this issue 12 years ago • 8 comments

I have been trying to get this tool to download the blocks from the Tweets2011 collection, but the current implementation hits the Twitter API limit every time.

The Twitter read API is limited to 180 authenticated requests per rate-limit window (see http://twitter4j.org/en/api-support.html), and to 150 requests per hour when unauthenticated.
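To make the limit concrete, below is a minimal sketch (not code from this repo) of a fixed-window throttle a crawler could consult before each fetch. The class name, the injected clock, and the 15-minute (900,000 ms) window matching Twitter's v1.1 rate limiting are all illustrative assumptions.

```java
/** Fixed-window rate limiter: allows up to `limit` calls per window.
 *  Illustrative sketch only; not part of twitter-tools. */
public class WindowThrottle {
    private final int limit;
    private final long windowMillis;
    private long windowStart;
    private int used;

    public WindowThrottle(int limit, long windowMillis, long now) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.windowStart = now;
    }

    /**
     * Returns 0 if a request may go out now, otherwise the number of
     * milliseconds to wait until the window resets. Passing the clock
     * in as a parameter keeps the logic deterministic and testable.
     */
    public long millisUntilPermitted(long now) {
        if (now - windowStart >= windowMillis) {  // window rolled over
            windowStart = now;
            used = 0;
        }
        if (used < limit) {                       // budget remains
            used++;
            return 0;
        }
        return windowMillis - (now - windowStart); // budget spent: wait
    }
}
```

A crawler would call `millisUntilPermitted(System.currentTimeMillis())` before each request and `Thread.sleep` on a nonzero result, instead of blindly hitting the limit.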

I have tried

  • creating authenticated requests against the Twitter 1.1 API (since the older API is deprecated, and may be removed from March 2013 onwards)
  • parsing the content out of the web pages directly (a brittle solution!); however, this doesn't work with protected accounts or missing pages

Given the number of requests this solution generates, I am not sure how to rebuild the Tweets2011 corpus.
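A back-of-envelope calculation shows why the request count is prohibitive. Assuming the Tweets2011 collection holds roughly 16 million tweet ids (an approximate figure, not stated in this thread) and the v1.1 limit of 180 requests per 15-minute window, a single token fetching one status per request needs years:

```java
public class CrawlEstimate {
    /** Days needed to fetch `corpusSize` statuses one per request,
     *  with `perWindow` requests allowed per 15-minute window.
     *  Illustrative arithmetic only. */
    static long daysToCrawl(long corpusSize, int perWindow) {
        long perDay = perWindow * 4L * 24L;        // 4 windows/hour * 24 hours
        return (corpusSize + perDay - 1) / perDay; // ceiling division
    }

    public static void main(String[] args) {
        // ~16M ids at 180 requests per window => about 926 days
        System.out.println(daysToCrawl(16_000_000L, 180) + " days");
    }
}
```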

nigel-v-thomas avatar Feb 02 '13 12:02 nigel-v-thomas

This is a longstanding issue that the API-based crawler can't fix. The answer is to use the HTML-based crawlers, which need updating to Twitter's current HTML layout.

isoboroff avatar Feb 21 '13 18:02 isoboroff

Thank you for the feedback.

I have tried the HTML-based crawler and found that it does not work for users who have closed their accounts or made their tweets private since the original data was gathered.

I am resigned to believing that fetching status blocks, whether from the API or directly from the HTML, will no longer work, due to the API limits and to expired or private accounts and tweets.

nigel-v-thomas avatar Feb 21 '13 18:02 nigel-v-thomas

We are working on updating the HTML crawler. Stay tuned, or alternatively patches accepted ;-)

isoboroff avatar Feb 21 '13 21:02 isoboroff

I have submitted pull request https://github.com/lintool/twitter-tools/pull/12, which uses the API to fetch. I have not committed the code that fetches from HTML; it is a simple hack and suffers from the problems described earlier, i.e. expired or private tweets not being available. I can commit the code if needed.

nigel-v-thomas avatar Feb 21 '13 22:02 nigel-v-thomas

Could you commit the HTML scraping code somewhere for reference?

From what I understand, there's no way around the fact that private tweets and deleted tweets aren't available.

andrewyates avatar Feb 26 '13 16:02 andrewyates

@andrewyates @isoboroff FYI, below is the quick hack to scrape HTML content https://github.com/nigel-v-thomas/twitter-tools/commit/5689ff79705f63fc84cdaa40c239ed85c06825a2

nigel-v-thomas avatar Feb 26 '13 21:02 nigel-v-thomas

Hi Nigel,

I was trying the link https://github.com/nigel-v-thomas/twitter-tools and still getting the error "Unable to parse text from this, possible change in format" when using HTML scraping. Am I missing something?

shakirak avatar May 23 '13 04:05 shakirak

Hi Shakirak, it is quite likely the HTML markup has changed since I updated the code. As I said, this solution is far from ideal; it would be much better to use the standard API. Depending on what your end goal is:

  • if it is to reconstruct the 2011 corpus, then I am not sure this solution will help; I was not able to rebuild it.
  • if it is to work around the API limit, then one way to proceed is to tweak the code that parses the HTML markup in this file, see lines 65 to 67: https://github.com/nigel-v-thomas/twitter-tools/blob/5689ff79705f63fc84cdaa40c239ed85c06825a2/src/main/java/cc/twittertools/corpus/data/Status.java Another option is to use the Streaming API to work around the rate limit.
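To illustrate why that parsing code is so fragile, here is a minimal regex-based extraction sketch. The `tweet-text` class name and the markup shape are guesses for illustration only; real Twitter pages differ and change often, which is exactly why any scraper keyed to the markup breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlStatusExtract {
    // Pattern keyed to an assumed markup shape; Twitter's actual HTML
    // has changed repeatedly, so this is inherently brittle.
    private static final Pattern TWEET_TEXT =
        Pattern.compile("<p class=\"tweet-text\">(.*?)</p>", Pattern.DOTALL);

    static String extractText(String html) {
        Matcher m = TWEET_TEXT.matcher(html);
        if (!m.find()) {
            // Mirrors the failure mode above: the markup no longer matches.
            throw new IllegalStateException(
                "Unable to parse text from this, possible change in format");
        }
        return m.group(1).trim();
    }
}
```

Whenever the page layout shifts, the pattern stops matching and the scraper fails for every page, so the regex (or whatever markup-specific logic replaces it) has to be re-tuned by hand.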

nigel-v-thomas avatar May 23 '13 12:05 nigel-v-thomas