Add topic mine support for the Pushshift verified twitter archive
This adds support for pulling data from the Pushshift verified twitter archive. A couple things of note:
- Implemented using Elasticsearch's scroll API for paging support.
- This makes a second round trip after getting the results to fill in the `retweeted_status` and `quoted_status` fields (when needed), since Pushshift optimizes for space by removing the payloads from quote tweets and retweets.
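
For reviewers, the two points above roughly follow the shape below. This is a minimal sketch, not the code in this PR. It assumes the standard Elasticsearch scroll response layout (`_scroll_id`, `hits.hits[]._source`). The callables `start_search`, `continue_scroll`, and `fetch_by_ids`, and the reference fields `retweeted_status_id` and `quoted_status_id`, are hypothetical names used only for illustration; the fields Pushshift actually keeps on stripped tweets may differ.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical mapping: embedded payload field -> field assumed to hold the
# id of the referenced tweet once the payload has been stripped.
REFERENCE_FIELDS = {
    "retweeted_status": "retweeted_status_id",
    "quoted_status": "quoted_status_id",
}


def scroll_hits(
    start_search: Callable[[], dict],
    continue_scroll: Callable[[str], dict],
) -> Iterator[dict]:
    """Yield the _source of every hit, paging with a scroll id.

    start_search() stands in for the initial search request opened with a
    scroll timeout; continue_scroll(scroll_id) stands in for the follow-up
    scroll request. Paging stops on the first empty page.
    """
    page = start_search()
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit["_source"]
        page = continue_scroll(page["_scroll_id"])


def fill_referenced_tweets(
    tweets: List[dict],
    fetch_by_ids: Callable[[List[int]], Dict[int, dict]],
) -> None:
    """Fill in missing retweet/quote payloads with one extra lookup."""
    missing = set()
    for tweet in tweets:
        for payload_field, id_field in REFERENCE_FIELDS.items():
            if tweet.get(id_field) is not None and payload_field not in tweet:
                missing.add(tweet[id_field])

    # Skip the second round trip when nothing needs filling in.
    if not missing:
        return

    found = fetch_by_ids(sorted(missing))
    for tweet in tweets:
        for payload_field, id_field in REFERENCE_FIELDS.items():
            referenced = found.get(tweet.get(id_field))
            if referenced is not None and payload_field not in tweet:
                tweet[payload_field] = referenced
```

Passing the network calls in as callables keeps the paging and fill-in logic testable without a live archive. Collecting all missing ids first means the second round trip is a single batched lookup rather than one request per tweet.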
@epenn would you be able to rebase this off the top of the current `master` branch?
We have the OK to deploy this. Is the code ready to merge and release?
Oh, and I have merged in `master`, hope that's okay.
Could you also have a look at the failing tests?
I can handle the first question on context and purpose. This is part of our effort to build cross-platform topics. Jason over at PushShift.io runs an archive of "verified" tweets that he ingests and maintains. This code allows us to import tweets from this archive by adding it as another platform in a Topic. It queries his API for matching tweets, extracts any shared links from them, and adds those to the Topic to be processed (and saves the tweets too). At a high level, this lets us discover links being shared in tweets about a topic and saves attention metrics about them.
Thanks Rahul!