
Add topic mine support for the Pushshift verified twitter archive

Open epenn opened this issue 3 years ago • 6 comments

This adds support for pulling data from the Pushshift verified Twitter archive. A couple of things to note:

  • Implemented using Elasticsearch's scroll API for paging support.
  • After getting the results, this makes a second round trip to fill in the retweeted_status and quoted_status fields (when needed), since Pushshift saves space by stripping the payloads from quote tweets and retweets.
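
For readers unfamiliar with the two points above, here is a minimal sketch of the pattern, not the code in this PR. It assumes Elasticsearch-style responses (a `_scroll_id` plus `hits.hits`), and it assumes the stripped records still carry the referenced tweet's ID under hypothetical keys `retweeted_status_id` and `quoted_status_id`; the archive's real field names may differ. The network calls are passed in as plain callables so the logic stays independent of any particular client.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical mapping: key holding the referenced ID -> key to fill with the payload.
REFERENCE_FIELDS = {
    "retweeted_status_id": "retweeted_status",
    "quoted_status_id": "quoted_status",
}


def scroll_all(first_page: Callable[[], dict],
               next_page: Callable[[str], dict]) -> Iterator[dict]:
    """Yield each hit's _source, following _scroll_id until a page comes back empty."""
    page = first_page()
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit["_source"]
        page = next_page(page["_scroll_id"])


def fill_referenced_tweets(tweets: List[dict],
                           fetch_by_ids: Callable[[List[str]], Dict[str, dict]]) -> None:
    """Second round trip: look up referenced tweets once and attach them in place."""
    wanted = sorted({t[k] for t in tweets for k in REFERENCE_FIELDS if t.get(k)})
    if not wanted:
        return  # nothing was a retweet or quote tweet, so skip the extra request
    found = fetch_by_ids(wanted)
    for tweet in tweets:
        for id_key, payload_key in REFERENCE_FIELDS.items():
            ref_id = tweet.get(id_key)
            if ref_id in found:
                tweet[payload_key] = found[ref_id]
```

Batching all referenced IDs into one lookup keeps the second round trip to a single request per page of results rather than one request per tweet.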

epenn avatar Dec 15 '20 14:12 epenn

@epenn would you be able to rebase this off the top of the current master branch?

pypt avatar Feb 09 '21 14:02 pypt

We have the OK to deploy this. Is the code ready to merge and release?

rahulbot avatar Jul 06 '21 14:07 rahulbot

Oh, and I have merged in master, hope that's okay.

pypt avatar Jul 19 '21 18:07 pypt

Could you also have a look at the failing tests?

pypt avatar Jul 19 '21 18:07 pypt

I can handle the first question on context and purpose. This is part of our effort to build cross-platform topics. Jason over at PushShift.io runs an archive of "verified" tweets that he ingests and maintains. This code allows us to import tweets from this archive by adding it as another platform in a Topic. So it queries his API for matching tweets, extracts any shared links from them, and adds those into the Topic to be processed (and saves the tweets too). So at a high level this lets us discover links being shared in tweets about a topic and saves attention metrics about them.
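
The link-extraction step described above could look roughly like the sketch below. It is an illustration only, not the code in this PR: it assumes tweets follow the Twitter v1.1 object shape, where shared links sit under `entities.urls` with an `expanded_url`, and that the archive preserves those fields. The function name `extract_links` is hypothetical.

```python
from typing import List


def extract_links(tweet: dict) -> List[str]:
    """Collect unique shared links from a tweet and any retweeted or quoted tweet."""
    links: List[str] = []
    sources = [
        tweet,
        tweet.get("retweeted_status") or {},
        tweet.get("quoted_status") or {},
    ]
    for source in sources:
        entities = source.get("entities") or {}
        for entry in entities.get("urls") or []:
            # Prefer the expanded URL; fall back to the shortened one.
            link = entry.get("expanded_url") or entry.get("url")
            if link and link not in links:
                links.append(link)
    return links
```

Including the retweeted and quoted payloads matters here because a retweet often carries its links only in the original tweet's entities.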

rahulbot avatar Jul 19 '21 18:07 rahulbot

Thanks Rahul!

pypt avatar Jul 19 '21 18:07 pypt