
String format for --url-template command line argument

Open • martysteer opened this issue 3 years ago • 8 comments

I've searched through the source code and documentation but can't seem to find any examples demonstrating the --url-template command line argument for minet resolve. I figured out that the enricher wraps the row and that the template somehow needs curly braces to index the value, but I keep getting a KeyError and I'm not sure how to refer to the internal row/column to get the value. Do I refer to the value by CSV column name, or by row[columnname]?

e.g. minet resolve id --url-template 'https://twitter.com/i/web/status/{id}'

It would be great if you could add an example use case for --url-template to the docs for the resolve command!

Thx

martysteer avatar Jul 06 '21 17:07 martysteer

Hello @martysteer. Those templates are indeed not well documented, my bad, because they are not completely stable (or were not? maybe they are now). The reason your template does not work is that --url-template only receives the value kwarg, i.e. the value of the selected column in the current row. So your example should be:

minet resolve id --url-template 'https://twitter.com/i/web/status/{value}'

instead. I could add access to the whole row like with --filename-template, but truth be told, no one has had a use for it yet. Maybe you do.
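
For instance, a minimal end-to-end sketch (assuming a tweets.csv file whose id column holds the tweet ids; the file and column names are just illustrative):

# for a row whose id column contains 1234567890, {value} expands to
# https://twitter.com/i/web/status/1234567890
cat tweets.csv | minet resolve id --url-template 'https://twitter.com/i/web/status/{value}' > resolve-report.csv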

I will leave your issue open to remind me to add the proper documentation now.

Yomguithereal avatar Jul 07 '21 07:07 Yomguithereal

Thank you @Yomguithereal. The {value} works nicely!

It's not just tweets I'm using minet for, but my current use case is tweets: I'm trying to determine whether a bunch of tweet ids are still online, deleted or blocked. (They are over a year old and this is about reproducible research methods.) The rate limit on the developer API is way too slow to rehydrate the entire tweet id corpus. Unfortunately, Twitter doesn't return useful HTTP status codes: it returns 200 for almost everything and hides the tweet object statuses behind AJAX responses... so I may have to resort to scraping with Selenium.

Anyway, here's a related example which you might consider putting into the minet documentation (grin)... Because minet uses CSV column header names and my hundreds of CSV files don't have headers, I figured out how to inject the single header I needed using sed. It works when piped into minet:

sed '1 s/^/tid,,\n/' tweets.csv | minet resolve tid --url-template 'https://twitter.com/i/web/status/{value}' > resolve-report.csv

My CSV has 3 columns and the tweet id is in the first one, so I used empty commas for the empty column names - just enough to let minet grab the one I wanted.
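
(One caveat for anyone copying this: GNU sed interprets the \n in the replacement as a newline, which is what prepends the tid,, header line; BSD/macOS sed would need a literal newline there instead. You can sanity-check the injected header before piping into minet:)

sed '1 s/^/tid,,\n/' tweets.csv | head -3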

Thanks again for your help (and your wonderful tool!)

martysteer avatar Jul 07 '21 09:07 martysteer

Unfortunately twitter doesn't return good HTTP status codes because it returns 200 for most everything, and hides the tweet object statuses behind AJAX responses... so I may have to resort to scraping with selenium.

Or you could simulate those AJAX calls instead :) This is what I do to scrape the search when using minet twitter scrape tweets, for instance. If you give me some example tweet urls (including a now-unavailable one) I can test some things to help you.

Because minet uses csv column header names and my hundreds of CSV files don't have header names, I figured out how to inject the single header I needed using sed.

I could also maybe add a way to specify that you are passing a header-less CSV file, with a --no-headers flag for instance, and let you select your column by index.
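
Something along these lines, purely hypothetical since the flag does not exist yet (the column is selected by its index; names are illustrative):

minet resolve 0 tweets.csv --no-headers --url-template 'https://twitter.com/i/web/status/{value}' > resolve-report.csv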

Yomguithereal avatar Jul 07 '21 10:07 Yomguithereal

Rereading this part of your answer:

The rate limit on the developer API is way too slow to rehydrate the entire tweet id corpus.

I am wondering how scraping the tweets one by one could be faster than the batch methods of the Twitter API. The v1 lookup endpoint, for instance, can retrieve tweets 100 at a time: https://developer.twitter.com/en/docs/twitter-api/v1/tweets/post-and-engage/api-reference/get-statuses-lookup. I think the v2 API enables this as well.
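
(To make the batching concrete: a single request to that v1 endpoint can carry up to 100 comma-separated ids, and map=true makes it list the unavailable ones as well. A rough sketch with an app-only bearer token, ids and variable name illustrative:)

curl -s -H "Authorization: Bearer $BEARER_TOKEN" "https://api.twitter.com/1.1/statuses/lookup.json?id=123,456,789&map=true"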

Do you want to work with me to add a new minet twitter command to hydrate lists of tweet ids?

Yomguithereal avatar Jul 08 '21 09:07 Yomguithereal

Do you want to work with me to add a new minet twitter command to hydrate lists of tweet ids?

Yeah! I found a couple of example tweets yesterday. I'll drop you an email with the details. Might need a new GitHub issue too. :-)

martysteer avatar Jul 08 '21 10:07 martysteer

FYI, using the v1 API and hydrating ids 100 at a time with both app and user OAuth access, you can collect up to 11,520,000 tweets over 24 hours.
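
(That figure presumably comes from the documented statuses/lookup limits of 300 requests per 15-minute window with app auth plus 900 with user auth:)

# (300 app + 900 user) requests per window * 100 tweets per request * 96 windows per day
echo $(( (300 + 900) * 100 * 96 ))   # 11520000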

boogheta avatar Jul 08 '21 12:07 boogheta

Hello @martysteer. @ameliepelle just added a twitter attrition command that can check the availability status of large batches of tweets and return a reason why a tweet is unavailable when it is not found. I reckon it should be faster than twarc in this regard, as we are able to batch some things (we reversed and tweaked twarc's logic a little to use batch API queries). @ameliepelle will now add things related to unavailable retweets. Just note that the command is not yet released, but it is available on the master branch.
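
A hypothetical invocation, just to give the general shape (the exact arguments may differ, so check minet twitter attrition --help on master; it assumes your Twitter API credentials are supplied the same way as for the other minet twitter commands):

minet twitter attrition tweet_id tweets.csv > attrition-report.csv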

Yomguithereal avatar Oct 13 '21 13:10 Yomguithereal

@Yomguithereal @ameliepelle Amazing! Thank you so much. I really think others will find this useful too. I'll test it out on my next corpus when it is ready (soon!) and send you any notes.

BTW, I've been running a slightly tweaked twarc version since July when we last spoke... It only completed last week. haha!

martysteer avatar Oct 13 '21 13:10 martysteer