twitter-archive-parser

Expand shortened urls

Open abaumg opened this issue 3 years ago • 19 comments

Starting from the code uploaded by Atari-Frosch in #9, I added functionality to expand shortened URLs from known shorteners. This PR would fix #36 and #38 as well.

abaumg avatar Nov 13 '22 23:11 abaumg

I would actually prefer this to be a separate script - I've got Tweets going back to 2009, and I can imagine that if I hammer some of those shorteners with requests, they're going to block the script. If it just looped through the generated .md files looking for the shortened links and then expanded them, it would allow the script to abort/die/be cancelled and restarted later (perhaps from a different IP address) without any loss of progress.
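
Roughly what I have in mind - just a sketch, assuming the `requests` package; the shortener list, regex and file handling are illustrative, not a finished implementation:

```python
import glob
import re
import requests

# Hosts treated as shorteners -- illustrative, certainly not exhaustive.
SHORTENERS = ('t.co', 'bit.ly', 'tinyurl.com', 'ow.ly', 'goo.gl', 'is.gd')
URL_PATTERN = re.compile(
    r'https?://(?:%s)/[\w./-]+' % '|'.join(re.escape(h) for h in SHORTENERS))

def expand(url):
    """Follow redirects to the final URL; fall back to the original on any error."""
    try:
        return requests.head(url, allow_redirects=True, timeout=10).url
    except requests.RequestException:
        return url

for md_path in glob.glob('*.md'):
    with open(md_path, encoding='utf-8') as f:
        content = f.read()
    new_content = URL_PATTERN.sub(lambda m: expand(m.group(0)), content)
    if new_content != content:
        # Each file is written back as soon as it is done, so an aborted run
        # only loses the file that was being processed at the time.
        with open(md_path, 'w', encoding='utf-8') as f:
            f.write(new_content)
```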

rbairwell avatar Nov 13 '22 23:11 rbairwell

~I agree with @rbairwell. Let's keep parser.py as a tool that parses the archive using only local data. Extended functionality like download_better_images.py and URL expansion can be handled by separate tools.~

@abaumg How do you feel about changing the PR to:

  • ~move the code into a new script expand_urls.py that updates *.md in its local folder~
  • ~add a comment to the end of parser.py to tell the user about the possibility of expanding the URLs~

Do we want to add a small sleep into the loop to decrease the likelihood of being blocked? We do that for the image downloads.

[Edited to update suggestion] [Edited because I now want everything in parser.py to make things easier for the users]

timhutton avatar Nov 13 '22 23:11 timhutton

Sounds reasonable to me. I'll update the PR accordingly.

abaumg avatar Nov 14 '22 05:11 abaumg

> I would actually prefer this to be a separate script - I've got Tweets going back to 2009, and I can imagine that if I hammer some of those shorteners with requests, they're going to block the script. If it just looped through the generated .md files looking for the shortened links and then expanded them, it would allow the script to abort/die/be cancelled and restarted later (perhaps from a different IP address) without any loss of progress.

It can be, and originally was, a separate script. :-) In that state it should be run over your archived tweets before using the parser. That also makes it easier to extend the list of shorteners, as I had stored them in a separate config file. I'm pretty sure there are more shorteners than I had listed.

Atari-Frosch avatar Nov 14 '22 15:11 Atari-Frosch

> In that state it should be run over your archived tweets before using the parser.

This also makes more sense to me, because I'd rather not have to run the expansion again in order to tweak the markdown layout. Even better would be to store the mapping in a separate file, so the archive remains untouched (but I can just make a backup).

Sjors avatar Nov 14 '22 16:11 Sjors

Oops, closed this PR by accident while syncing my fork with upstream and switching branches. Will reopen.

abaumg avatar Nov 14 '22 21:11 abaumg

As suggested, I moved everything to a separate file. A map of the expanded links is saved as an *.ini file, although I'm not sure if ConfigParser is the best approach here.

In addition, the script tries to expand links in really old tweets as well, where there is no meta information, but only plain text.
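
Something like this is what I'm going for with ConfigParser - just a sketch, not the actual PR code; the section name `urls` and the file name `url_mapping.ini` are placeholders, and the delimiter/interpolation settings are my guess at what is needed so that the `:` and `%` characters in URLs survive a round trip:

```python
import configparser
import os

MAPPING_FILE = 'url_mapping.ini'  # placeholder name

def load_mapping(path=MAPPING_FILE):
    """Load an existing short->long URL map, or start with an empty one."""
    # URLs contain ':' and often '%', so don't treat ':' as a key/value
    # delimiter, switch off interpolation, and keep keys case-sensitive.
    cp = configparser.ConfigParser(delimiters=('=',), interpolation=None)
    cp.optionxform = str
    if os.path.exists(path):
        cp.read(path, encoding='utf-8')
    return dict(cp.items('urls')) if cp.has_section('urls') else {}

def save_mapping(mapping, path=MAPPING_FILE):
    """Write the short->long URL map back to disk."""
    cp = configparser.ConfigParser(delimiters=('=',), interpolation=None)
    cp.optionxform = str
    cp['urls'] = mapping
    with open(path, 'w', encoding='utf-8') as f:
        cp.write(f)
```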

abaumg avatar Nov 14 '22 23:11 abaumg

TODOs:

  • fix: the mapping file is regenerated on each run, as I struggled to find existing records with ConfigParser
  • use mapping file in parser.py for generating the markdown

abaumg avatar Nov 14 '22 23:11 abaumg

@abaumg Do you need to read tweet*.js in this script? We already expand the URLs in parser.py using the JSON. I had imagined expand_urls.py would work by reading the *.md files and expanding the URLs found there using requests.

timhutton avatar Nov 15 '22 00:11 timhutton

If we add other output formats besides markdown as an option, it makes sense to go over tweets*.js; otherwise we need to write new code to insert the expanded URLs for each output format. Working with mapping files both for media and URLs makes life easier down the road ;)

jwildeboer avatar Nov 15 '22 10:11 jwildeboer

@jwildeboer I agree that we may want mapping files for media and URLs at some point. For now I think let's adopt the simplest possible solution: it searches files for expandable URLs and replaces them.

[Edit: We now output html too, so updated comment to be more general.]

timhutton avatar Nov 15 '22 12:11 timhutton

@timhutton IMO it's easier to parse the structured tweet*.js JSON than to extract links from Markdown files, let alone from HTML. As @jwildeboer pointed out, mapping files make life easier. But you're the maintainer, you decide. So now that we already have two output formats, do we stick with modifying the MD output files or should we go for mapping files?

abaumg avatar Nov 19 '22 17:11 abaumg

I opened PR #85 because link handling for old tweets from before the introduction of the t.co shortener is pretty much absent. Links are just plain text in the full_text key, and the entities.urls and entities.media keys are simply empty.

The PR adds handling for these links by extracting them from the tweet text and storing them in the in-memory tweets structure. With the link expander as an external script, those in-memory links are of course not visible, because the in-memory structure is gone by the time it runs...

So this might be a good thing to think about here: how do we want to handle that situation? Should parser.py just write out a JSON list of links, expand_urls.py then iterate over that list and expand them, and a second run of parser.py pick up the expanded links from the JSON file and export the tweets correctly?
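
To illustrate the first half, the extraction and JSON export could look roughly like this (a sketch; the file name urls_to_expand.json and the shape of the in-memory tweets are assumptions on my part):

```python
import json
import re

# Bare URLs in the tweet text; old tweets have nothing in entities.urls/entities.media.
URL_RE = re.compile(r'https?://\S+')

def collect_plaintext_urls(tweets, path='urls_to_expand.json'):
    """Collect URLs found only in full_text and write them out for a later expansion pass."""
    urls = set()
    for tweet in tweets:
        if not tweet.get('entities', {}).get('urls'):
            urls.update(URL_RE.findall(tweet.get('full_text', '')))
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sorted(urls), f, indent=2)
    return urls
```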

ixs avatar Nov 19 '22 20:11 ixs

I've also seen zpr.io

yoshimo avatar Nov 20 '22 11:11 yoshimo

I've just read through this discussion and for me, it looks like a good solution would be like this:

  1. the parser asks 'do you want to expand shortened urls (using online lookup)?' before all the other parsing happens.
  2. If the mapping file (maybe just a JSON file if the config/ini format is difficult to handle?) doesn't exist yet, it is created (empty) and filled by local lookup in the archive files (the parser already does this lookup when parsing tweets, so it would only have to be moved to an earlier point in time, and the 'save to separate file' part would be new).
  3. if the user says yes to 1., the parser searches the js files for shortened urls that can't be resolved locally (i.e. they don't already appear in the mapping file), resolves them by online lookup, and adds them to the mapping file.
  4. whenever a shortened URL is encountered while parsing the archive, the parser can replace it with its expanded version by looking it up in the mapping file.

If you run the parser again later, it would not do new lookups for any urls that are already saved in the mapping file, so the traffic load is kept to a minimum.
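
As a rough sketch of steps 2 and 3 with a JSON mapping file (the file name url_mapping.json and the resolver callback are just placeholders for whatever the parser ends up using):

```python
import json
import os

MAPPING_PATH = 'url_mapping.json'  # placeholder name

def load_or_create_mapping(path=MAPPING_PATH):
    """Step 2: load the mapping file if it exists, otherwise start empty."""
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    return {}

def resolve_missing(short_urls, mapping, resolver, path=MAPPING_PATH):
    """Step 3: online lookup only for URLs that aren't in the mapping yet."""
    for url in short_urls:
        if url not in mapping:
            mapping[url] = resolver(url)  # resolver does the actual HTTP request
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(mapping, f, indent=2)
    return mapping
```

Step 4 is then just a dictionary lookup while writing the output files.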

flauschzelle avatar Nov 21 '22 21:11 flauschzelle

Running these remote lookups is super slow (mostly because we sleep for 0.75s between each for fear of being limited). And we might have a lot of them to do. Is there any possibility of a bulk lookup feature being available somewhere?

timhutton avatar Nov 23 '22 02:11 timhutton

Someone on mastodon said they hammer t.co to get the redirects in parallel and have never been rate-limited. If that's true with all of the shorteners then maybe we can make this workable.

On my archive of 1371 tweets I have 267 URLs to expand. With the sleep turned off this takes 67 seconds. For someone with 10x more tweets that might mean 10 minutes to retrieve the URLs. If we code it in such a way that it's not crucial that it finishes (as with the media downloads), and if we can parallelize it a bit, then I think we can run this at the end of parser.py.

(I'm getting the feeling that people don't mind leaving things running if they see the benefit. Someone today told me their media download took 14 hours and retrieved 12439 of 12459 files, with the missing 20 being 404s. They were delighted it had worked.)
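
As a rough idea of what the parallel retrieval could look like - just a sketch, assuming the requests package; the worker count is a guess, not something I've benchmarked:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

def expand_url(url):
    """Follow redirects; return the original URL if anything goes wrong."""
    try:
        return url, requests.head(url, allow_redirects=True, timeout=10).url
    except requests.RequestException:
        return url, url

def expand_urls_parallel(urls, max_workers=10):
    """Resolve many short URLs concurrently and return a short->long dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(expand_url, urls))
```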

@abaumg So to answer your question, maybe let's do the following:

  • If there's an existing cache urls_unshortened.txt that maps shortened URLs to their unshortened versions, then we load and use that.
  • On the first pass we call parse_tweets() and parse_direct_messages(), passing the cache dict. They write *.md / *.html with the existing URLs (or the ones found in the cache) and collect the ones that need unshortening.
  • We then ask the user if they want to try un-shortening N URLs (estimated time and KB). (Maybe we ask about all lengthy downloads before starting any of them.)
  • We run the retrieval, updating the urls_unshortened.txt cache file.
  • We then call parse_tweets() and parse_direct_messages() again, writing out *.md / *.html with the updated URLs.
  • All the code is in parser.py.

So this way there's only one script to run, we don't duplicate the tweet-parsing code, and we don't need to parse mds or html. If they run the script again then it will take less time (because of the cache). If the retrievals crash or the computer gets turned off then no big deal, we can carry on from where we left off.
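
The cache file itself could be as simple as one "short long" pair per line, appended to as we go, so that an interrupted run loses nothing (a sketch; exactly how parse_tweets()/parse_direct_messages() take the cache dict is still open):

```python
import os

CACHE_PATH = 'urls_unshortened.txt'

def load_url_cache(path=CACHE_PATH):
    """Read 'short long' pairs, one per line; tolerate a missing file."""
    cache = {}
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2:
                    cache[parts[0]] = parts[1]
    return cache

def append_to_url_cache(short_url, long_url, path=CACHE_PATH):
    """Append each resolved URL as soon as we have it, so a crash loses nothing."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(f'{short_url} {long_url}\n')
```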

Thoughts? Sorry for the churn on the thinking in the discussion above. We've come a long way in 10 days.

timhutton avatar Nov 23 '22 18:11 timhutton

Ways to break up the work into smaller PRs:

  • just the function to attempt to unshorten a URL (see the sketch after this list)
  • just the code to make use of an unshortening cache dict in parse_tweets
  • just the code to return from parse_tweets/DMs a list of URLs that can be unshortened
  • ~just the code to load and save the unshortening cache~
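
For the first item, a minimal sketch of what that function might look like (the HEAD-then-GET fallback is an assumption about how the various shorteners behave, not something tested against all of them):

```python
import requests

def try_unshorten(url, timeout=10):
    """Attempt to resolve a shortened URL to its destination.

    Returns the expanded URL, or the original one if the lookup fails,
    so callers never have to special-case errors.
    """
    try:
        # HEAD is cheap, but some shorteners reject it, so fall back to GET.
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        if response.status_code >= 400:
            response = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
            response.close()
        return response.url
    except requests.RequestException:
        return url
```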

[Edit: I've just seen that @ixs's #83 also caches user handles, and would be easily extended to URLs.]

timhutton avatar Nov 24 '22 14:11 timhutton

Hi, I also found URLs shortened via wp.me in my archive. Is there a repository of shortening services to expand?

slorquet avatar Feb 02 '23 12:02 slorquet