twitter-archive-parser
Correctly recognize links in full_text for old tweets
Old tweets from before Twitter added t.co shortlinks do not contain their URLs in the entities.urls list. As a result, those URLs appear as plain text rather than links in the archive output. Fix that by extracting links from the tweet's full_text using urlparse.
Once they have been added to the entities.urls list, regular linking logic works great and links become clickable.
Added: Extract links from full_text, populating entities.urls
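For readers following along, here is a minimal sketch of the idea described above. It is not the PR's actual code. The function name `extract_urls_from_text` is hypothetical, and the shape of the generated entries (`url`, `expanded_url`, `display_url`, with no `indices`) is an assumption about what the downstream linking logic needs.

```python
from urllib.parse import urlparse


def extract_urls_from_text(full_text):
    """Return entities.urls-style dicts for words that parse as http(s) URLs.

    Hypothetical helper: the entry layout below is an assumption, not the
    archive's guaranteed schema.
    """
    urls = []
    for word in full_text.split():
        try:
            parsed = urlparse(word)
        except ValueError:
            # urlparse can raise on malformed input such as a broken IPv6 host
            continue
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append({"url": word, "expanded_url": word, "display_url": word})
    return urls


tweet = {
    "full_text": "I am laughing sooooo hard right now: http://bit.ly/b7fc9f "
                 "Bodenhansa - A Streik Alliance Member... :-) Nice one @Titanic",
    "entities": {"urls": []},
}
tweet["entities"]["urls"] = extract_urls_from_text(tweet["full_text"])
print(tweet["entities"]["urls"])
```

Splitting on whitespace keeps the sketch simple, but it would miss URLs that are glued to punctuation, so treat it as a starting point only.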
If I add debug I get this on my archive:
-- Adding https://t.co/… as a link --
-- Adding https://t.co/PxwGCm… as a link --
-- Adding https://t.co/nCYVcXprl… as a link --
-- Adding http://t.co/gFmt… as a link --
-- Adding https://t.co/QBl2mgEZyy as a link --
-- Adding https://t.co/Y1az4BE4… as a link --
-- Adding https://t.co/CzCm… as a link --
-- Adding https://t.c… as a link --
-- Adding http://t.co… as a link --
-- Adding https://t.co/W… as a link --
-- Adding https://t.co/… as a link --
-- Adding https://t.co/ol… as a link --
-- Adding https://t… as a link --
-- Adding https://t.c… as a link --
These truncated URLs come from retweets, eg:
"full_text" : "RT @Awfidius: @MSFTResearch So exciting! See the view through my HoloLens with the Research Mode sample app running... https://t.co/PxwGCm…",
Of those URLs only one is not truncated, and it is already treated correctly as a link in the html and md.
So you're seeing something different in your archive? Can you post an example of the JSON and the resulting html / md output?
I think your twitter data is not old enough.
You have t.co links, which is the Twitter link shortener.
Tweets that were created before t.co came into existence do not contain any links in the entities.urls key.
It's just a plain text link in the plaintext field of the tweet. There is also no truncation going on because there was no retweet function at that point in time. If you wanted a retweet, you had to put RT in front of the text manually.
An example from my tweets.js file:
{
"tweet" : {
"edit_info" : {
"initial" : {
"editTweetIds" : [
"9476040160"
],
"editableUntil" : "2010-02-22T14:05:01.000Z",
"editsRemaining" : "5",
"isEditEligible" : true
}
},
"retweeted" : false,
"source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
"entities" : {
"hashtags" : [ ],
"symbols" : [ ],
"user_mentions" : [
{
"name" : "TITANIC",
"screen_name" : "titanic",
"indices" : [
"112",
"120"
],
"id_str" : "12509262",
"id" : "12509262"
}
],
"urls" : [ ]
},
"display_text_range" : [
"0",
"120"
],
"favorite_count" : "0",
"id_str" : "9476040160",
"truncated" : false,
"retweet_count" : "0",
"id" : "9476040160",
"created_at" : "Mon Feb 22 13:35:01 +0000 2010",
"favorited" : false,
"full_text" : "I am laughing sooooo hard right now: http://bit.ly/b7fc9f Bodenhansa - A Streik Alliance Member... :-) Nice one @Titanic",
"lang" : "en"
}
},
Keep in mind, that the bit.ly link has expired. Without this PR, there is no special handling for the link inside the full_text, no html a-tag etc.
If I add debug I get this on my archive:
-- Adding https://t.co/… as a link --
-- Adding https://t.co/PxwGCm… as a link --
-- Adding https://t.co/nCYVcXprl… as a link --
-- Adding http://t.co/gFmt… as a link --
Where is that debug output coming from? Not from my change, right?
I added the debugging output, printing out the value of word.
So presumably your links appear as links in the md files when rendered, because they're just plain text? The issue is that in the html output they don't get made into links? (I suggest next time raising an issue with all the info, so we can understand the problem before seeing a PR.)
I'm happy for this change to go in, to improve the html output for people with old tweets. But it should handle the truncated URLs issue I'm seeing on my archive with this change. Probably just ignoring a URL if it contains … will be enough?
bit.ly links expire? Yuck. URL shorteners are such nonsense.
I added the debugging output, printing out the value of word.
That is unexpected. The code is gated behind a check that should prevent that in line 151: https://github.com/timhutton/twitter-archive-parser/pull/85/files#diff-df00b01568933a06c611778d2d70679891ebf6e950241d03bb2aa27f3e196fe0R151:
Only run the link extractor on tweets that do not have any entities.urls entries nor an entities.media key.
Could you please paste me the tweets.js entry for one of the tweets that get parsed?
If I look at https://t.co/QBl2mgEZyy, which is the only non-truncated link, it is a media link, so I would have expected entities.media to be populated. It seems it isn't, which is why your tweets.js entry is interesting.
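The gate mentioned above (only run the extractor when a tweet has no entities.urls entries and no entities.media key) could look roughly like this. The function name `needs_link_extraction` is hypothetical and this is a sketch of the described check, not a copy of line 151 of the PR.

```python
def needs_link_extraction(tweet):
    """True if the tweet has no entities.urls entries and no entities.media key."""
    entities = tweet.get("entities", {})
    has_urls = bool(entities.get("urls"))
    has_media = "media" in entities
    return not has_urls and not has_media


old_tweet = {"entities": {"hashtags": [], "urls": []}}
media_tweet = {"entities": {"urls": [], "media": [{"url": "https://t.co/QBl2mgEZyy"}]}}

print(needs_link_extraction(old_tweet))
print(needs_link_extraction(media_tweet))
```

A tweet like the truncated retweet discussed in this thread passes such a gate, because it has neither urls entries nor a media key, which would explain why the extractor sees its t.co link.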
So presumably your links appear as links in the md files when rendered, because they're just plain text? The issue is that in the html output they don't get made into links? (I suggest next time raising an issue with all the info, so we can understand the problem before seeing a PR.)
The md output is okay, because a Markdown parser turns the plain-text URL into a link when rendering, effectively "linkifying" it. The HTML output is missing the link, of course.
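To illustrate the difference, here is a small sketch of what the HTML side has to do explicitly once a URL is known. The helper name `linkify_html` is hypothetical, and the project's real HTML generation may work differently.

```python
import html


def linkify_html(full_text, urls):
    """Escape the tweet text and wrap each known URL in an <a> tag."""
    escaped = html.escape(full_text)
    for url in urls:
        escaped_url = html.escape(url)
        anchor = f'<a href="{escaped_url}">{escaped_url}</a>'
        # Replace only the first occurrence so the new anchor is not re-wrapped
        escaped = escaped.replace(escaped_url, anchor, 1)
    return escaped


print(linkify_html("right now: http://bit.ly/b7fc9f Bodenhansa", ["http://bit.ly/b7fc9f"]))
```

The Markdown output needs no such step, since most renderers auto-link bare URLs on their own.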
I'm happy for this change to go in, to improve the html output for people with old tweets. But it should handle the truncated URLs issue I'm seeing on my archive with this change. Probably just ignoring a URL if it contains … will be enough?
Agreed that it makes sense to fix that part. But I am trying to understand why this happens.
bit.ly links expire? Yuck. URL shorteners are such nonsense.
Agreed on the nonsense. And unclear why the links now just return a 404. I believe they were still returning redirects a few days ago but what do I know about bit.ly operations...
"Bitly ~shortens~ [ruins] 600 million links per month. .. Bitly makes money by charging for access to ~aggregate data~ [your activity on the internet collected by tracking cookies] created as a result of many people using the shortened URLs. .. Worth: $64 million" [0] 🤮
Full JSON for the tweet I mentioned above:
"tweet" : {
"retweeted" : false,
"source" : "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
"entities" : {
"hashtags" : [ ],
"symbols" : [ ],
"user_mentions" : [
{
"name" : "Andrew Fitzgibbon",
"screen_name" : "Awfidius",
"indices" : [
"3",
"12"
],
"id_str" : "53514472",
"id" : "53514472"
},
{
"name" : "Microsoft Research",
"screen_name" : "MSFTResearch",
"indices" : [
"14",
"27"
],
"id_str" : "21457289",
"id" : "21457289"
}
],
"urls" : [ ]
},
"display_text_range" : [
"0",
"140"
],
"favorite_count" : "0",
"id_str" : "1008767532097064961",
"truncated" : false,
"retweet_count" : "0",
"id" : "1008767532097064961",
"created_at" : "Mon Jun 18 17:44:56 +0000 2018",
"favorited" : false,
"full_text" : "RT @Awfidius: @MSFTResearch So exciting! See the view through my HoloLens with the Research Mode sample app running... https://t.co/PxwGCm…",
"lang" : "en"
}
Funky. That is a retweet, with retweeted: false and shortened t.co URLs but no entries in entities.urls.
That shouldn't even exist and is also clearly broken. 🤣
Because what we are doing is writing that shortened link into the markdown output in its shortened and cut-off form.
I guess we would need some extra handling to cover those kinds of tweets, completely separate from this PR. Maybe look up the real link online and then catch the link? Because online this retweet is shown in full.
For this PR, checking for "…" in word and disregarding the tweet in question seems a good idea.
Funky. That is a retweet, with retweeted: false and [...]
BTW, all tweets, including retweets, in all my archives (that's roughly 50,000 tweets) have retweeted: false. I think that attribute has absolutely no meaning.
Funky. That is a retweet, with retweeted: false and [...]
BTW, all tweets, including retweets, in all my archives (that's roughly 50,000 tweets) have retweeted: false. I think that attribute has absolutely no meaning.
You're right. Thanks for pointing that out, I think I was mixing it up with the retweet_count which is something else however.
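Since the flag appears to be unreliable in archives, one possible workaround is to look at the text instead. This is only a heuristic sketch: the function name `looks_like_retweet` is hypothetical, and it assumes retweets in the archive start with "RT @", as the example tweet in this thread does. It would also match old manual retweets typed by hand.

```python
def looks_like_retweet(tweet):
    """Heuristic: treat a tweet as a retweet if its text starts with 'RT @'."""
    return tweet.get("full_text", "").startswith("RT @")


retweet = {"retweeted": False, "full_text": "RT @Awfidius: @MSFTResearch So exciting!"}
original = {"retweeted": False, "full_text": "I am laughing sooooo hard right now"}

print(looks_like_retweet(retweet))
print(looks_like_retweet(original))
```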
New cut, this looks better. I checked my archive and there are actually two tweets with t.co shortlinks with no entities.urls data.
Looks to be something weird on the twitter side probably...
But we're now ignoring links that end in \u2026, which is the ellipsis used by Twitter.
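A sketch of that filter, under the assumption that each candidate is a whitespace-separated word from full_text. The name `is_complete_url` is hypothetical and the PR's real check may be written differently.

```python
from urllib.parse import urlparse

ELLIPSIS = "\u2026"  # the single-character ellipsis Twitter appends to truncated text


def is_complete_url(word):
    """True for http(s) URLs that were not cut off with a trailing ellipsis."""
    if word.endswith(ELLIPSIS):
        return False
    try:
        parsed = urlparse(word)
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


for candidate in ["https://t.co/PxwGCm\u2026", "https://t\u2026", "https://t.co/QBl2mgEZyy"]:
    print(candidate, is_complete_url(candidate))
```

Checking only the end of the word is enough for the truncated retweets shown in this thread, where the ellipsis is always the last character.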