twitrssme

Twitter image handling changes

Open ADSTfowlerj opened this issue 6 years ago • 1 comments

It appears that images aren't being parsed and fed correctly.

They end up being spit out as: <a class="twitter_external_link dir-ltr tco-link has-expanded-path" dir="ltr" href="https://t.co/4fds5KsQJe" rel="nofollow" target="_top">pic.twitter.com/4fds5KsQJe</a> which links back to the entire tweet, not just the image.

The URL it should be locating (as it used to) is: https://pbs.twimg.com/media/EA5PCRpXUAEh47V?format=png&name=900x900

There is a unique label, aria-label="Image", followed by two opportunities to grab the correct pbs.twimg.com URL from the code for the tweet page (example follows):

<div aria-label="Image" class="css-1dbjc4n r-1p0dtai r-1mlwlqe r-1d2f490 r-p1pxzi r-11wrixw r-1mnahxq r-1udh08x r-u8s1d r-zchlnj r-ipm5af r-417010" style="margin-right: 0px;"><div class="css-1dbjc4n r-1niwhzg r-vvn4in r-u6sd8q r-4gszlv r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-zchlnj r-ipm5af r-13qz1uu r-1wyyakw" style="background-image: url(&quot;https://pbs.twimg.com/media/EA0_J7dWsAILa4c?format=jpg&amp;name=900x900&quot;);"></div><img alt="Image" draggable="false" src="https://pbs.twimg.com/media/EA0_J7dWsAILa4c?format=jpg&amp;name=900x900" class="css-9pa8cd"></div>

For the user's main page with the timeline of tweets, it looks like this: <div aria-label="Image" class="css-1dbjc4n r-1p0dtai r-1mlwlqe r-1d2f490 r-p1pxzi r-11wrixw r-1mnahxq r-1udh08x r-u8s1d r-zchlnj r-ipm5af r-417010" style="margin-right: -27%;"><div class="css-1dbjc4n r-1niwhzg r-vvn4in r-u6sd8q r-4gszlv r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-zchlnj r-ipm5af r-13qz1uu r-1wyyakw" style="background-image: url(&quot;https://pbs.twimg.com/media/EAGg0KNWwAE4asJ?format=jpg&amp;name=360x360&quot;);"></div><img alt="Image" draggable="false" src="./ADST_Twitter_page_files/EAGg0KNWwAE4asJ(1)" class="css-9pa8cd"></div>

The best option seems to be to take the URL from the background-image declaration in the style attribute: style="background-image: url(&quot;https://pbs.twimg.com/media/EA0_J7dWsAILa4c?format=jpg&amp;name=900x900&quot;);"
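One possible way to pull that URL out with a regex is sketched below. It assumes the scraped HTML contains the same background-image markup as the examples quoted in this issue, which may not hold for the page the scraper actually fetches. The helper name extract_image_urls is hypothetical and not part of twitrssme.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of twitrssme): return every pbs.twimg.com
# media URL found inside a background-image: url(...) declaration.
sub extract_image_urls {
    my ($html) = @_;
    my @urls;
    # The URL may be wrapped in &quot;, a plain quote, or nothing at all.
    while ($html =~ m{background-image:\s*url\((?:&quot;|["'])?(https://pbs\.twimg\.com/media/[^)"'\s]+?)(?:&quot;|["'])?\)}g) {
        my $url = $1;
        $url =~ s/&amp;/&/g;    # undo HTML entity encoding in the query string
        push @urls, $url;
    }
    return @urls;
}

# Sample taken from the tweet-page markup quoted in this issue.
my $sample = '<div style="background-image: url(&quot;https://pbs.twimg.com/media/EA0_J7dWsAILa4c?format=jpg&amp;name=900x900&quot;);"></div>';
print "$_\n" for extract_image_urls($sample);
# prints https://pbs.twimg.com/media/EA0_J7dWsAILa4c?format=jpg&name=900x900
```

The lazy quantifier stops the capture just before the closing &quot;, so the entity-encoded quotes never end up in the URL.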

The code for pulling out and formatting the links is at line 108 of "mobile_twitter_to_rss.pl":

    # Fix pic.twitter.com links.
    $body =~ s{href="https://t\.co/[A-Za-z0-9]+">(pic\.twitter\.com/[A-Za-z0-9]+)}{href="https://$1">$1</a>}g;
    $body=~s{<a[^>]+href="https://t.co[^"]+"[^>]+title="([^"]+)"[^>]*>}{ <a href="$1">}gi; # experimental! stop links going via t.co; if an a has a title use it as the href.
    $body=~s{<a[^>]+title="([^"]+)"[^>]+href="https://t.co[^"]+"[^>]*>}{ <a href="$1">}gi; # experimental! stop links going via t.co; if an a has a title use it as the href.
    $body=~s{target="_blank"}{}gi;
    $body=~s{</?s[^>]*>}{}gi;
    $body=~s{data-[\w\-]+="[^"]+"}{}gi; # validator doesn't like data-aria markup that we get from twitter
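As a quick check, the sketch below runs the pattern from the first substitution against the anchor quoted earlier in this issue. The pattern expects the href value to be followed immediately by ">, but that anchor has rel and target attributes after the href, so the pattern does not appear to match. The loosened variant is only an illustration, not a proposed patch, and the names tight_matches and loose_matches are hypothetical.

```perl
use strict;
use warnings;

# Anchor copied from the report above.
my $anchor = '<a class="twitter_external_link dir-ltr tco-link has-expanded-path" dir="ltr" href="https://t.co/4fds5KsQJe" rel="nofollow" target="_top">pic.twitter.com/4fds5KsQJe</a>';

# Pattern from the first substitution in mobile_twitter_to_rss.pl, unchanged.
sub tight_matches {
    my ($html) = @_;
    return $html =~ m{href="https://t\.co/[A-Za-z0-9]+">(pic\.twitter\.com/[A-Za-z0-9]+)} ? 1 : 0;
}

# Illustrative variant that tolerates extra attributes after the href.
sub loose_matches {
    my ($html) = @_;
    return $html =~ m{href="https://t\.co/[A-Za-z0-9]+"[^>]*>(pic\.twitter\.com/[A-Za-z0-9]+)} ? 1 : 0;
}

print "tight: ", tight_matches($anchor), "\n";   # prints "tight: 0"
print "loose: ", loose_matches($anchor), "\n";   # prints "loose: 1"
```

Even with the loosened pattern, the rewritten link would still point at pic.twitter.com rather than at the pbs.twimg.com image.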

I've just no idea how to do the regex to pull that out of what we are getting during the scrape. Help please? If no one else is seeing this problem, please let me know that too.

ADSTfowlerj, Aug 01 '19 19:08

I have the same problem. Other tools that generate RSS feeds from Twitter get the same result, too!

rapha8l, Aug 04 '19 15:08