synapse icon indicating copy to clipboard operation
synapse copied to clipboard

URL previews includes unicode punycode which causes issues URLs in the text body

Open HarHarLinks opened this issue 3 years ago • 5 comments

Description

https://github.com/vector-im/element-web/issues/23432

Synapse injects (5o0a) punycode into og:description of twitter URLs, leading to broken links.

Steps to reproduce

Forwarded Element issue:

Steps to reproduce

  1. Post a twitter URL into a room, URL previews enabled
  2. Twitter post content is put into double quotes ""
  3. in case of media-only posts, the t.co media URL is in quotes
  4. element includes the ending quote in the URL, resulting in a broken link

Outcome

What did you expect?

links work

What happened instead?

image

Demo URLs for reference: https://twitter.com/FXNetworks/status/1577704289476128771 https://twitter.com/mischiefanimals/status/1576904037449969664

Operating system

arch

Application version

Element Nightly version: 2022100501 Olm version: 3.2.12

How did you install the app?

aur

Homeserver

private

Synapse Version

1.68.0

Installation Method

Docker (matrixdotorg/synapse)

Platform

debian, matrix-docker-ansible-deploy

Relevant log output

n/a

Anything else that would be useful to know?

curl "https://publish.twitter.com/oembed?url=https://twitter.com/mischiefanimals/status/1576904037449969664" | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   853  100   853    0     0   3946      0 --:--:-- --:--:-- --:--:--  4140
{
  "url": "https://twitter.com/mischiefanimals/status/1576904037449969664",
  "author_name": "animals going goblin mode",
  "author_url": "https://twitter.com/mischiefanimals",
  "html": "<blockquote class=\"twitter-tweet\"><p lang=\"zxx\" dir=\"ltr\"><a href=\"https://t.co/fVP8YWHS2j\">pic.twitter.com/fVP8YWHS2j</a></p>&mdash; animals going goblin mode (@mischiefanimals) <a href=\"https://twitter.com/mischiefanimals/status/1576904037449969664?ref_src=twsrc%5Etfw\">October 3, 2022</a></blockquote>\n<script async src=\"https://platform.twitter.com/widgets.js\" charset=\"utf-8\"></script>\n",
  "width": 550,
  "height": null,
  "type": "rich",
  "cache_age": "3153600000",
  "provider_name": "Twitter",
  "provider_url": "https://twitter.com",
  "version": "1.0"
}

https://user-images.githubusercontent.com/2403652/194146661-044758f9-fefd-4744-9ef2-bd3aec094d40.png

HarHarLinks avatar Oct 05 '22 19:10 HarHarLinks

The Unicode quotation marks are being added by Twitter itself, not Synapse. This isn't something that the Synapse team will fix, since we want to avoid adding special handling for certain websites.

$ curl "https://twitter.com/mischiefanimals/status/1576904037449969664" -H "User-Agent: Synapse (bot; +https://github.com/matrix-org/synapse)" | less
...<meta data-rh="true" content="<E2><80><9C>https://t.co/fVP8YWHS2j<E2><80><9D>" property="og:description"/>...

squahtx avatar Oct 06 '22 12:10 squahtx

@squahtx you already have "special" handling for bits of Twitter: https://github.com/matrix-org/synapse/blob/develop/synapse/res/providers.json - in theory adding all of Twitter to that whitelist would resolve this.

t3chguy avatar Oct 06 '22 12:10 t3chguy

@squahtx you already have "special" handling for bits of Twitter: develop/synapse/res/providers.json - in theory adding all of Twitter to that whitelist would resolve this.

That file is essentially unused now; moments don't exist.

We backed out the changes to use oEmbed for all of Twitter in #11985 because it gives significantly worse results.

It seems that Twitter itself is giving a better description in the oEmbed though (the Synapse parsed version would be):

pic.twitter.com/fVP8YWHS2j\n\n— animals going goblin mode (@mischiefanimals)\n\nOctober 3, 2022

vs. what can be pulled from their OpenGraph tags in the HTML, which as @squahtx pulled above is:

<E2><80><9C>https://t.co/fVP8YWHS2j<E2><80><9D>

We should be preferring this, see:

https://github.com/matrix-org/synapse/blob/303b40b988bc372a81fc1a3cf62d3f5d074970ff/synapse/rest/media/v1/preview_url_resource.py#L382-L387

clokep avatar Oct 06 '22 12:10 clokep

We should be preferring this, see:

Ah, I see what's going on -- Twitter doesn't have oEmbed autodiscovery enabled, so we are only scraping the HTML in this case.

If we were to add the Twitter URLs back to the providers.json than we would only fetch the oEmbed, which would lose the image preview.

We could always scrape the given URL, but also check oEmbed info if available. Would be reasonable in terms of "try to find the most info possible", but would result in duplicate queries in some situations... it would treat autodiscovery of oEmbed more similar to the hard-coded providers list though. 🤷

clokep avatar Oct 06 '22 15:10 clokep

I downgraded to tolerable since you can easily click the link being previewed, and uncommon since I think this is an odd example where a URL was given as the only tweet content (which seems a bit weird -- but do shout if this seems incorrect).

clokep avatar Oct 06 '22 15:10 clokep