
Support unicode urls filtering

Open mbaruh opened this issue 2 weeks ago • 2 comments

mbaruh avatar Dec 12 '25 21:12 mbaruh

Unquoting the entire URL can change the meaning of the URL quite severely. For example, http://example.com%2F:%2F@malicious-site.example.org/ is a URL with the host of malicious-site.example.org, the username example.com/ and password /. But if you just unquote the entire thing, you get http://example.com/:/@malicious-site.example.org/ which has the hostname of example.com.

Another example: https://example.com/apple/banana/%2e%2e%2f%2e%2e%2fcherry semantically has the path /apple/banana/../../cherry, but if you unquote it and then resolve the dot segments, it will have the path /cherry. (This example isn't necessarily relevant here because we are only filtering the domain.)
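
To make the two examples above concrete, here is a minimal sketch using only the standard library. It assumes urllib.parse.urlsplit as a stand-in parser (it is not what Discord or the bot uses), and posixpath.normpath is only an approximation of dot-segment resolution, used here for illustration.

```python
from posixpath import normpath
from urllib.parse import unquote, urlsplit

# First example: the userinfo part is "example.com%2F:%2F".
tricky = "http://example.com%2F:%2F@malicious-site.example.org/"

# Parsing first keeps the real host, because "%2F" is not a delimiter.
host_parsed_first = urlsplit(tricky).hostname

# Unquoting first turns "%2F" into "/", which ends the authority early.
host_unquoted_first = urlsplit(unquote(tricky)).hostname

print(host_parsed_first)    # malicious-site.example.org
print(host_unquoted_first)  # example.com

# Second example: the encoded dots and slashes are plain path characters
# until they are unquoted; after that, resolving them changes the path.
dotted = "https://example.com/apple/banana/%2e%2e%2f%2e%2e%2fcherry"
raw_path = urlsplit(dotted).path
collapsed_path = normpath(unquote(raw_path))  # normpath is an approximation

print(raw_path)        # /apple/banana/%2e%2e%2f%2e%2e%2fcherry
print(collapsed_path)  # /cherry
```

The takeaway is the order of operations: split the URL into components first, then decode only the component you care about.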

decorator-factory avatar Dec 13 '25 01:12 decorator-factory

To summarize some discussion with mbaruh from Discord:

  • Discord likely uses JavaScript's URL class (new URL/URL.parse etc.) to figure out what part of the message should be a hyperlink.
  • JavaScript's URL parser will handle any amount of extra slashes after http:// and discard them, so e.g. http://////////////////example.com is considered the same as http://example.com.
  • RFC 3986 and the WHATWG URL Standard both allow domain names to be percent-encoded, so http://%d0%b1%d0%b0%d0%bd%d0%b0%d0%bd.com should be interpreted the same as http://банан.com.

For now, we can probably replace http(s?)://+ with http\1://, use yarl.URL to parse the URL, and then manually percent-decode the host. (yarl doesn't percent-decode the host, see GitHub discussion)
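
A rough sketch of that interim approach, under some assumptions: it uses urllib.parse.urlsplit from the standard library instead of yarl.URL so that it runs without extra dependencies, and extract_host is a hypothetical helper name, not something from the bot's codebase.

```python
import re
from typing import Optional
from urllib.parse import unquote, urlsplit

# Matches "http://" or "https://" followed by any number of extra slashes.
_EXTRA_SLASHES = re.compile(r"^http(s?)://+", re.IGNORECASE)


def extract_host(url: str) -> Optional[str]:
    """Return the percent-decoded, lowercased host of an http(s) URL.

    Hypothetical helper: urlsplit stands in for yarl.URL here.
    """
    # Collapse "http://////host" into "http://host".
    collapsed = _EXTRA_SLASHES.sub(r"http\1://", url, count=1)
    # Split into components before decoding anything.
    host = urlsplit(collapsed).hostname
    if host is None:
        return None
    # Decode only the host, never the whole URL.
    return unquote(host).lower()


print(extract_host("http:///////%d0%b1%d0%b0%d0%bd%d0%b0%d0%bd.com"))  # банан.com
print(extract_host("http://////////////////example.com"))  # example.com
```

This returns the Unicode form of the host. Whether the filter should compare Unicode or punycode forms is left open here.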

In the future, since we really want to parse URLs in the same way JavaScript does, we could use something that explicitly parses URLs according to the whatwg rules. There's a whatwg-url package on PyPI (which is "archived", but it's just one file so we can simply vendor it) that seems to fit our purposes:

>>> banana = "http:///////%d0%b1%d0%b0%d0%bd%d0%b0%d0%bd.com"
>>> whatwg_url.parse_url(banana)
<Url scheme='http' hostname='xn--80aab3cb.com' port=None path='/' query=None fragment=None>
>>>

As an alternative, we could use the Rust url crate, which is maintained by Servo and also implements the WHATWG spec. That has the added bonus of being blazingly fast (hopefully) in case we want to process a lot of URLs (e.g. if spam bots intentionally put a lot of URLs in a message to try DoSing the spam filters).

decorator-factory avatar Dec 13 '25 05:12 decorator-factory

one of my favorite urls that discord parses weirdly is: https://github.com/0/...

I also have a few others that are entirely unclickable in the client, I'll share them when I find them

onerandomusername avatar Dec 16 '25 07:12 onerandomusername