Allow URL query encoding to be overridden

Open kmike opened this issue 7 years ago • 5 comments

Hey, just FYI: hyperlink encodes the query as UTF-8 before escaping (the _encode_query_part function); this is incorrect, as the query part should be encoded to the page encoding before percent-escaping. See https://url.spec.whatwg.org/#url-query-string.

kmike avatar Jun 26 '17 15:06 kmike
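
To illustrate the difference described in the report above, here is a minimal sketch using only Python's standard library (`urllib.parse.quote`), not hyperlink itself. The helper name `encode_query_value` and its `page_encoding` parameter are hypothetical, chosen for illustration; the point is only that the same character produces different percent-escapes depending on the encoding applied first.

```python
from urllib.parse import quote


def encode_query_value(text, page_encoding="utf-8"):
    """Hypothetical helper: encode text to the given encoding, then percent-escape.

    This is an illustration with the standard library, not hyperlink's
    _encode_query_part. safe="" means every reserved character is escaped.
    """
    return quote(text, safe="", encoding=page_encoding, errors="strict")


# The same character yields different escapes under different encodings.
utf8_form = encode_query_value("é")                              # UTF-8 bytes C3 A9
latin1_form = encode_query_value("é", page_encoding="latin-1")   # Latin-1 byte E9

print(utf8_form)
print(latin1_form)
```

A server that expects Latin-1 query strings would misread the UTF-8 form, which is why an overridable encoding is useful here.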

Hey @kmike! Thanks for the report.

I should probably document this, but the WHATWG URL standard only represents a small slice of URL applications, centered around the web and browsers in particular (see the special schemes section, for instance). Hyperlink mostly targets RFC3986.

That said, I think it's a fine suggestion to allow overriding of the underlying encoding, and I'll look into doing that in the near future. :)

mahmoud avatar Jun 26 '17 17:06 mahmoud

Fair enough, thanks!

I'm probably biased, but I don't agree that the web is a small slice :) Web pages generally follow the WHATWG URL standard, not the RFCs. Nobody reads those documents anyway: browsers implement WHATWG, and content creators test in browsers, on both the client and the server side.

kmike avatar Jun 26 '17 17:06 kmike

Ah, then allow me to douse the bias in a bit of reality: URLs are used by over 50 schemes/protocols. Some easy ones to consider that don't have any associated pages:

  • git
  • ssh
  • magnet
  • svn
  • mailto

And this doesn't include ad hoc uses of URLs, like SQLAlchemy's database URLs (postgresql://scott:tiger@localhost:5432/mydatabase). WHATWG doesn't seem to want to touch any of these applications in the slightest.

All that said, the web is a huge application for URLs, so compatibility is top priority. Browser behavior is also one of the first places I look for defaults and other design optimizations, so keep those suggestions coming! :)

mahmoud avatar Jun 26 '17 17:06 mahmoud

After some agonizing time spent looking at both WHATWG and RFC3986, I suspect we should be leaning towards favoring WHATWG's rules. I am a heavy user of many non-web cases, but WHATWG rules deeply influence, for example, the behavior of external links in operating systems (LSOpenURL, xdg-open, etc.)

To address this specific issue: this would be an (optional) parameter for asText, yes? I do feel strongly that UTF-8 ought to be the default.

glyph avatar Jun 30 '17 21:06 glyph

Well, we don't do any automatic decoding, so we dodge a bit of a bullet there. For encoding we can indeed pass it to asText. But I suspect it will be an argument to .to_iri(), for decoding, as well. The unnerving part is that individual parts can theoretically use different encodings: the query string could be UTF-8 while the path is Latin-1. So I guess we'll just accept one encoding (defaulting to UTF-8) and leave any parts that fail to decode percent-encoded.

mahmoud avatar Jun 30 '17 21:06 mahmoud
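
The fallback proposed in the last comment (accept one encoding, and leave anything that fails to decode percent-encoded) could look roughly like the sketch below. It uses the standard library's `urllib.parse.unquote` rather than hyperlink's own code, and `decode_part` is a hypothetical name; how `.to_iri()` would actually expose this is not settled in the thread.

```python
from urllib.parse import unquote


def decode_part(part, encoding="utf-8"):
    """Hypothetical helper: percent-decode one URL part with a single encoding.

    If the decoded bytes are not valid in that encoding, return the part
    unchanged, i.e. still percent-encoded. This treats the part as a whole;
    a real implementation might fall back per segment instead.
    """
    try:
        return unquote(part, encoding=encoding, errors="strict")
    except UnicodeDecodeError:
        return part


print(decode_part("caf%C3%A9"))                    # valid UTF-8, gets decoded
print(decode_part("caf%E9"))                       # not valid UTF-8, left as-is
print(decode_part("caf%E9", encoding="latin-1"))   # decodes when the encoding matches
```

Leaving undecodable parts untouched keeps the operation lossless: the original bytes can still be recovered later if the right encoding becomes known.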