hyperlink
Allow URL query encoding to be overridden
Hey, just FYI: hyperlink encodes the query to UTF-8 before escaping (the _encode_query_part function); this is incorrect, as the query part should be encoded to the page's encoding before percent-escaping. See https://url.spec.whatwg.org/#url-query-string.
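To make the difference concrete, here is a minimal sketch using Python's standard `urllib.parse.quote` rather than hyperlink's internals. The helper name `escape_query_value` and the sample value are illustrative assumptions, not part of hyperlink.

```python
from urllib.parse import quote


def escape_query_value(value, page_encoding="utf-8"):
    """Percent-escape a query value after encoding it with page_encoding.

    Hypothetical helper for illustration only; it is not hyperlink's API.
    """
    # quote() first encodes the text to bytes with the given codec, then
    # percent-escapes every byte that is not an unreserved character.
    return quote(value, safe="", encoding=page_encoding, errors="strict")


# The same text yields different escaped bytes depending on the encoding.
utf8_form = escape_query_value("café")
latin1_form = escape_query_value("café", page_encoding="latin-1")
print(utf8_form)
print(latin1_form)
```

A server that expects the page's encoding (latin-1 in this example) would not read the UTF-8 form back as the original text.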
Hey @kmike! Thanks for the report.
I should probably document this, but the WHATWG URL standard only represents a small slice of URL applications, centered around the web and browsers in particular (see the special schemes section, for instance). Hyperlink mostly targets RFC3986.
That said, I think it's a fine suggestion to allow overriding of the underlying encoding, and I'll look into doing that in the near future. :)
Fair enough, thanks!
I'm probably biased, but I don't agree that the web is a small slice :) Web pages generally follow the WHATWG URL standard, not the RFCs. Hardly anybody reads those documents anyway; browsers implement WHATWG, and content creators test in browsers, on both the client and the server side.
Ah, then allow me to douse the bias in a bit of reality: URLs are used by over 50 schemes/protocols. Some easy ones to consider that don't have any associated pages:
- git
- ssh
- magnet
- svn
- mailto
And this doesn't include ad hoc uses of URL like what SQLAlchemy does (postgresql://scott:tiger@localhost:5432/mydatabase). WHATWG doesn't seem to want to touch any of these applications in the slightest.
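As a rough illustration that such a URL fits the generic RFC 3986 syntax without any web context, the sketch below splits the SQLAlchemy-style URL with the standard library's `urlsplit`. Using `urlsplit` instead of hyperlink is an assumption made to keep the example self-contained.

```python
from urllib.parse import urlsplit

# A database URL in the style SQLAlchemy uses; there is no web page behind it.
parts = urlsplit("postgresql://scott:tiger@localhost:5432/mydatabase")

# The generic scheme://userinfo@host:port/path structure still applies.
print(parts.scheme)    # scheme component
print(parts.username)  # user from the userinfo component
print(parts.password)  # password from the userinfo component
print(parts.hostname)  # host component
print(parts.port)      # port, parsed as an integer
print(parts.path)      # path component (the database name here)
```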
All that said, the web is a huge application for URLs, so compatibility is top priority. Browser behavior is also one of the first places I look for defaults and other design optimizations, so keep those suggestions coming! :)
After some agonizing time spent looking at both WHATWG and RFC3986, I suspect we should be leaning towards favoring WHATWG's rules. I am a heavy user of many non-web cases, but WHATWG rules deeply influence, for example, the behavior of external links in operating systems (LSOpenURL, xdg-open, etc.)
To address this specific issue: this would be an (optional) parameter for asText, yes? I do feel strongly that UTF-8 ought to be the default.
Well, we don't do any automatic decoding, so we dodge a bit of a bullet there. For encoding we can indeed pass it to asText. But I suspect it will be an argument to .to_iri(), for decoding, as well. The unnerving part is that individual parts can theoretically be different encodings. Query string could be utf8 while the path is latin-1. So I guess we'll just accept one encoding (defaulting to utf8) and leave failing encoded parts as percent-encoded.
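The fallback described above could be sketched roughly as follows. This uses the standard library's `unquote` and a hypothetical helper named `decode_part`; it is not how `.to_iri()` is implemented, only an illustration of "one encoding, leave failures percent-encoded".

```python
from urllib.parse import unquote


def decode_part(part, encoding="utf-8"):
    """Percent-decode one URL part, or return it unchanged on failure.

    Hypothetical helper for illustration; not part of hyperlink.
    """
    try:
        # errors="strict" makes unquote raise instead of inserting U+FFFD.
        return unquote(part, encoding=encoding, errors="strict")
    except UnicodeDecodeError:
        # The escaped bytes are not valid in this encoding: keep the part
        # percent-encoded rather than guessing.
        return part


# Mixed encodings in one URL: a latin-1 path and a UTF-8 query string.
path = decode_part("/caf%E9")             # not valid UTF-8, stays escaped
query = decode_part("q=caf%C3%A9")        # valid UTF-8, decoded
latin1_path = decode_part("/caf%E9", encoding="latin-1")
print(path, query, latin1_path)
```

With a single encoding argument, a caller who knows the path is latin-1 can decode it in a separate call, while the default call leaves it untouched.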