w3lib
`canonicalize_url` use of `safe_url_string` breaks when an encoded hash character is encountered
canonicalize_url decodes all percent-encoded elements in a string.
If a hash (#) is present as a percent-encoded entity (%23), it gets decoded as well. It shouldn't be, because # is an RFC delimiter, and decoding it fundamentally changes the URL structure by turning the subsequent characters into a fragment. One effect is that a canonical URL with a safely encoded hash points to a different URL; another is that running canonicalize_url on its own output returns yet another URL.
example:
>>> import w3lib.url
>>> url = "https://example.com/path/to/foo%20bar%3a%20biz%20%2376%2c%20bang%202017#bash"
>>> canonical = w3lib.url.canonicalize_url(url)
>>> print(canonical)
https://example.com/path/to/foo%20bar:%20biz%20#76,%20bang%202017
>>> canonical2 = w3lib.url.canonicalize_url(canonical)
>>> print(canonical2)
https://example.com/path/to/foo%20bar:%20biz%20
What is presented as a fragment in "canonical" (#76,%20bang%202017) is actually part of the valid URL, not a fragment, and it is discarded when canonicalize_url is run again.
references:
- https://github.com/scrapy/w3lib/issues/5
This offending unquoting happens in w3lib.url._unquotepath, which only protects / and ? (i.e. leaves %2F and %3F encoded).
What I commented earlier is not really relevant.
The issue is not so much unquoting %23 as not percent-encoding # afterwards when rebuilding the URI. (And maybe ? should not even be protected in _unquotepath: since the URL is already parsed into parts before the path is unquoted, it should be correct to percent-decode everything except %2F (/).)
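As a minimal, self-contained illustration of that failure mode (standard library only, not w3lib's actual code): once the path has been percent-decoded, rebuilding the URL without re-encoding # lets the decoded hash be re-parsed as a fragment delimiter.

```python
from urllib.parse import urlparse, urlunparse, unquote

# Hypothetical minimal reproduction: parse the URL into parts,
# percent-decode the path, then rebuild without re-encoding '#'.
url = "https://example.com/a%2376"
parts = urlparse(url)
rebuilt = urlunparse(parts._replace(path=unquote(parts.path)))

print(rebuilt)                     # https://example.com/a#76
print(urlparse(rebuilt).fragment)  # 76 -- now (wrongly) a fragment
```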
In RFC 3986, pchars, valid characters for each path segment, are:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
So, as / is used as the path segment delimiter, and % is kept for untouched percent-encoded chars, I believe the safe characters (those not needing percent-encoding) in the path component are:
unreserved / sub-delims / ":" / "@" / "/" / "%"
>>> print(w3lib.url._pchar_safe_chars.decode('ascii'))
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~!$&'()*+,;=:@/%
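For reference, that set can be derived mechanically from the grammar quoted above. A sketch (note that _pchar_safe_chars is the name proposed in this thread, not an existing w3lib attribute; the variables below are standalone):

```python
import string

# Building the path-safe set from the RFC 3986 pchar grammar above.
# '/' is kept as the segment delimiter, and '%' so that existing
# percent-escapes pass through untouched.
unreserved = string.ascii_letters + string.digits + "-._~"
sub_delims = "!$&'()*+,;="
pchar_safe = unreserved + sub_delims + ":@" + "/%"

print(pchar_safe)
```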
w3lib.url.canonicalize_url() currently uses these _safe_chars for quoting the path:
>>> print(w3lib.url._safe_chars.decode('ascii'))
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-%;/?:@&=+$|,#-_.!~*'()
The difference being:
>>> set(w3lib.url._safe_chars.decode('ascii')) - set(w3lib.url._pchar_safe_chars.decode('ascii'))
{'?', '#', '|'}
So indeed, # should be quoted. ? is currently "protected" (not-decoded) in _unquotepath. And | is the subject of #25 and #80.
I may have missed something, and reading RFC 3986 always confuses me at some point or another. So I guess we'll need to work on more tests and clean canonicalize_url and safe_url_string to match browsers better.
And maybe @kmike's #25 already does the right thing :)
Sorry, I haven't looked at this issue in detail yet, but I wanted to note that canonicalize_url is not guaranteed to preserve the URL. URLs produced by canonicalize_url should be used for deduplication and similar use cases. Of course, if there's an issue (it seems there is), we should make canonicalize_url work in a more reasonable way, but keep in mind that downloading the results of canonicalize_url is usually a mistake.
@redapple that RFC has driven me crazy for years. I haven't dug into the function or how the variables you referenced are used, but note that % is only pseudo-safe: it is valid to appear in a URL, but a literal % must itself be percent-encoded as %25.
@kmike Yes, I understand - canonicalize_url really only gives a fingerprint of the url.
Character sets and the utility of outputs aside, I think a better framing (and something a unit test could check) is that the output of canonicalize_url(url) should be final/terminal: it cannot be canonicalized any further, and it should "point to itself" if the canonicalize function is run on it again. The current #/%23 behavior breaks this, and you wind up with a canonicalized URL that points to a completely different resource once the function is applied to it.
canonicalize_url(url) == canonicalize_url(canonicalize_url(url))
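A self-contained sketch of that fixed-point property, using a simplified stand-in for the path handling rather than w3lib's actual code: with # excluded from the safe set, the function is idempotent.

```python
from urllib.parse import quote, unquote

# Simplified stand-in for canonicalize_url's path handling, used only
# to illustrate the fixed-point property (not w3lib's real code).
def canon_path(path: str) -> str:
    # '#' is absent from the safe set, so a decoded hash is re-encoded.
    return quote(unquote(path), safe="-._~!$&'()*+,;=:@/")

path = "/path/to/foo%20bar%3a%20biz%20%2376%2c%20bang%202017"
once = canon_path(path)

print(once)                      # /path/to/foo%20bar:%20biz%20%2376,%20bang%202017
print(canon_path(once) == once)  # True -- canonicalizing again is a no-op
```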
I created two PRs (and a test case) with some potential approaches to address this:
#93 - handles the _safe_chars differently, more in line with the above notes by @redapple
#94 - applies a transformation in _unquotepath, which should only be a path (e.g. not a fragment).
Any chance someone has input on the above approaches? If either looks like a candidate for inclusion, I would be excited to open a new PR against master that passes Travis on py2 and py3.
@jvanasco I like your #93 approach, but you’ll have to rewrite it due to conflicts with later changes (e.g. #25).
I would strongly suggest making your change as simple as possible. I think you could simply add the following two lines after the current _safe_chars definition:
# see https://github.com/scrapy/w3lib/issues/91
_safe_chars = _safe_chars.replace(b'#', b'')
With that and your tests I think #93 should be ready to merge.
Thanks for the fast response, @Gallaecio. It looks like you're right! Just updating _safe_chars works and passes all tests. I wasn't too familiar with the inner workings of this library, which is why I originally created a separate _safe_chars_component variable.
I'm going to ensure the tests cover everything against some production code, then make another PR.