w3lib
`canonicalize_url` use of `safe_url_string` breaks when an encoded hash character is encountered
canonicalize_url decodes all percent-encoded elements in a string.
If a hash (#) is present as a percent-encoded entity (%23), it gets decoded as well. It shouldn't be, because # is an RFC delimiter, and decoding it fundamentally changes the URL structure by turning the subsequent characters into a fragment. One effect is that a canonical URL with a safely encoded hash points to a different URL; another is that running canonicalize_url on its own output returns yet another URL.
example:
>>> import w3lib.url
>>> url = "https://example.com/path/to/foo%20bar%3a%20biz%20%2376%2c%20bang%202017#bash"
>>> canonical = w3lib.url.canonicalize_url(url)
>>> print(canonical)
https://example.com/path/to/foo%20bar:%20biz%20#76,%20bang%202017
>>> canonical2 = w3lib.url.canonicalize_url(canonical)
>>> print(canonical2)
https://example.com/path/to/foo%20bar:%20biz%20
What is presented as a fragment in "canonical" (#76,%20bang%202017) is actually part of the valid URL, not a fragment, and it is discarded when canonicalize_url is run again.
references:
- https://github.com/scrapy/w3lib/issues/5
This offending unquoting happens in w3lib.url._unquotepath, which only protects / and ? (i.e. leaves %2F and %3F encoded).
What I commented earlier is not really relevant.
The issue is not so much unquoting %23 as not percent-encoding # afterwards when rebuilding the URI. (And maybe ? should not even be protected in _unquotepath: since the URL is already parsed into parts before the path is unquoted, it should be correct to percent-decode everything except %2F (/).)
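As a minimal, self-contained illustration of that failure mode (standard library only, not w3lib's actual code): once the path has been percent-decoded, rebuilding the URL without re-encoding # lets the decoded hash be re-parsed as a fragment delimiter.

```python
from urllib.parse import urlparse, urlunparse, unquote

# Hypothetical minimal reproduction: parse the URL into parts,
# percent-decode the path, then rebuild without re-encoding '#'.
url = "https://example.com/a%2376"
parts = urlparse(url)
rebuilt = urlunparse(parts._replace(path=unquote(parts.path)))

print(rebuilt)                     # https://example.com/a#76
print(urlparse(rebuilt).fragment)  # 76 -- now (wrongly) a fragment
```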
In RFC 3986, pchars, valid characters for each path segment, are:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
So, as / is used as the path segment delimiter, and % is kept for untouched percent-encoded chars, I believe the safe characters (those not needing percent-encoding) in the path component are:
unreserved / sub-delims / ":" / "@" / "/" / "%"
>>> print(w3lib.url._pchar_safe_chars.decode('ascii'))
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~!$&'()*+,;=:@/%
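For reference, that set can be derived mechanically from the grammar quoted above. A sketch (note that _pchar_safe_chars is the name proposed in this thread, not an existing w3lib attribute; the variables below are standalone):

```python
import string

# Building the path-safe set from the RFC 3986 pchar grammar above.
# '/' is kept as the segment delimiter, and '%' so that existing
# percent-escapes pass through untouched.
unreserved = string.ascii_letters + string.digits + "-._~"
sub_delims = "!$&'()*+,;="
pchar_safe = unreserved + sub_delims + ":@" + "/%"

print(pchar_safe)
```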
w3lib.url.canonicalize_url() currently uses these _safe_chars for quoting the path:
>>> print(w3lib.url._safe_chars.decode('ascii'))
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-%;/?:@&=+$|,#-_.!~*'()
The difference being:
>>> set(w3lib.url._safe_chars.decode('ascii')) - set(w3lib.url._pchar_safe_chars.decode('ascii'))
{'?', '#', '|'}
So indeed, # should be quoted. ? is currently "protected" (not-decoded) in _unquotepath. And | is the subject of #25 and #80.
I may have missed something, and reading RFC 3986 always confuses me at some point or another. So I guess we'll need to work on more tests and clean canonicalize_url and safe_url_string to match browsers better.
And maybe @kmike's #25 already does the right thing :)
Sorry, I haven't looked at this issue in detail yet, but I wanted to note that canonicalize_url is not guaranteed to preserve the URL. URLs produced by canonicalize_url should be used for deduplication and similar use cases. Of course, if there's an issue (it seems there is), we should make canonicalize_url work in a more reasonable way, but keep in mind that downloading the results of canonicalize_url is usually a mistake.
@redapple that RFC has driven me crazy for years. I haven't dug into the function or how the variables you referenced are used, but note that % is only pseudo-safe: it is valid to appear in a URL, but a literal % must itself be percent-encoded as %25.
@kmike Yes, I understand - canonicalize_url really only gives a fingerprint of the url.
Character sets and the utility of outputs aside, I think a better framing (and something a unit test could check) is that the output of canonicalize_url(url) should be final/terminal: it cannot be canonicalized any further, and it should "point to itself" if the canonicalize function is run on it again. The current #/%23 behavior breaks this, and you wind up with a canonicalized URL that points to a completely different resource once the function is applied to it.
canonicalize_url(url) == canonicalize_url(canonicalize_url(url))
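A self-contained sketch of that fixed-point property, using a simplified stand-in for the path handling rather than w3lib's actual code: with # excluded from the safe set, the function is idempotent.

```python
from urllib.parse import quote, unquote

# Simplified stand-in for canonicalize_url's path handling, used only
# to illustrate the fixed-point property (not w3lib's real code).
def canon_path(path: str) -> str:
    # '#' is absent from the safe set, so a decoded hash is re-encoded.
    return quote(unquote(path), safe="-._~!$&'()*+,;=:@/")

path = "/path/to/foo%20bar%3a%20biz%20%2376%2c%20bang%202017"
once = canon_path(path)

print(once)                      # /path/to/foo%20bar:%20biz%20%2376,%20bang%202017
print(canon_path(once) == once)  # True -- canonicalizing again is a no-op
```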
I created two PRs (and a test case) with some potential approaches to address this:
#93 - handles the _safe_chars differently, more in line with the above notes by @redapple
#94 - applies a transformation in _unquotepath, which should only be a path (e.g. not a fragment).
Any chance someone has input on the above approaches? If either looks like a candidate for inclusion, I would be excited to open a new PR against master that passes Travis on py2 and py3.
@jvanasco I like your #93 approach, but you’ll have to rewrite it due to conflicts with later changes (e.g. #25).
I would strongly suggest making your change as simple as possible. I think you could simply add the following two lines after the current _safe_chars definition:
# see https://github.com/scrapy/w3lib/issues/91
_safe_chars = _safe_chars.replace(b'#', b'')
With that and your tests I think #93 should be ready to merge.
Thanks for the fast response, @Gallaecio. It looks like you're right! Just updating _safe_chars works and passes all tests. I wasn't too familiar with the inner workings of this library, which is why I originally created a separate _safe_chars_component variable.
I'm going to ensure the tests cover everything against some production code, then make another PR.