url-normalize
url-normalize copied to clipboard
Incorrect handling of reserved characters
The behaviour of url_normalize is wrong when dealing with URLs that have encoded characters like question marks ? or hash symbols # in the path.
%23 and %3F should not always be decoded into ? and # as they can change the meaning of the URL. Example:
>>> from url_normalize import url_normalize
>>> url_normalize('https://www.example.com/More+Tea+Vicar%3F/discussion')
'https://www.example.com/More+Tea+Vicar?/discussion'
Sometimes it's valid to decode these characters, e.g. example.com/?query=Hello%3F and example.com/?query=Hello? are equivalent. But the examples above are very clearly not the same URL.
I imagine this bug may affect other special characters too, but I've only tested ? and # myself.
@deains I ran into the same issue. Looks like hyperlink handles this correctly.
>>> from url_normalize import url_normalize
>>> url_normalize('https://www.example.com/More+Tea+Vicar%3F/discussion')
'https://www.example.com/More+Tea+Vicar?/discussion'
>>> import hyperlink
>>> hyperlink.URL.from_text('https://www.example.com/More+Tea+Vicar%3F/discussion').normalize().to_text()
'https://www.example.com/More+Tea+Vicar%3F/discussion'
I also just encountered this in a case where a request parameter value contains a reserved character:
>>> from url_normalize import url_normalize
>>> url_normalize('https://www.google.com/?where=code%3D123')
'https://www.google.com/?where=code=123'