url-normalize
url-normalize copied to clipboard
Incorrect handling of reserved characters
The behaviour of url_normalize
is wrong when dealing with URLs that have encoded characters like question marks ?
or hash symbols #
in the path.
%23
and %3F
should not always be decoded into ?
and #
as they can change the meaning of the URL. Example:
>>> from url_normalize import url_normalize
>>> url_normalize('https://www.example.com/More+Tea+Vicar%3F/discussion')
'https://www.example.com/More+Tea+Vicar?/discussion'
Sometimes it's valid to decode these characters, e.g. example.com/?query=Hello%3F
and example.com/?query=Hello?
are equivalent. But the examples above are very clearly not the same URL.
I imagine this bug may affect other special characters too, but I've only tested ?
and #
myself.
@deains I ran into the same issue. Looks like hyperlink handles this correctly.
>>> from url_normalize import url_normalize
>>> url_normalize('https://www.example.com/More+Tea+Vicar%3F/discussion')
'https://www.example.com/More+Tea+Vicar?/discussion'
>>> import hyperlink
>>> hyperlink.URL.from_text('https://www.example.com/More+Tea+Vicar%3F/discussion').normalize().to_text()
'https://www.example.com/More+Tea+Vicar%3F/discussion'
I also just encountered this in a case where a request parameter value contains a reserved character:
>>> from url_normalize import url_normalize
>>> url_normalize('https://www.google.com/?where=code%3D123')
'https://www.google.com/?where=code=123'