url-normalize Incorrect handling of reserved characters

Incorrect handling of reserved characters

Open deains opened this issue 3 years ago • 2 comments

The behaviour of url_normalize is wrong when dealing with URLs that have encoded characters like question marks ? or hash symbols # in the path.

%23 and %3F should not always be decoded into ? and # as they can change the meaning of the URL. Example:

>>> from url_normalize import url_normalize
>>> url_normalize('https://www.example.com/More+Tea+Vicar%3F/discussion')
'https://www.example.com/More+Tea+Vicar?/discussion'

Sometimes it's valid to decode these characters, e.g. example.com/?query=Hello%3F and example.com/?query=Hello? are equivalent. But the examples above are very clearly not the same URL.

I imagine this bug may affect other special characters too, but I've only tested ? and # myself.

May 04 '21 15:05 deains

@deains I ran into the same issue. Looks like hyperlink handles this correctly.

>>> from url_normalize import url_normalize
>>> url_normalize('https://www.example.com/More+Tea+Vicar%3F/discussion')
'https://www.example.com/More+Tea+Vicar?/discussion'

>>> import hyperlink
>>> hyperlink.URL.from_text('https://www.example.com/More+Tea+Vicar%3F/discussion').normalize().to_text()
'https://www.example.com/More+Tea+Vicar%3F/discussion'

Oct 14 '21 14:10 kirubakaran

I also just encountered this in a case where a request parameter value contains a reserved character:

>>> from url_normalize import url_normalize

>>> url_normalize('https://www.google.com/?where=code%3D123')
'https://www.google.com/?where=code=123'

Mar 31 '23 00:03 JWCook

url-normalize url-normalize copied to clipboard

Incorrect handling of reserved characters

url-normalize
url-normalize copied to clipboard