rfc3986 normalisation of urls containing non-ascii domains is broken and loses data

Open wbolster opened this issue 8 years ago • 2 comments

Initial parsing works:

>>> import rfc3986
>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment')
URIReference(scheme='http', authority='æåëý.com', path='/path', query='query', fragment='fragment')

Subsequent normalisation silently loses data:

>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment').normalize()
URIReference(scheme='http', authority=None, path='/path', query='query', fragment='fragment')

Jan 15 '16 13:01 wbolster

Correct. We do not yet handle IRIs (RFC 3987).

Jan 16 '16 02:01 sigmavirus24

FWIW, a work-around that sort of "works": before passing the URL to uri_reference(), preprocess it by replacing the host name part with its IDNA-encoded (xn--…) equivalent, using the URL parsing routines from the urllib3 package.

Jan 19 '16 17:01 wbolster
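
A minimal sketch of that kind of preprocessing, for illustration only. It is not the code used in the comment above: it swaps urllib3's parsing routines for the standard library's urllib.parse and Python's built-in "idna" codec (which implements the older IDNA 2003 rules), and the helper name idna_encode_host is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def idna_encode_host(url):
    """Return url with a non-ASCII host name replaced by its xn-- form.

    Hypothetical helper, not part of rfc3986 or urllib3.
    """
    parts = urlsplit(url)
    host = parts.hostname  # lower-cased host without userinfo or port
    if host is None or host.isascii():
        # Nothing to encode (this also leaves IPv6 literals untouched).
        return url

    # Built-in codec: encodes each label, e.g. to "xn--..." (IDNA 2003).
    ascii_host = host.encode("idna").decode("ascii")

    # Rebuild the authority, preserving any userinfo and port.
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = userinfo + at + ascii_host
    if parts.port is not None:
        netloc += ":" + str(parts.port)

    return urlunsplit(parts._replace(netloc=netloc))


# Assumed usage with rfc3986 (requires the package to be installed):
#   import rfc3986
#   safe = idna_encode_host('http://æåëý.com/path?query#fragment')
#   rfc3986.uri_reference(safe).normalize()
```

Only the host name is rewritten; the path, query and fragment are passed through unchanged, so any non-ASCII characters there would still need percent-encoding separately.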
