
Normalisation of URLs containing non-ASCII domains is broken and loses data

Open wbolster opened this issue 8 years ago • 2 comments

Initial parsing works:

>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment')
URIReference(scheme='http', authority='æåëý.com', path='/path', query='query', fragment='fragment')

Subsequent normalisation silently loses data:

>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment').normalize()
URIReference(scheme='http', authority=None, path='/path', query='query', fragment='fragment')

wbolster, Jan 15 '16 13:01

Correct. We do not yet handle IRIs (RFC 3987).

sigmavirus24, Jan 16 '16 02:01

FWIW, a work-around that sort of "works": parse the URL with the URL parsing routines from the urllib3 package, replace the host name part with its IDNA-encoded (xn--…) equivalent, and only then pass the result to uri_reference().

wbolster, Jan 19 '16 17:01
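
The work-around described in the last comment can be sketched roughly as follows. This is only an illustration, not the code wbolster used: it swaps urllib3's URL parsing for the standard library's urllib.parse, the helper name idna_encode_host is made up for this example, and Python's built-in "idna" codec implements the older IDNA 2003 rules, which may differ from what a given application needs.

```python
from urllib.parse import urlsplit, urlunsplit


def idna_encode_host(url: str) -> str:
    """Return `url` with a non-ASCII host name replaced by its IDNA (xn--) form.

    Hypothetical helper for illustration; not part of rfc3986 or urllib3.
    """
    parts = urlsplit(url)
    host = parts.hostname
    # Nothing to do for URLs without a host, or with an ASCII-only host
    # (this also leaves bracketed IPv6 literals untouched).
    if host is None or host.isascii():
        return url

    # Built-in codec, IDNA 2003 semantics (an assumption worth checking
    # against your own requirements).
    ascii_host = host.encode("idna").decode("ascii")

    # Rebuild the network location, keeping any userinfo and port.
    netloc = ascii_host
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if "@" in parts.netloc:
        userinfo = parts.netloc.rpartition("@")[0]
        netloc = f"{userinfo}@{netloc}"

    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
```

The returned string is plain ASCII, so it can be handed to rfc3986.uri_reference() in place of the original URL. Note that this only side-steps the problem for the host name; non-ASCII characters elsewhere in an IRI are not handled by this sketch.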