
canonicalize_url incorrectly handles port when using hostname that requires IDNA encoding

Open hwo411 opened this issue 1 year ago • 1 comments

Hello,

We just recently encountered the following problem:

canonicalize_url('https://тест.тест:33')

which returns https://xn--e1aybc.xn--:33-qdd4dec/

while the expected value is

https://xn--e1aybc.xn--e1aybc:33/

This happens with every hostname whose TLD requires IDNA encoding: the port is encoded as part of the last label.
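For illustration, here is a minimal sketch of the behavior I would expect, using only the standard library. The helper name `idna_netloc` is hypothetical and not part of w3lib. It assumes the port should be split off before the hostname is IDNA-encoded, and it ignores userinfo and IPv6 literals for brevity.

```python
from urllib.parse import urlsplit


def idna_netloc(url: str) -> str:
    """Hypothetical helper: IDNA-encode only the hostname, then re-attach the port."""
    parts = urlsplit(url)
    host = parts.hostname or ""   # hostname without the port, lowercased
    port = parts.port             # int, or None if the URL has no port
    encoded = host.encode("idna").decode("ascii")
    # Re-attach the port only after encoding, so ":33" never ends up inside a label.
    return f"{encoded}:{port}" if port is not None else encoded


print(idna_netloc("https://тест.тест:33"))  # xn--e1aybc.xn--e1aybc:33
```

Encoding the whole `тест.тест:33` string in one go treats `тест:33` as a single label, which would explain the `xn--:33-qdd4dec` part of the output above.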

Could you please fix this behavior?

hwo411 avatar Mar 06 '24 10:03 hwo411

I also discovered a related issue with multiple dots at the end of the domain:

>>> canonicalize_url('http://example.com.../тест')
'http://example.com.../%D1%82%D0%B5%D1%81%D1%82'
>>> canonicalize_url('http://тест.тест./тест')
'http://xn--e1aybc.xn--e1aybc./%D1%82%D0%B5%D1%81%D1%82'
>>> canonicalize_url('http://тест.тест.../тест')
'http://тест.тест.../%D1%82%D0%B5%D1%81%D1%82'

As you can see, a single trailing dot is handled properly, but with two or more trailing dots the domain is not encoded at all.

Update: this seems to be an invalid URL according to the standard, so the current behavior may be correct, although URL validators in some other languages accept such URLs and handle them normally. I'm not sure whether this addendum needs fixing, so I'll revert the title.
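A possible explanation, sketched with the standard library only: Python's built-in `idna` codec rejects empty labels, so a hostname with consecutive dots cannot be encoded. The helper `try_idna` below is hypothetical, and the fallback to the unencoded hostname is my assumption about what happens inside w3lib, not something I have confirmed.

```python
def try_idna(host: str) -> str:
    """Hypothetical helper: return the IDNA form of host, or host unchanged on failure."""
    try:
        return host.encode("idna").decode("ascii")
    except UnicodeError:
        # The codec raises for empty labels, i.e. consecutive dots.
        # Assumption: the caller then keeps the hostname as-is.
        return host


print(try_idna("тест.тест."))    # single trailing dot is accepted and encoded
print(try_idna("тест.тест..."))  # empty labels: returned unchanged
```

This would match the outputs above, where the single-dot hostname is encoded and the multi-dot hostname is left untouched.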

hwo411 avatar Mar 26 '24 07:03 hwo411