yarl

Bug when parsing a percent symbol

Open faunris opened this issue 6 years ago • 13 comments

>>> URL('http://ex.com/%F0').path
'/%F0'
>>> URL('http://ex.com/%F0%')
URL('http://ex.com/%F0%25')
>>> URL('http://ex.com/%F0%').path
'/%25' # expected: /%F0%25

The trailing percent character breaks the path.

python: 3.6.3 yarl: 1.2.6

faunris avatar Aug 29 '18 09:08 faunris

@Faunris Hi. Could you please explain why you expect /%F0% to turn into /%F0%25? As far as I know, neither %F0 nor %F0% is a valid URL path sequence, and a percent symbol should be decoded in the context of the two HEXDIGs that follow it. (I could be wrong, but this is my current understanding.)

gyermolenko avatar Oct 13 '18 17:10 gyermolenko

@gyermolenko Hello, sorry for the slow reply. I think if

URL('http://ex.com/%F0').path

returns '/%F0', then

URL('http://ex.com/%F0%').path

should return '/%F0%25'

because I think URL decoding and encoding should round-trip consistently.

faunris avatar Oct 30 '18 06:10 faunris

I would say URL('http://ex.com/%F0%').path should produce '/%F0%'. The current behavior is buggy

asvetlov avatar Oct 30 '18 23:10 asvetlov

I was wrong, a single percent should be percent-encoded as %25 according to RFC 3986:

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI.

@Faunris's assumption is correct: URL('http://ex.com/%F0%').path should produce '/%F0%25'

asvetlov avatar Dec 21 '18 22:12 asvetlov
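To make the RFC 3986 rule quoted above concrete, here is a minimal sketch that escapes only those percent signs that do not start a valid `"%" HEXDIG HEXDIG` triplet. It is not yarl's implementation; the helper name `escape_stray_percent` is hypothetical and the regex approach is just one possible way to express the rule.

```python
import re

# A "%" that is NOT followed by two hex digits cannot start a pct-encoded
# triplet, so it has to be treated as data and written as "%25".
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escape_stray_percent(value: str) -> str:
    """Hypothetical helper: escape lone percent signs, keep valid triplets."""
    return _STRAY_PERCENT.sub("%25", value)


print(escape_stray_percent("/%F0%"))   # /%F0%25
print(escape_stray_percent("/%F0"))    # /%F0
```

Under this rule the output is stable when the function is applied again, because every `%25` it produces is itself a valid triplet.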

IMHO, if you absolutely need to encode custom invalid sequences, you should be consistent and turn every "%" into "%25", e.g. "%F0%" into "%25F0%25". Online URL encoders do it that way.

gyermolenko avatar Dec 22 '18 08:12 gyermolenko
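For comparison, the "encode every percent sign" behavior described above is what the standard library's `urllib.parse.quote` does when given a raw, unencoded string. The sketch below only illustrates that alternative; it says nothing about how yarl should behave.

```python
from urllib.parse import quote, unquote

raw = "/%F0%"

# quote() treats its input as unencoded text, so every "%" becomes "%25".
encoded = quote(raw, safe="/")
print(encoded)           # /%25F0%25

# Decoding once restores the original string exactly.
print(unquote(encoded))  # /%F0%
```

The trade-off is that this approach assumes the input is never already percent-encoded; an existing `%F0` triplet is not preserved as an encoded octet.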

Looks viable, thanks. My current view: the problem exists (although the case is pretty rare). It should be fixed, but we need to figure out the desired behavior. I appreciate any proposal.

asvetlov avatar Dec 22 '18 12:12 asvetlov

Here are some results. I highlighted ones that I consider invalid in red. https://docs.google.com/spreadsheets/d/1L3IKXMUh5Ya9D_PIT9ogaFeISj_SVEWMkblme3Xiyt0

The expected results correspond to my understanding of RFC 3986, and also to third-party online tools (e.g. the first Google result, https://meyerweb.com/eric/tools/dencoder/), although I review their results critically.

gyermolenko avatar Dec 24 '18 10:12 gyermolenko

I think "%D0" is a valid encoded symbol. It shouldn't be encoded again.

For example:

%25 is pct-encoded
%F0 is pct-encoded
and %D0 is pct-encoded

We don't need to additionally encode the percent symbol for every pct-encoded group.

2.1 rfc3986

pct-encoded = "%" HEXDIG HEXDIG

2.4 rfc3986

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

faunris avatar Dec 24 '18 10:12 faunris

@Faunris

I think "%D0" is a valid encoded symbol. It shouldn't be encoded again.

which one? It is not ascii, it is not in unreserved url chars. Why is it valid?

must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

To me, this means "turn % into %25 only once, to avoid %25 turning into %2525 and so on".

gyermolenko avatar Dec 24 '18 11:12 gyermolenko

@gyermolenko

which one? It is not ascii, it is not in unreserved url chars. Why is it valid?

Because the standard says:

pct-encoded = "%" HEXDIG HEXDIG

and in terms of the standard, %25 and %D0 are both valid symbols.

faunris avatar Dec 24 '18 12:12 faunris

Not everything that fits into "%" HEXDIG HEXDIG is valid (i.e. can be properly decoded back). Hence my question - what string was encoded into "%D0"?

gyermolenko avatar Dec 24 '18 13:12 gyermolenko
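The distinction being argued here can be checked with the standard library: `%D0` and `%F0` match the `pct-encoded` grammar and decode to a byte, but that byte alone is not a complete UTF-8 sequence (0xD0 is the lead byte of a two-byte sequence, 0xF0 of a four-byte one). A small sketch, with the hypothetical helper name `decodes_as_utf8`:

```python
from urllib.parse import unquote, unquote_to_bytes


def decodes_as_utf8(encoded: str) -> bool:
    """Hypothetical helper: True if the pct-encoded text is valid UTF-8."""
    try:
        unquote(encoded, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(unquote_to_bytes("%D0"))    # b'\xd0' (a well-formed octet)
print(decodes_as_utf8("%D0"))     # False (incomplete UTF-8 sequence)
print(decodes_as_utf8("%D0%B0"))  # True  (Cyrillic small letter a)
print(decodes_as_utf8("%25"))     # True  (a literal percent sign)
```

So both positions hold at different layers: the triplet is syntactically valid per RFC 3986, while its interpretation as text depends on the character encoding assumed.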

It is debatable how %F0 should be decoded, but the decoding of %25 currently depends on the preceding characters:

>>> u = URL('/%25'); u.raw_path, u.path
('/%25', '/%')
>>> u = URL('/%F0%25'); u.raw_path, u.path
('/%F0%25', '/%25')

Shouldn't %25 always be decoded as %? This looks like a bug.

serhiy-storchaka avatar Sep 27 '20 13:09 serhiy-storchaka
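One way to get context-independent decoding is to decode each run of triplets as bytes, keep any byte that is not valid UTF-8 in its `%XX` form, and decode everything else, including `%25`. The sketch below is an assumption about what such a decoder could look like, not yarl's actual unquoter; `unquote_lenient` is a hypothetical name.

```python
import re

# One or more consecutive pct-encoded triplets, e.g. "%F0%25".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match: "re.Match[str]") -> str:
    raw = bytes.fromhex(match.group(0).replace("%", ""))
    # surrogateescape maps each undecodable byte 0xNN to U+DCNN,
    # which lets us find those bytes again after decoding.
    text = raw.decode("utf-8", errors="surrogateescape")
    return "".join(
        "%{:02X}".format(ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )


def unquote_lenient(value: str) -> str:
    """Hypothetical decoder: %25 always becomes "%", bad bytes stay encoded."""
    return _PCT_RUN.sub(_decode_run, value)


print(unquote_lenient("/%25"))     # /%
print(unquote_lenient("/%F0%25"))  # /%F0%
```

With this approach `%25` decodes to `%` regardless of what precedes it. The decoded form is lossy (`/%F0%` no longer shows which percent sign was data), so it would suit display purposes rather than re-encoding.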

There is also a difference between the Python and Cython implementations.

Python implementation:

>>> URL('/%/%25')
URL('/%25/%25')

Cython implementation:

>>> URL('/%/%25')
URL('/%25/%2525')

serhiy-storchaka avatar Sep 27 '20 13:09 serhiy-storchaka
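A divergence like this can be caught with an idempotence check: requoting an already requoted string should change nothing. The sketch below uses the hypothetical helper `non_idempotent_samples` and, as a stand-in for a quoter that escapes every percent sign, the standard library's `urllib.parse.quote`. That stand-in happens to produce the same string as the Cython output above, but this is not a claim about how the Cython quoter works internally.

```python
from urllib.parse import quote


def non_idempotent_samples(requote, samples):
    """Hypothetical check: return samples where requoting twice differs."""
    return [s for s in samples if requote(requote(s)) != requote(s)]


def blanket_quote(value: str) -> str:
    # Stand-in quoter (assumption): escapes every "%", even in "%25".
    return quote(value, safe="/")


print(blanket_quote("/%/%25"))  # /%25/%2525
print(non_idempotent_samples(blanket_quote, ["/%/%25", "/plain"]))  # ['/%/%25']
```

Running the same check against both the Python and Cython code paths with a shared sample list would be one way to keep the two implementations aligned.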