yarl
Bug when parsing a percent symbol
>>> URL('http://ex.com/%F0').path
'/%F0'
>>> URL('http://ex.com/%F0%')
URL('http://ex.com/%F0%25')
>>> URL('http://ex.com/%F0%').path
'/%25' # expected: /%F0%25
The trailing percent character breaks the path.
python: 3.6.3 yarl: 1.2.6
@Faunris Hi
Could you please explain why you expect /%F0% to turn into /%F0%25?
As far as I know, neither %F0 nor %F0% is a valid URL path sequence.
A percent symbol should be decoded in the context of the two HEXDIGs that follow it.
(I could be wrong, but this is my current understanding)
@gyermolenko Hello, sorry for the slow reply. I think that if
URL('http://ex.com/%F0').path
returns '/%F0', then
URL('http://ex.com/%F0%').path
should return '/%F0%25'
because I think URL decoding and encoding should be consistent with each other.
I would say URL('http://ex.com/%F0%').path
should produce '/%F0%'.
The current behavior is buggy.
I was wrong: according to RFC 3986, a single percent should be percent-encoded as %25:
Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI.
@Faunris's assumption is correct:
URL('http://ex.com/%F0%').path
should produce '/%F0%25'
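The expected behavior can be sketched with a regular expression that escapes only a "%" that is not followed by two hex digits. This is an illustration of the rule, not yarl's actual implementation; `escape_stray_percent` is a hypothetical helper name.

```python
import re

# A "%" that is NOT followed by two hex digits is a stray percent sign.
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escape_stray_percent(raw: str) -> str:
    """Hypothetical helper: turn a stray "%" into "%25" and leave
    well-formed "%" HEXDIG HEXDIG triplets untouched."""
    return _STRAY_PERCENT.sub("%25", raw)


print(escape_stray_percent("/%F0"))    # /%F0
print(escape_stray_percent("/%F0%"))   # /%F0%25
print(escape_stray_percent("/%/%25"))  # /%25/%25
```

Applying the helper a second time changes nothing, because every "%" in its output already starts a well-formed triplet.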
IMHO, if you absolutely need to encode custom invalid sequences, you should be consistent and turn every "%" into "%25", e.g. "%F0%" into "%25F0%25". Online URL encoders do it that way.
Looks viable, thanks. My current view is that the problem exists (although the case is pretty rare). It should be fixed, but we need to figure out the desired behavior. I'd appreciate any proposals.
Here are some results. I highlighted ones that I consider invalid in red. https://docs.google.com/spreadsheets/d/1L3IKXMUh5Ya9D_PIT9ogaFeISj_SVEWMkblme3Xiyt0
The expected results correspond to my understanding of RFC 3986 and to third-party online tools (e.g. the first result on Google, https://meyerweb.com/eric/tools/dencoder/), although I treat their results critically.
I think "%D0" is a valid encoded symbol. It shouldn't be encoded again.
For example, %25 is pct-encoded, %F0 is pct-encoded, and %D0 is pct-encoded. We don't need to additionally encode the percent symbol of every pct-encoded group.
RFC 3986, section 2.1:
pct-encoded = "%" HEXDIG HEXDIG
RFC 3986, section 2.4:
Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.
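The double-encoding and double-decoding hazards described in this passage can be reproduced with the standard library's `urllib.parse` (used here only for illustration; yarl has its own quoting code). The variable names are arbitrary.

```python
from urllib.parse import quote, unquote

# Encoding twice: "%" -> "%25" -> "%2525".
once = quote("%")
twice = quote(once)
print(once)   # %25
print(twice)  # %2525

# Decoding twice: the literal text "%41" is data, not an encoded "A".
encoded = quote("%41")
decoded_once = unquote(encoded)        # correct: the original text
decoded_twice = unquote(decoded_once)  # wrong: "%41" misread as "A"
print(encoded)        # %2541
print(decoded_once)   # %41
print(decoded_twice)  # A
```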
@Faunris
I think "%D0" is a valid encoded symbol. It shouldn't be encoded again.
Which one? It is not ASCII, and it is not among the unreserved URL characters. Why is it valid?
must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.
To me this means "turn % into %25 only once, to avoid %25 turning into %2525 and so on".
@gyermolenko
Which one? It is not ASCII, and it is not among the unreserved URL characters. Why is it valid?
Because the standard says:
pct-encoded = "%" HEXDIG HEXDIG
and in terms of the standard, both %25 and %D0 are valid symbols.
Not everything that fits into "%" HEXDIG HEXDIG is valid (i.e. can be properly decoded back).
Hence my question: what string was encoded into "%D0"?
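A quick check with the standard library shows the distinction being drawn here, assuming paths are decoded as UTF-8: "%D0" is syntactically a pct-encoded octet, but the lone byte 0xD0 is only the first half of a two-byte UTF-8 sequence and does not decode to any character on its own.

```python
from urllib.parse import unquote, unquote_to_bytes

lone = unquote_to_bytes("%D0")     # a single octet, 0xD0
pair = unquote_to_bytes("%D0%96")  # a complete two-byte UTF-8 sequence

print(pair.decode("utf-8"))  # Ж

try:
    lone.decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode %D0 as UTF-8:", exc.reason)

# urllib's unquote() defaults to errors="replace", so the lone octet
# becomes U+FFFD (the replacement character) instead of raising.
print(unquote("%D0") == "\ufffd")  # True
```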
It is debatable how %F0 should be decoded, but the decoding of %25 depends on the preceding characters:
>>> u = URL('/%25'); u.raw_path, u.path
('/%25', '/%')
>>> u = URL('/%F0%25'); u.raw_path, u.path
('/%F0%25', '/%25')
Shouldn't %25 always be decoded as %? This looks like a bug.
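One possible context-independent decoding, sketched with the standard library rather than yarl's code: decode every pct-encoded octet, and re-emit octets that are not valid UTF-8 as "%XX" through a custom codec error handler. The handler name `keep_pct` and the helper `decode_path` are made up for this sketch, and whether undecodable octets should be kept this way is exactly the open question in this thread.

```python
import codecs
from urllib.parse import unquote


def _keep_pct(exc):
    """Error handler: re-emit undecodable bytes as %XX."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join("%{:02X}".format(b) for b in bad), exc.end


# "keep_pct" is a hypothetical handler name registered for this sketch only.
codecs.register_error("keep_pct", _keep_pct)


def decode_path(raw_path: str) -> str:
    """Hypothetical helper: decode a raw path, keeping invalid octets as %XX."""
    return unquote(raw_path, encoding="utf-8", errors="keep_pct")


print(decode_path("/%25"))     # /%
print(decode_path("/%F0%25"))  # /%F0%
```

With this approach %25 decodes to % regardless of what precedes it, while the invalid octet %F0 stays visible in the decoded path.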
There is also a difference between the Python and Cython implementations.
Python implementation:
>>> URL('/%/%25')
URL('/%25/%25')
Cython implementation:
>>> URL('/%/%25')
URL('/%25/%2525')