yarl icon indicating copy to clipboard operation
yarl copied to clipboard

Query parameters didn't get fully encoded

Open rominf opened this issue 5 years ago • 6 comments

When I build an URL with yarl.URL.build I get the URL with partly encoded query:

from yarl import URL
scheme = 'http'
host = 'profile-srv.kama.gs'
path = '/oauth/authorize'
query = {'response_type': 'code', 'client_id': 'bEjgr4rUka8Oswy', 'redirect_uri': 'http://172.25.210.184/authorized?state=e8be712d-3858-4549-86b2-60cea46c7396'}
URL.build(scheme=scheme, host=host, path=path, query=query)
# URL('http://profile-srv.kama.gs/oauth/authorize?response_type=code&client_id=bEjgr4rUka8Oswy&redirect_uri=http://172.25.210.184/authorized?state%3De8be712d-3858-4549-86b2-60cea46c7396')

:, /, ? are not encoded, as you see.

rominf avatar Oct 02 '18 06:10 rominf

Hi, I'm GitMate.io!

It seems you've just enabled the issue triaging. I'm just scraping all issues from your repository and will give you some more information about this in a few minutes or so.

Because of the rate limit we can't scrape all information (including all comments and authors) right now - our system is already set up to scrape this in the next days over which the predictions will become more precise every day.

If you want me to use a different account for triaging your issues, simply create one and log in with it.

Sit tight!

asvetlov avatar Oct 02 '18 06:10 asvetlov

GitMate.io thinks possibly related issues are https://github.com/aio-libs/yarl/issues/210 (Comma not being encoded in GET request params), https://github.com/aio-libs/yarl/issues/158 (Support encoded parameter in URL.build), https://github.com/aio-libs/yarl/issues/102 (Document encoded parameter), https://github.com/aio-libs/yarl/issues/45 (params do not encode right), and https://github.com/aio-libs/yarl/issues/91 (Encode semicolon in query string).

asvetlov avatar Oct 02 '18 06:10 asvetlov

@rominf I found your issue here after filing a very similar one at https://github.com/aio-libs/yarl/issues/301.

My hunch about this behavior is that it is actually yarl.URL that is more closely following RFC 3986. If you look at the ABNF for RFC 3986:

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
   query         = *( pchar / "/" / "?" )

This means that / and ? are okay to be put in unencoded (literal) form in the URL query string. The reason for that, I would assume, is that the ? is unambiguous: you scan the URL for the first ?, and everything after that is the query string; once you hit any other occurrences, you already know those are literal parts of the query string parameters. Similarly, literal / characters should be okay in the query string because, since the parser has already seen the first ?, it should know it's no longer in the path component of the URL.

My own confusion from this arose from the fact that urllib.parse.urlencode() (which is used by requests, among other libraries) does not treat ? as safe; it encodes it:

>>> import urllib.parse
>>> query = {'response_type': 'code', 'client_id': 'bEjgr4rUka8Oswy', 'redirect_uri': 'http://172.25.210.184/authorized?state=e8be712d-3858-4549-86b2-60cea46c7396'}
>>> urllib.parse.urlencode(query)
'response_type=code&client_id=bEjgr4rUka8Oswy&redirect_uri=http%3A%2F%2F172.25.210.184%2Fauthorized%3Fstate%3De8be712d-3858-4549-86b2-60cea46c7396'

What it all comes down to is that urllib.parse.urlencode eventually calls quote() from that same module. And here is the description of quote():

Each part of a URL, e.g. the path info, the query, etc., has a different set of reserved characters that must be quoted. The quote function offers a cautious (not minimal) way to quote a string for most of these parts.

bsolomon1124 avatar Apr 11 '19 20:04 bsolomon1124

For a different discussion see: https://github.com/psf/requests/issues/794

bsolomon1124 avatar Apr 11 '19 20:04 bsolomon1124

Not escaping the ? in query arguments is causing many web servers to reject the connections (I crawl >100.000 pages on more than >10.000 web servers through aiohttp which uses yarl intensively).

The standard does not force "?" from being encoded but does not forbid it either. Consequently, there are 2 alternatives which are standard-compliant and not one. Of these 2, the only alternative that seems webservers-compatible is to always encode it.

dmoklaf avatar Dec 07 '20 12:12 dmoklaf

Another concrete case of this documented here: https://github.com/aio-libs/aiohttp/issues/5319#issuecomment-1178507468

dralley avatar Jul 08 '22 04:07 dralley