
Incorrect behavior with schemeless-dotless host:port URLs

Open itamaro opened this issue 2 years ago • 11 comments

URLs of the form hostname:8080 (with no scheme, with "hostname" not containing any dots) can be used to refer to the netloc "hostname:8080"

requests.utils.prepend_scheme_if_needed should correctly prepend the new_scheme when provided with such a URL.

Expected Result

the prepended-scheme URL should be "http://hostname:8080"

Actual Result

the prepended-scheme URL is "hostname:///8080" (i.e. treating the "hostname" part as the scheme, no host, no port, and "8080" as the path)

I extended the test_prepend_scheme_if_needed to demonstrate this behavior (see https://github.com/psf/requests/compare/main...itamaro:requests:schemeless-hostname-anad-port-bug)

Reproduction Steps

from requests.utils import prepend_scheme_if_needed
print(prepend_scheme_if_needed("hostname:8080", "http"))

System Information

$ python -m requests.help
{
  "chardet": {
    "version": null
  },
  "charset_normalizer": {
    "version": "3.1.0"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "3.4"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.10.4"
  },
  "platform": {
    "release": "22.4.0",
    "system": "Darwin"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.30.0"
  },
  "system_ssl": {
    "version": "101010ef"
  },
  "urllib3": {
    "version": "2.0.2"
  },
  "using_charset_normalizer": true,
  "using_pyopenssl": false
}

itamaro avatar May 13 '23 21:05 itamaro

How does this realistically affect users who aren't using utils which aren't part of the public API?

sigmavirus24 avatar May 14 '23 00:05 sigmavirus24

How does this realistically affect users who aren't using utils which aren't part of the public API?

ah sorry, I went so far down a rabbit hole tracking this issue down that I forgot to mention the user-facing scenario!

one scenario is when using such a URL as a proxy:

response = requests.get(
    "http://www.example.com",
    proxies={"http": "myproxy:8080"},
    ....
)

this used to work (with requests 2.25.1 and python 3.8), but with requests 2.27.1 it fails with

...
requests.exceptions.InvalidProxyURL: Please check proxy URL. It is malformed and could be missing the host.

another scenario (less critical) is difference in exceptions (MissingSchema vs InvalidSchema):

response = requests.get("www.example.com")
...
MissingSchema: Invalid URL 'www.example.com': No scheme supplied. Perhaps you meant http://www.example.com?

vs

response = requests.get("hostname:8080")
...
requests.exceptions.InvalidSchema: No connection adapters were found for 'hostname:8080'

itamaro avatar May 14 '23 02:05 itamaro

So the last two aren't ever supposed to work. Would it be nice if they raised the same exception? Sure. But we should never be guessing the scheme. We never documented that schemeless URLs are supported in that way.

The proxy case may take investigation. But I suspect we documented a breaking change there. I vaguely remember other people complaining

sigmavirus24 avatar May 14 '23 10:05 sigmavirus24

The proxy case may take investigation. But I suspect we documented a breaking change there. I vaguely remember other people complaining

Yes, it's definitely the proxy case that brought me here and is causing me issues. The other one is just a nit.

itamaro avatar May 15 '23 00:05 itamaro

Hi @itamaro, this was an intentional change in 2.27.0 due to this bug in CPython. The behavior of urlparse fundamentally changed from Python 3.9 onwards, which makes supporting schemeless URIs in this case fairly difficult. The decision was made to move to urllib3's parse_url function, which produces the behavior you're seeing now, to limit the blast radius of the standard library updates.

This is why you're seeing an error now that wasn't occurring previously.

Python 3.10

Python 3.10.5 (main, Jul  1 2022, 17:28:53) [Clang 13.0.0 (clang-1300.0.27.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urlparse
>>> urlparse('hostname:8080')
ParseResult(scheme='hostname', netloc='', path='8080', params='', query='', fragment='')

Python 3.7

Python 3.7.9 (default, Aug 11 2022, 16:47:29) 
[Clang 13.0.0 (clang-1300.0.27.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urlparse
>>> urlparse('hostname:8080')
ParseResult(scheme='', netloc='', path='hostname:8080', params='', query='', fragment='')
>>> 
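
One way to see why "hostname:8080" is ambiguous: under the RFC 3986 scheme grammar (a letter followed by letters, digits, "+", "-" or "."), the text before the colon is a syntactically valid scheme. The sketch below only illustrates that grammar; `SCHEME_PREFIX` and `looks_like_scheme` are hypothetical names for this example, not part of Requests, urllib3, or the standard library.

```python
import re

# RFC 3986 scheme grammar: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), then ":"
# (hypothetical helper names, for illustration only)
SCHEME_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def looks_like_scheme(url: str) -> bool:
    """Return True if the text before the first ':' is a syntactically valid scheme."""
    return SCHEME_PREFIX.match(url) is not None


for candidate in ("hostname:8080", "http://hostname:8080", "127.0.0.1:8080"):
    print(candidate, looks_like_scheme(candidate))
```

A parser that follows this grammar has no way to tell "hostname:8080" apart from a URL whose scheme happens to be "hostname", whereas "127.0.0.1:8080" cannot be a scheme because it starts with a digit.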

The only reason this happened to work previously is we did this (arguably bad) shuffle of replacing the netloc with the path for this case. That honestly should have never been done but was there to work around the oddities of url parsing in the standard library.

So the short of the story is all versions of Requests will be affected by this as soon as you upgrade beyond Python 3.8. We have no reliable way to control this from the Requests side anymore. I would recommend updating your usage to ensure proxies are passed with a scheme to avoid any surprises later.
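
As a rough sketch of that recommendation, a caller could normalize its proxy mapping before handing it to Requests. The helpers `ensure_proxy_scheme` and `normalize_proxies` below are hypothetical (they are not Requests APIs), and the code assumes that a bare "host:port" value was always meant as an "http://" proxy, which may not hold for every codebase.

```python
def ensure_proxy_scheme(proxy_url: str, default_scheme: str = "http") -> str:
    """Prepend default_scheme when proxy_url has no '://' separator.

    Hypothetical helper: assumes bare "host:port" values mean an HTTP proxy.
    """
    if "://" in proxy_url:
        return proxy_url
    return f"{default_scheme}://{proxy_url}"


def normalize_proxies(proxies: dict, default_scheme: str = "http") -> dict:
    """Return a copy of the proxies mapping with every value carrying a scheme."""
    return {
        key: ensure_proxy_scheme(value, default_scheme)
        for key, value in proxies.items()
    }


proxies = normalize_proxies({"http": "myproxy:8080"})
print(proxies)
```

The resulting mapping can then be passed as the `proxies` argument in the call shown earlier, so the scheme no longer depends on how any URL parser interprets "myproxy:8080".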

Requests 2.25.1 behavior on Python 3.10

>>> response = requests.get(
...     "http://www.example.com",
...     proxies={"http": "myproxy:8080"},
...)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/adapters.py", line 412, in send
    conn = self.get_connection(request.url, proxies)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/adapters.py", line 309, in get_connection
    proxy_manager = self.proxy_manager_for(proxy)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/adapters.py", line 193, in proxy_manager_for
    manager = self.proxy_manager[proxy] = proxy_from_url(
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/urllib3/poolmanager.py", line 492, in proxy_from_url
    return ProxyManager(proxy_url=url, **kw)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/urllib3/poolmanager.py", line 429, in __init__
    raise ProxySchemeUnknown(proxy.scheme)
urllib3.exceptions.ProxySchemeUnknown: Not supported proxy scheme myproxy

nateprewitt avatar May 15 '23 17:05 nateprewitt

thanks for the background @nateprewitt !

I would recommend updating your usage to ensure proxies are passed with a scheme to avoid any surprises later.

totally agree this is the preferred solution, but it's not going to be easy doing that in our monorepo with millions of lines of code... (e.g. the entire Meta Python codebase 😬)

itamaro avatar May 16 '23 00:05 itamaro

Guys should I try solving this or is this redundant?

turingnixstyx avatar May 17 '23 22:05 turingnixstyx

@turingnixstyx I don't think there's anything to be solved. Requests cannot support schemeless proxies in Python 3.9+. It's an unfortunately tedious change for end-users, but we can't do much about the behavior in the standard library. What we're doing now is providing a consistent behavior across all Python versions.

nateprewitt avatar May 17 '23 22:05 nateprewitt

Hey @nateprewitt, I actually really wanted to contribute to something (I'm starting out as a young backend developer in Python). Is there any feature/bug I can work on?

turingnixstyx avatar May 17 '23 22:05 turingnixstyx

@turingnixstyx, there's unfortunately nothing well curated for entry-level development on Requests at the moment. You may consider taking a look at https://github.com/urllib3/urllib3 or another Python project. Specifically, identifying issues that are labeled "Help Wanted", "Contributor Friendly" or "Good First Issue" is a good place to begin.

nateprewitt avatar May 17 '23 22:05 nateprewitt

@nateprewitt appreciate your response

turingnixstyx avatar May 18 '23 06:05 turingnixstyx