urlparse does not correctly handle signs, underscores, and whitespace in port numbers
Background
RFC 3986 (spec for URIs) defines a valid port string with the following grammar rule:
port = *DIGIT
Here's the WHATWG URL spec definition: """ A URL-port string must be one of the following:
- the empty string
- one or more ASCII digits representing a decimal number no greater than $2^{16} − 1$.
"""[^1]
[^1]:Given that this is urlparse and not uriparse, it seems appropriate that we do not accept port numbers outside range(2**16), even though such numbers are allowed by RFC 3986.
The bug
This is the port string parsing code from Lib/urllib/parse.py:166-176:
def port(self):
port = self._hostinfo[1]
if port is not None:
try:
port = int(port, 10)
except ValueError:
message = f'Port could not be cast to integer value as {port!r}'
raise ValueError(message) from None
if not ( 0 <= port <= 65535):
raise ValueError("Port out of range 0-65535")
return port
This will erroneously validate strings "-0" and f"+{x}" for any value of x in the valid range. Given that + and - are not digits, this behavior is in violation of both specifications.
This bug is easily reproducible with the following snippet:
from urllib.parse import urlparse
url1 = urlparse("http://python.org:-0")
url2 = urlparse("http://python.org:+80")
print(url1.port) # prints 0, but error is expected
print(url2.port) # prints 80, but error is expected
Happy to submit a PR, but don't want to step on any toes over at #25774.
My environment
- CPython version tested on:
- 3.10.6
- Operating system and architecture:
- Arch Linux x86_64
Similarly we accept urlparse("http://python.org:10_000").port.
Do you think this has any practical effect, for example on security? Most likely we'll have to go through a deprecation cycle to remove the noncompliant behavior.
We also accept urlparse("http://python.org:\r\n80\r\n").port, because int trims whitespace.
I think it's plausible that there are security concerns stemming from desyncing a front-end server using urlparse from a backend server using something else. I can imagine, for instance, another implementation of a permissive URL parser interpreting "http://python.org:80_80" as having port 80, leading to disagreement between the servers as to port in question. I could also see the use of whitespace in the port string (especially "\r\n") as a potential source of confusion for HTTP servers, because even though urlparse has no problem with "http://python.org:80\r\n", a CRLF ends the HTTP 1.1 start line and would lead to an invalid HTTP message. I'm missing the creativity right now to string all of this together into something tangible, but I feel like it could be done.
I just ran into this issue with whitespace. It didn't cause a security issue, but rather a bunch of unexpected errors:
>>> urlparse(' https://example.com ')
ParseResult(scheme='', netloc='', path=' https://example.com ', params='', query='', fragment='')
I just ran into this issue with whitespace. It didn't cause a security issue, but rather a bunch of unexpected errors:
>>> urlparse(' https://example.com ') ParseResult(scheme='', netloc='', path=' https://example.com ', params='', query='', fragment='')
This is definitely weird behavior, but maybe not a bug. RFC3986 says a URI scheme must begin with an ALPHA, so I think we're right in determining that there is no scheme (even though the RFC says there must be a scheme, I think the WHATWG spec says you can omit it in some situations. It's unclear to me.) The WHATWG spec also says that any ASCII string is a valid URL path, which is also in disagreement with the RFC. If we're going with the WHATWG interpretation, then ' https://example.com ' is a perfectly valid URL path and nothing has been parsed incorrectly. If we're going with the RFC, then urlparse's behavior is not good.
Another similar bug:
>>> urlparse("https:// example.com ")
ParseResult(scheme='https', netloc=' example.com ', path='', params='', query='', fragment='')
This time, putting space after the :// leads to spaces in the hostname. That's also wrong according to RFC, and okay according to WHATWG.
Hmmm... interesting, what about the case without a scheme with port?
>>> urlparse('www.example.com:80')
ParseResult(scheme='www.example.com', netloc='', path='80', params='', query='', fragment='')
Hmmm... interesting, what about the case without a scheme with port?
>>> urlparse('www.example.com:80') ParseResult(scheme='www.example.com', netloc='', path='80', params='', query='', fragment='')
This is just objectively wrong, and should probably be its own bug report. Good find.
Most likely we'll have to go through a deprecation cycle to remove the noncompliant behavior.
I don't think so. This is clearly a bug, not a feature someone should depend on.
All the backports are merged, thanks for reporting and fixing this issue!