yarl
yarl copied to clipboard
ValueError on path remove_dot_segments when there's extra dot-dot (`/../`) segments
🐞 Describe the bug
As described in RFC 3986 § 5.2.4. Remove Dot Segments, the remove_dot_segments
algorithm removes any extra /../
parts of the URL, ignoring errors when the stack is empty.
However, yarl.URL()
behavior at the moment is to raise an exception, ValueError
, when that happens.
💡 To Reproduce Instantiate URL class with such a URL string:
yarl.URL('https://example.com/alice/../../eve')
💡 Expected behavior Follow the RFC, and ignore the error, to get it working like this:
In [16]: yarl.URL('https://example.com/alice/../eve')
Out[16]: URL('https://example.com/eve')
In [17]: yarl.URL('https://example.com/alice/../../eve')
Out[17]: URL('https://example.com/eve')
In [18]: yarl.URL('https://example.com/alice/../../../eve')
Out[18]: URL('https://example.com/eve')
We probably want to also test (and fix, if needed), any other parts of the API that would involve path resolution (including remove_dot_segments
) steps.
📋 Logs/tracebacks
In [15]: yarl.URL('https://example.com/alice/../../eve')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-e72c86c9e132> in <module>()
----> 1 yarl.URL('https://example.com/alice/../../eve')
~/ans/venv/lib/python3.6/site-packages/yarl/_url.py in __new__(cls, val, encoded, strict)
181 path = cls._normalize_path(path)
182
--> 183 cls._validate_authority_uri_abs_path(host=host, path=path)
184 query = cls._QUERY_REQUOTER(val[3])
185 fragment = cls._FRAGMENT_REQUOTER(val[4])
~/ans/venv/lib/python3.6/site-packages/yarl/_url.py in _validate_authority_uri_abs_path(host, path)
675 if len(host) > 0 and len(path) > 0 and not path.startswith("/"):
676 raise ValueError(
--> 677 "Path in a URL with authority should start with a slash ('/') if set"
678 )
679
ValueError: Path in a URL with authority should start with a slash ('/') if set
📋 Your version of the Python
$ python --version
Python 3.6.10
📋 Your version of the aiohttp/yarl/multidict distributions
$ python -m pip show aiohttp
Name: aiohttp
Version: 3.6.2
$ python -m pip show multidict
Name: multidict
Version: 4.7.6
$ python -m pip show yarl
Name: yarl
Version: 1.6.2
📋 Additional context
The proposed behavior would put yarl.URL
on par with DOM's URL
object, which already follows the RFC on this:
> String(new URL("https://example.com/alice/../eve"))
< "https://example.com/eve"
> String(new URL("https://example.com/alice/../../eve"))
< "https://example.com/eve"
> String(new URL("https://example.com/alice/../../../eve"))
< "https://example.com/eve"
As well as rust-url
library (from the unit tests):
{
"input": "http://example.com/foo/bar/../ton/../../a",
// ...
"pathname": "/a",
// ...
},
{
"input": "http://example.com/foo/../../..",
// ...
"pathname": "/",
// ...
},
{
"input": "http://example.com/foo/../../../ton",
// ...
"pathname": "/ton",
// ...
},
which has tests based on (warning: broken links):
"# Based on http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/script-tests/segments.js",
"# AS OF https://github.com/jsdom/whatwg-url/commit/35f04dfd3048cf6362f4398745bb13375c5020c2",
Interesting. I (maybe wrong) read the RFC as "remove_dot_segments is applicable to path joining procedure only". I agree that the same behavior as mentioned JavaScript and Rust libraries is desirable.
A pull request is welcome!
So, here's the relevant pieces from the RFC:
And:
And the algorithm under section 5.2.2. Transform References.
I think what we need to pay attention to is the difference between a "Reference URL" and a "Target URL". A Target URL is expected to be resolved and normalized, meaning that it's expected to not have any .
or ..
sequences.
From that perspective, I believe the input to yarl.URL()
is expected to be a Reference URL. Does that sound about right?
We could also consider comparing the rules with https://url.spec.whatwg.org/ , which seem to be more close to the common "URL Library" implementations.
Anyways, since we do desire the compatibility, as suggested above, I'm going to get it to work as suggestion.
which has tests based on (warning: broken links):
"# Based on http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/script-tests/segments.js", "# AS OF https://github.com/jsdom/whatwg-url/commit/35f04dfd3048cf6362f4398745bb13375c5020c2",
The last revision in the Webkit LayoutTests repository with such a file is 149680; the next rev folded the segments.js
file into segments.html
. It doesn't actually contain any tests for ..
and .
handling.
If you were looking at urltestdata.json
(on which the Rust tests are actually based), you may have missed that there are multiple # Based on
comments, the relevant section is based on paths.js
in the same Webkit test suite, nowadays at LayoutTests/fast/url/paths.html
I've converted those tests from JSON to a small Python script:
from yarl import URL
from urllib.parse import unquote, quote
cases = [
("/././foo", "/foo"),
("/./.foo", "/.foo"),
("/foo/.", "/foo/"),
("/foo/./", "/foo/"),
("/foo/bar/..", "/foo/"),
("/foo/bar/../", "/foo/"),
("/foo/..bar", "/foo/..bar"),
("/foo/bar/../ton", "/foo/ton"),
("/foo/bar/../ton/../../a", "/a"),
("/foo/../../..", "/"),
("/foo/../../../ton", "/ton"),
("/foo/%2e", "/foo/"),
("/foo/%2e%2", "/foo/%2e%2"),
("/foo/%2e./%2e%2e/.%2e/%2e.bar", "/%2e.bar"),
("////../..", "//"),
("/foo/bar//../..", "/foo/"),
("/foo/bar//..", "/foo/bar/"),
("/foo", "/foo"),
("/%20foo", "/%20foo"),
("/foo%", "/foo%"),
("/foo%2", "/foo%2"),
("/foo%2zbar", "/foo%2zbar"),
("/foo%2©zbar", "/foo%2%C3%82%C2%A9zbar"),
("/foo%41%7a", "/foo%41%7a"),
("/foo\t\u0091%91", "/foo%C2%91%91"),
("/foo%00%51", "/foo%00%51"),
("/(%28:%3A%29)", "/(%28:%3A%29)"),
("/%3A%3a%3C%3c", "/%3A%3a%3C%3c"),
("/foo\tbar", "/foobar"),
("\\\\foo\\\\bar", "//foo//bar"),
("/%7Ffp3%3Eju%3Dduvgw%3Dd", "/%7Ffp3%3Eju%3Dduvgw%3Dd"),
("/@asdf%40", "/@asdf%40"),
("/你好你好", "/%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD"),
("/‥/foo", "/%E2%80%A5/foo"),
("//foo", "/%EF%BB%BF/foo"),
("/\u202e/foo/\u202d/bar", "/%E2%80%AE/foo/%E2%80%AD/bar"),
]
requote = False
for i, (test, expected) in enumerate(cases, 1):
if requote:
expected = quote(
unquote(expected, encoding="latin1"), safe="/@:()=", encoding="latin1"
)
try:
actual = URL(f"http://example.com{test}").raw_path
except ValueError as e:
actual = f"exception: {e}"
if actual != expected:
print(f"({i}) \x1b[31mFAIL\x1b[0m: {test}")
print(f" {actual}\n != {expected}")
and this outputs:
(11) FAIL: /foo/../../../ton
exception: Path in a URL with authority should start with a slash ('/') if set
!= /ton
(13) FAIL: /foo/%2e%2
/foo/.%252
!= /foo/%2e%2
(14) FAIL: /foo/%2e./%2e%2e/.%2e/%2e.bar
exception: Path in a URL with authority should start with a slash ('/') if set
!= /%2e.bar
(20) FAIL: /foo%
/foo%25
!= /foo%
(21) FAIL: /foo%2
/foo%252
!= /foo%2
(22) FAIL: /foo%2zbar
/foo%252zbar
!= /foo%2zbar
(23) FAIL: /foo%2©zbar
/foo%252%C3%82%C2%A9zbar
!= /foo%2%C3%82%C2%A9zbar
(24) FAIL: /foo%41%7a
/fooAz
!= /foo%41%7a
(26) FAIL: /foo%00%51
/foo%00Q
!= /foo%00%51
(27) FAIL: /(%28:%3A%29)
/((::))
!= /(%28:%3A%29)
(28) FAIL: /%3A%3a%3C%3c
/::%3C%3C
!= /%3A%3a%3C%3c
(30) FAIL: \\foo\\bar
/
!= //foo//bar
(31) FAIL: /%7Ffp3%3Eju%3Dduvgw%3Dd
/%7Ffp3%3Eju=duvgw=d
!= /%7Ffp3%3Eju%3Dduvgw%3Dd
(32) FAIL: /@asdf%40
/@asdf@
!= /@asdf%40
Several of these are not actual failures; yarl.URL()
gives us equivalent paths with valid %hh
substitutions for 13, 20-24, 26-28, 31 and 32, and in addition 28 also replaced %3c
with %3C
, so uppercase hex instead of lowercase. Those are fine, and those disappear when you set requote
to True
.
That leaves 11, 14 and 30:
(11) FAIL: /foo/../../../ton
exception: Path in a URL with authority should start with a slash ('/') if set
!= /ton
(14) FAIL: /foo/%2e./%2e%2e/.%2e/%2e.bar
exception: Path in a URL with authority should start with a slash ('/') if set
!= /%2e.bar
(30) FAIL: \\foo\\bar
/
!= //foo//bar
I'm not so sure about turning backslashes into forward slashes nor am I inclined to research that specific issue right now.
The following alteration to the _normalize_path()
loop makes 11 and 14 pass:
for seg in segments:
if seg not in (".", ".."):
resolved_path.append(seg)
elif seg == ".." and (len(resolved_path) > 1 or resolved_path[-1]):
resolved_path.pop()
# seg == "."
Thanks for working on this, @mjpieters! I generally agree with the direction here.
Specifically, I think not-raising-an-exception is the most-important problem to fix here, as it causes processing of user-provided data to fail in a place where no such errors are expected.
Here's an alternative version, one avoids the (slower) if
check for each path segment like the original did:
prefix, resolved_path = "", []
if path.startswith("/"):
# preserve the "/" root element of absolute paths.
prefix = "/"
path = path[1:]
segments = path.split("/")
for seg in segments:
if seg == "..":
# ignore any .. segments that would otherwise cause an
# IndexError when popped from resolved_path if
# resolving for rfc3986
with suppress(IndexError):
resolved_path.pop()
elif seg != ".":
resolved_path.append(seg)
if segments and segments[-1] in (".", ".."):
# do some post-processing here.
# if the last segment was a relative dir,
# then we need to append the trailing '/'
resolved_path.append("")
return prefix + "/".join(resolved_path)