yarl ValueError on path remove_dot_segments when there's extra dot-dot (`/../`) segments

🐞 Describe the bug As described in RFC 3986 § 5.2.4. Remove Dot Segments, the remove_dot_segments algorithm removes any extra /../ parts of the URL, ignoring errors when the stack is empty.

However, yarl.URL() behavior at the moment is to raise an exception, ValueError, when that happens.

💡 To Reproduce Instantiate URL class with such a URL string:

yarl.URL('https://example.com/alice/../../eve')

💡 Expected behavior Follow the RFC, and ignore the error, to get it working like this:

In [16]: yarl.URL('https://example.com/alice/../eve')
Out[16]: URL('https://example.com/eve')

In [17]: yarl.URL('https://example.com/alice/../../eve')
Out[17]: URL('https://example.com/eve')

In [18]: yarl.URL('https://example.com/alice/../../../eve')
Out[18]: URL('https://example.com/eve')

We probably want to also test (and fix, if needed), any other parts of the API that would involve path resolution (including remove_dot_segments) steps.

📋 Logs/tracebacks

In [15]: yarl.URL('https://example.com/alice/../../eve')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-e72c86c9e132> in <module>()
----> 1 yarl.URL('https://example.com/alice/../../eve')

~/ans/venv/lib/python3.6/site-packages/yarl/_url.py in __new__(cls, val, encoded, strict)
    181                 path = cls._normalize_path(path)
    182
--> 183             cls._validate_authority_uri_abs_path(host=host, path=path)
    184             query = cls._QUERY_REQUOTER(val[3])
    185             fragment = cls._FRAGMENT_REQUOTER(val[4])

~/ans/venv/lib/python3.6/site-packages/yarl/_url.py in _validate_authority_uri_abs_path(host, path)
    675         if len(host) > 0 and len(path) > 0 and not path.startswith("/"):
    676             raise ValueError(
--> 677                 "Path in a URL with authority should start with a slash ('/') if set"
    678             )
    679

ValueError: Path in a URL with authority should start with a slash ('/') if set

📋 Your version of the Python

$ python --version
Python 3.6.10

📋 Your version of the aiohttp/yarl/multidict distributions

$ python -m pip show aiohttp
Name: aiohttp
Version: 3.6.2

$ python -m pip show multidict
Name: multidict
Version: 4.7.6

$ python -m pip show yarl
Name: yarl
Version: 1.6.2

📋 Additional context The proposed behavior would put yarl.URL on par with DOM's URL object, which already follows the RFC on this:

> String(new URL("https://example.com/alice/../eve"))
< "https://example.com/eve"
> String(new URL("https://example.com/alice/../../eve"))
< "https://example.com/eve"
> String(new URL("https://example.com/alice/../../../eve"))
< "https://example.com/eve"

As well as rust-url library (from the unit tests):

  {
    "input": "http://example.com/foo/bar/../ton/../../a",
    // ...
    "pathname": "/a",
    // ...
  },
  {
    "input": "http://example.com/foo/../../..",
    // ...
    "pathname": "/",
    // ...
  },
  {
    "input": "http://example.com/foo/../../../ton",
    // ...
    "pathname": "/ton",
    // ...
  },

which has tests based on (warning: broken links):

  "# Based on http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/script-tests/segments.js",
  "# AS OF https://github.com/jsdom/whatwg-url/commit/35f04dfd3048cf6362f4398745bb13375c5020c2",

Oct 14 '20 19:10 besfahbod

Interesting. I (maybe wrong) read the RFC as "remove_dot_segments is applicable to path joining procedure only". I agree that the same behavior as mentioned JavaScript and Rust libraries is desirable.

A pull request is welcome!

Nov 14 '20 20:11 asvetlov

So, here's the relevant pieces from the RFC:

And:

And the algorithm under section 5.2.2. Transform References.

I think what we need to pay attention to is the difference between a "Reference URL" and a "Target URL". A Target URL is expected to be resolved and normalized, meaning that it's expected to not have any . or .. sequences.

From that perspective, I believe the input to yarl.URL() is expected to be a Reference URL. Does that sound about right?

We could also consider comparing the rules with https://url.spec.whatwg.org/ , which seem to be more close to the common "URL Library" implementations.

Anyways, since we do desire the compatibility, as suggested above, I'm going to get it to work as suggestion.

Jan 15 '21 23:01 besfahbod

which has tests based on (warning: broken links):

  "# Based on http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/script-tests/segments.js",
  "# AS OF https://github.com/jsdom/whatwg-url/commit/35f04dfd3048cf6362f4398745bb13375c5020c2",

The last revision in the Webkit LayoutTests repository with such a file is 149680; the next rev folded the segments.js file into segments.html. It doesn't actually contain any tests for .. and . handling.

If you were looking at urltestdata.json (on which the Rust tests are actually based), you may have missed that there are multiple # Based on comments, the relevant section is based on paths.js in the same Webkit test suite, nowadays at LayoutTests/fast/url/paths.html

Jun 14 '22 16:06 mjpieters

I've converted those tests from JSON to a small Python script:

from yarl import URL
from urllib.parse import unquote, quote

cases = [
    ("/././foo", "/foo"),
    ("/./.foo", "/.foo"),
    ("/foo/.", "/foo/"),
    ("/foo/./", "/foo/"),
    ("/foo/bar/..", "/foo/"),
    ("/foo/bar/../", "/foo/"),
    ("/foo/..bar", "/foo/..bar"),
    ("/foo/bar/../ton", "/foo/ton"),
    ("/foo/bar/../ton/../../a", "/a"),
    ("/foo/../../..", "/"),
    ("/foo/../../../ton", "/ton"),
    ("/foo/%2e", "/foo/"),
    ("/foo/%2e%2", "/foo/%2e%2"),
    ("/foo/%2e./%2e%2e/.%2e/%2e.bar", "/%2e.bar"),
    ("////../..", "//"),
    ("/foo/bar//../..", "/foo/"),
    ("/foo/bar//..", "/foo/bar/"),
    ("/foo", "/foo"),
    ("/%20foo", "/%20foo"),
    ("/foo%", "/foo%"),
    ("/foo%2", "/foo%2"),
    ("/foo%2zbar", "/foo%2zbar"),
    ("/foo%2Â©zbar", "/foo%2%C3%82%C2%A9zbar"),
    ("/foo%41%7a", "/foo%41%7a"),
    ("/foo\t\u0091%91", "/foo%C2%91%91"),
    ("/foo%00%51", "/foo%00%51"),
    ("/(%28:%3A%29)", "/(%28:%3A%29)"),
    ("/%3A%3a%3C%3c", "/%3A%3a%3C%3c"),
    ("/foo\tbar", "/foobar"),
    ("\\\\foo\\\\bar", "//foo//bar"),
    ("/%7Ffp3%3Eju%3Dduvgw%3Dd", "/%7Ffp3%3Eju%3Dduvgw%3Dd"),
    ("/@asdf%40", "/@asdf%40"),
    ("/你好你好", "/%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD"),
    ("/‥/foo", "/%E2%80%A5/foo"),
    ("//foo", "/%EF%BB%BF/foo"),
    ("/\u202e/foo/\u202d/bar", "/%E2%80%AE/foo/%E2%80%AD/bar"),
]
requote = False
for i, (test, expected) in enumerate(cases, 1):
    if requote:
        expected = quote(
            unquote(expected, encoding="latin1"), safe="/@:()=", encoding="latin1"
        )
    try:
        actual = URL(f"http://example.com{test}").raw_path
    except ValueError as e:
        actual = f"exception: {e}"

    if actual != expected:
        print(f"({i}) \x1b[31mFAIL\x1b[0m: {test}")
        print(f"         {actual}\n      != {expected}")

and this outputs:

(11) FAIL: /foo/../../../ton
         exception: Path in a URL with authority should start with a slash ('/') if set
      != /ton
(13) FAIL: /foo/%2e%2
         /foo/.%252
      != /foo/%2e%2
(14) FAIL: /foo/%2e./%2e%2e/.%2e/%2e.bar
         exception: Path in a URL with authority should start with a slash ('/') if set
      != /%2e.bar
(20) FAIL: /foo%
         /foo%25
      != /foo%
(21) FAIL: /foo%2
         /foo%252
      != /foo%2
(22) FAIL: /foo%2zbar
         /foo%252zbar
      != /foo%2zbar
(23) FAIL: /foo%2Â©zbar
         /foo%252%C3%82%C2%A9zbar
      != /foo%2%C3%82%C2%A9zbar
(24) FAIL: /foo%41%7a
         /fooAz
      != /foo%41%7a
(26) FAIL: /foo%00%51
         /foo%00Q
      != /foo%00%51
(27) FAIL: /(%28:%3A%29)
         /((::))
      != /(%28:%3A%29)
(28) FAIL: /%3A%3a%3C%3c
         /::%3C%3C
      != /%3A%3a%3C%3c
(30) FAIL: \\foo\\bar
         /
      != //foo//bar
(31) FAIL: /%7Ffp3%3Eju%3Dduvgw%3Dd
         /%7Ffp3%3Eju=duvgw=d
      != /%7Ffp3%3Eju%3Dduvgw%3Dd
(32) FAIL: /@asdf%40
         /@asdf@
      != /@asdf%40

Several of these are not actual failures; yarl.URL() gives us equivalent paths with valid %hh substitutions for 13, 20-24, 26-28, 31 and 32, and in addition 28 also replaced %3c with %3C, so uppercase hex instead of lowercase. Those are fine, and those disappear when you set requote to True.

That leaves 11, 14 and 30:

(11) FAIL: /foo/../../../ton
         exception: Path in a URL with authority should start with a slash ('/') if set
      != /ton
(14) FAIL: /foo/%2e./%2e%2e/.%2e/%2e.bar
         exception: Path in a URL with authority should start with a slash ('/') if set
      != /%2e.bar
(30) FAIL: \\foo\\bar
         /
      != //foo//bar

I'm not so sure about turning backslashes into forward slashes nor am I inclined to research that specific issue right now.

The following alteration to the _normalize_path() loop makes 11 and 14 pass:

        for seg in segments:
            if seg not in (".", ".."):
                resolved_path.append(seg)
            elif seg == ".." and (len(resolved_path) > 1 or resolved_path[-1]):
                resolved_path.pop()
            # seg == "."

Jun 14 '22 17:06 mjpieters

Thanks for working on this, @mjpieters! I generally agree with the direction here.

Specifically, I think not-raising-an-exception is the most-important problem to fix here, as it causes processing of user-provided data to fail in a place where no such errors are expected.

Jun 14 '22 17:06 besfahbod

Here's an alternative version, one avoids the (slower) if check for each path segment like the original did:

        prefix, resolved_path = "", []
        if path.startswith("/"):
            # preserve the "/" root element of absolute paths.
            prefix = "/"
            path = path[1:]

        segments = path.split("/")
        for seg in segments:
            if seg == "..":
                # ignore any .. segments that would otherwise cause an
                # IndexError when popped from resolved_path if
                # resolving for rfc3986
                with suppress(IndexError):
                    resolved_path.pop()
            elif seg != ".":
                resolved_path.append(seg)

        if segments and segments[-1] in (".", ".."):
            # do some post-processing here.
            # if the last segment was a relative dir,
            # then we need to append the trailing '/'
            resolved_path.append("")

        return prefix + "/".join(resolved_path)

Jun 14 '22 19:06 mjpieters

yarl yarl copied to clipboard

ValueError on path remove_dot_segments when there's extra dot-dot (`/../`) segments

yarl
yarl copied to clipboard