LocalFileSystem restore _strip_protocol signature
Hello,
I noticed that the LocalFileSystem._strip_protocol signature changed in #1477, when running the universal_pathlib test-suite against the current fsspec version.
To me it seems that the intention of #1477 was initially to prevent "C:/" or "C:\\" to be reduced to "C:" by _strip_protocol, but in addition it introduced function signature changes in fsspec.implementations.local and minor changes in fsspec.mapping.
I created this PR as a basis for discussion if the signature change could be avoided. In this PR I reverted the changes to LocalFileSystem and make_path_posix but kept the changes to the tests (first commit) and then provide an alternative implementation that avoids the function signature change.
I would also be happy to try to add more tests around the LocalFileSystem._parent, LocalFileSystem._strip_protocol and make_path_posix behavior for edge-cases if there is interest. But windows is not my main OS for daily work, so I am very likely not aware of most of them.
Cheers, Andreas
@fleming79 , this undoes much of your code, but maintains the tests for correctness, while reimplementing consistency for _strip_protocol. Do you have thoughts?
I have no issue with removing the option 'remove_trailing_slash' if it isn't required. It was provided because there were .rstrip('/') calls scattered about.
The question I pose is - which code is more maintainable/performant/correct?
Thanks for the quick response!
Regarding maintainability, I would prefer if the method signatures would agree with AbstractFileSystem. For example, enforcing #1104 (where possible) would make downstream development easier.
Regarding performance, I'm not a big fan of micro-benchmarks anymore, but I ran one (checking 10000 generated URIs) and the implementation in this PR seems to be equally fast/slow (within margin of error) as 2024.3.1 and 2024.3.0 for _strip_protocol. It is ~2% slower for _parent. This was on an M2max arm macbook, only tested with py3.11 and only run a few times. Variation between runs is larger than the differences within a run.
[click to expand] benchmark for `_strip_protocol` and `_parent`
❯ asv run
· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.11
·· Building 518984d4 <master> for virtualenv-py3.11...
·· Installing 518984d4 <master> into virtualenv-py3.11.
· Running 6 total benchmarks (3 commits * 1 environments * 2 benchmarks)
[ 0.00%] · For filesystem_spec commit 518984d4 <master>:
[ 0.00%] ·· Benchmarking virtualenv-py3.11
[ 8.33%] ··· Running (bench_strip_protocol.Suite.time_parent--)..
[25.00%] ··· bench_strip_protocol.Suite.time_parent 207±0.8ms
[33.33%] ··· bench_strip_protocol.Suite.time_split_protocol 205±2ms
[33.33%] · For filesystem_spec commit 4bd16f64 <2024.3.0^0>:
[33.33%] ·· Building for virtualenv-py3.11.
[33.33%] ·· Benchmarking virtualenv-py3.11
[41.67%] ··· Running (bench_strip_protocol.Suite.time_parent--)..
[58.33%] ··· bench_strip_protocol.Suite.time_parent 207±0.7ms
[66.67%] ··· bench_strip_protocol.Suite.time_split_protocol 204±2ms
[66.67%] · For filesystem_spec commit c9eaf46e <local-fs-restore-strip-protocol>:
[66.67%] ·· Building for virtualenv-py3.11....
[66.67%] ·· Benchmarking virtualenv-py3.11
[75.00%] ··· Running (bench_strip_protocol.Suite.time_parent--)..
[91.67%] ··· bench_strip_protocol.Suite.time_parent 211±1ms
[100.00%] ··· bench_strip_protocol.Suite.time_split_protocol 204±2ms
# pip install asv
import itertools
import random
import string
from fsspec import get_filesystem_class
random.seed(0)
def _make_random_path():
# build a random length path
pth_len = random.randint(5, 40)
return "".join(random.sample(string.ascii_letters + "/", k=pth_len))
def _make_uris(n):
# create uris with and without protocols and with and without netlocs
it_proto_netloc = itertools.cycle(
itertools.product(
["file", "local", "wrong", None],
["netloc", ""]
)
)
for _ in range(n):
proto, netloc = next(it_proto_netloc)
pth = _make_random_path()
if proto and netloc:
yield f"{proto}://{netloc}/{pth}"
elif proto:
yield f"{proto}:/{pth}"
else:
yield f"{netloc}/{pth}"
uris = list(_make_uris(10000))
class Suite:
def setup(self):
self.fs_cls = get_filesystem_class("file")
self.uris = uris
def teardown(self):
del self.uris
def time_split_protocol(self):
for uri in self.uris:
self.fs_cls._strip_protocol(uri)
def time_parent(self):
for uri in self.uris:
self.fs_cls._parent(uri)
To be able to answer if this implementation is more or less correct would require more tests. I'll try to find some time to come up with more cases for the test-suite.
Cheers, Andreas
I don't think we can trust timing differences <5%; in these cases it can even matter which order you run them (whether the CPU is warm, etc). The interesting thing would be to check on Windows (but they have lots of path styles).
The original make_path_posix contains quite a lot of strip protocol code in it. It also includes 'sep' as an argument implying it can work for an alternate os.
I'd prefer to see the newer code tweaked to remove the signature.
Here is a starting point.
def make_path_posix(path):
"""Make path generic for current OS"""
if not isinstance(path, str):
if isinstance(path, (list, set, tuple)):
return type(path)(make_path_posix(p) for p in path)
else:
path = str(stringify_path(path))
if os.sep == "/":
# Native posix
if path.startswith("/"):
# most common fast case for posix
return path.rstrip("/") or "/"
elif path.startswith("~"):
return make_path_posix(osp.expanduser(path))
elif path.startswith("./"):
path = path[2:]
path = f"{os.getcwd()}/{path}"
return path.rstrip("/") or "/"
return f"{os.getcwd()}/{path}"
else:
# NT handling
if len(path) > 1:
if path[1] == ":":
# windows full path like "C:\\local\\path"
if len(path) <= 3:
# nt root (something like c:/)
return path[0] + ":/"
path = path.replace("\\", "/").replace("//", "/")
return path
elif path[0] == "~":
return make_path_posix(osp.expanduser(path))
elif path.startswith(("\\\\", "//")):
# windows UNC/DFS-style paths
return "//" + path[2:].replace("\\", "/").replace("//", "/")
return make_path_posix(osp.abspath(path))
@classmethod
def _strip_protocol(cls, path):
path = stringify_path(path)
if path.startswith(("file:",'local')):
if os.sep == '\\' and path.startswith("file:///"):
path = path[8:]
elif path.startswith("file://"):
path = path[7:]
elif path.startswith("file:"):
path = path[5:]
elif path.startswith("local://"):
path = path[8:]
elif path.startswith("local:"):
path = path[6:]
drive, path = osp.splitdrive(make_path_posix(path))
path = path.rstrip("/") or cls.root_marker
return drive + path
Thank you @fleming79 for the proposed implementations. I'll try them out shortly, when I find some more time.
I've been collecting more test cases for _strip_protocol (still need to do _parent). In the test cases I basically try to document current behavior.
So far I noticed that between v2024.3.0 and v2024.3.1 there are the following differences in LocalFileSystem._strip_protocol behavior (note, that a lot of the test cases are probably just uncommon edge-cases):
on posix py310
| input | 2024.3.0 | 2024.3.1 | this PR | proposed variant (todo) |
|---|---|---|---|---|
./ |
/cwd |
/cwd/ |
/cwd |
|
file:// |
/cwd |
/cwd/ |
/cwd |
|
file://./ |
/cwd |
/cwd/ |
/cwd |
on windows py311
| input | 2024.3.0 | 2024.3.1 | this PR | proposed variant (todo) |
|---|---|---|---|---|
file:/// |
C: |
C:/cwd |
C:/ |
|
file://///remote/share/path |
///remote/share/path |
C:/cwd/remote/share/path |
///remote/share/path |
|
file:////remote/share/path |
//remote/share/path |
C:/cwd/remote/share/path |
//remote/share/path |
|
file:/foo/bar |
C:/foo/bar |
C:/cwd/foo/bar |
C:/foo/bar |
|
local:/path |
C:/path |
C:/cwd/path |
C:/path |
|
s3://bucket/key |
s3://bucket/key |
C:/cwd/s3:/bucket/key |
s3://bucket/key |
~I will write more tests for _parent.~ (done)
~(Update: the failing windows test succeeds locally on a windows python==3.11. I'll investigate a bit further...)~
Performance comparison on windows python==3.11, shows that the 2023.3.1 implementation is the fastest and the implementation in this PR is significantly worse.
I'll update this benchmark once I test the implementation provided in the comment above.
outdated benchmark
(venv) PS C:\User\user\development\filesystem_spec> asv run
· Creating environments.........
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.11
·· Building 4bd16f64 <2024.3.0^0> for virtualenv-py3.11...........
·· Installing 4bd16f64 <2024.3.0^0> into virtualenv-py3.11.
· Running 6 total benchmarks (3 commits * 1 environments * 2 benchmarks)
[ 0.00%] · For filesystem_spec commit 4bd16f64 <2024.3.0^0>:
[ 0.00%] ·· Benchmarking virtualenv-py3.11
[ 8.33%] ··· Running (bench_strip_protocol.Suite.time_parent--)..
[25.00%] ··· bench_strip_protocol.Suite.time_parent 22.5±0.3ms
[33.33%] ··· bench_strip_protocol.Suite.time_split_protocol 20.3±0.2ms
[33.33%] · For filesystem_spec commit 516b0626 <local-fs-restore-strip-protocol>:
[33.33%] ·· Building for virtualenv-py3.11........
[33.33%] ·· Benchmarking virtualenv-py3.11
[41.67%] ··· Running (bench_strip_protocol.Suite.time_parent--)..
[58.33%] ··· bench_strip_protocol.Suite.time_parent 43.3±0.3ms
[66.67%] ··· bench_strip_protocol.Suite.time_split_protocol 25.7±0.3ms
[66.67%] · For filesystem_spec commit 47b445ae <2024.3.1^0>:
[66.67%] ·· Building for virtualenv-py3.11...........
[66.67%] ·· Benchmarking virtualenv-py3.11
[75.00%] ··· Running (bench_strip_protocol.Suite.time_parent--)..
[91.67%] ··· bench_strip_protocol.Suite.time_parent 19.5±0.3ms
[100.00%] ··· bench_strip_protocol.Suite.time_split_protocol 16.5±0.2ms
Hello @martindurant and @fleming79
I found some time to continue with this PR, and it would be great to get your feedback.
NOTE: the failing tests seem to be ~unrelated, and sporadic?~ related to a bug in setuptools_scm when using the newest version of git
Summarizing the changes:
- added tests for as many (edge) cases I could think of for
-
make_posix_path -
LocalFileSystem._strip_protocol -
LocalFileSystem._parent
-
- the implementation of
_strip_protocolis now based on @fleming79's suggestion from the comment above with a few minor adjustments:- restored trailing slash handling to 2023.3.0 state
- special handling of "." as a path
- removed the
if len(path) > 1check
- the implementation of
_parentis now using a code-path on windows that is a stripped down version ofntpath.splitdrive, avoiding many checks for cases (byte paths, etc.) that don't seem to be required here.
Performance:
I ran a benchmark on 10000 randomly generated uri's comparing performance of _strip_protocol and _parent
on windows and linux (running in WSL2). The results are as follows:
click to show code for generating the uris
import itertools
import random
import string
from fsspec import get_filesystem_class
random.seed(0)
def _make_random_path(style):
pth_len = random.randint(5, 40)
if style == "posix":
chars = string.ascii_letters + "/"
prefix = ""
elif style == "win":
chars = string.ascii_letters + "\\"
prefix = "c:\\"
elif style == "win-posix":
chars = string.ascii_letters + "/"
prefix = "c:/"
else:
raise ValueError(f"Unknown style {style}")
return prefix + "".join(random.sample(chars, k=pth_len))
def _make_uris(n):
it_proto_netloc_style = itertools.cycle(
itertools.product(
["file", "local", "wrong", None],
["netloc", ""],
["posix", "win", "win-posix"],
)
)
for _ in range(n):
proto, netloc, style = next(it_proto_netloc_style)
pth = _make_random_path(style)
if proto and netloc:
yield f"{proto}://{netloc}/{pth}"
elif proto:
yield f"{proto}:/{pth}"
else:
yield f"{netloc}/{pth}"
uris = list(_make_uris(10000))
class Suite:
def setup(self):
self.fs_cls = get_filesystem_class("file")
self.uris = uris
def teardown(self):
del self.uris
def time_split_protocol(self):
for uri in self.uris:
self.fs_cls._strip_protocol(uri)
def time_parent(self):
for uri in self.uris:
self.fs_cls._parent(uri)
Ubuntu under WSL2
_strip_protocol
| python version | 2024.3.0 |
2024.3.1 |
this PR |
|---|---|---|---|
| 3.8 | 7.32 +- 0.1 ms | 7.15 +- 0.08 ms | 7.37 +- 0.1 ms |
| 3.9 | 6.93 +- 0.1 ms | 7.39 +- 0.3 ms | 7.37 +- 0.6 ms |
| 3.10 | 7.29 +- 0.08 ms | 6.94 +- 0.1 ms | 6.89 +- 0.3 ms |
| 3.11 | 6.33 +- 0.2 ms | 5.92 +- 0.06 ms | 5.97 +- 0.07 ms |
| 3.12 | 6.62 +- 0.1 ms | 5.98 +- 0.03 ms | 6.42 +- 0.05 ms |
_parent
| python version | 2024.3.0 |
2024.3.1 |
this PR |
|---|---|---|---|
| 3.8 | 9.58 +- 0.3 ms | 9.79 +- 0.2 ms | 9.31 +- 0.3 ms |
| 3.9 | 8.61 +- 0.2 ms | 10.3 +- 0.8 ms | 8.44 +- 0.4 ms |
| 3.10 | 8.83 +- 0.2 ms | 9.59 +- 0.4 ms | 9.08 +- 0.6 ms |
| 3.11 | 8.25 +- 0.06 ms | 8.33 +- 0.4 ms | 7.51 +- 0.08 ms |
| 3.12 | 8.05 +- 0.07 ms | 7.93 +- 0.09 ms | 8.14 +- 0.05 ms |
Windows 11
_strip_protocol
| python version | 2024.3.0 |
2024.3.1 |
this PR |
|---|---|---|---|
| 3.8 | 27.1 ± 0.7 ms | 32.2 ± 0.3 ms | 16.9 ± 0.9 ms |
| 3.9 | 27.7 ± 0.3 ms | 33.1 ± 0.2 ms | 15.2 ± 0.2 ms |
| 3.10 | 27.3 ± 0.6 ms | 35.0 ± 0.2 ms | 15.3 ± 0.2 ms |
| 3.11 | 20.9 ± 0.4 ms | 18.4 ± 0.7 ms | 14.7 ± 0.3 ms |
| 3.12 | 20.8 ± 0.2 ms | 16.9 ± 0.2 ms | 15.0 ± 1 ms |
_parent
| python version | 2024.3.0 |
2024.3.1 |
this PR |
|---|---|---|---|
| 3.8 | 31.1 ± 1 ms | 37.6 ± 1 ms | 19.3 ± 0.6 ms |
| 3.9 | 31.1 ± 2 ms | 36.2 ± 0.7 ms | 17.9 ± 0.6 ms |
| 3.10 | 30.5 ± 0.4 ms | 38.8 ± 0.2 ms | 19.0 ± 1 ms |
| 3.11 | 23.7 ± 0.2 ms | 20.9 ± 2 ms | 17.6 ± 0.2 ms |
| 3.12 | 23.7 ± 0.7 ms | 20.4 ± 0.3 ms | 16.8 ± 0.3 ms |
Some open questions:
-
PR #1477 added string coercion to
make_path_posix, causing the newly added tests here to fail. I marked them as xfail for now. It would be great to get your feedback on this. My opinion is that these cases should raise an error. On current mastermake_posix_path(None)will return'/cwd/None'. -
To document the behavior of some uncommon edge cases I added some xfail tests for legacy dos uris (e.g.
file:///c|/foo/bar), rfc8089 uris with localhost (file://localhost/foo/bar), and windows relative paths with drives (c:Path). I don't think it's worth supporting them but if you think otherwise, I can make the changes to add support.
Have a great day! Andreas
Note: the failing CI runs seem to be caused by a new git release? https://github.com/pypa/setuptools_scm/issues/1038
@ap-- It looks like you've made some good changes.
PR https://github.com/fsspec/filesystem_spec/pull/1477 added string coercion to make_path_posix, causing the newly added tests here to fail. I marked them as xfail for now. It would be great to get your feedback on this. My opinion is that these cases should raise an error. On current master make_posix_path(None) will return '/cwd/None'.
I agree those tests are useful and an error should probably be raised for invalid types somehow. I guess removing the coercion would suffice, but maybe it would be better to also check the return type from the call to stringify_path.
causing the newly added tests here to fail. I marked them as xfail for now.
Do those fail with a different exception type now, or just return garbage pathstrings?
the failing CI runs seem to be caused by a new git release? https://github.com/pypa/setuptools_scm/issues/1038
This is for the s3fs build, right? We could pin its requirement on setuptools_scm, wait for a new release, or move that build to hatch (as fsspec has been).
Do those fail with a different exception type now, or just return garbage pathstrings?
Current master returns garbage:
>>> from fsspec.implementations.local import make_path_posix
>>> make_path_posix(object())
'/Users/poehlmann/development/filesystem_spec/<object object at 0x100664b80>'
Possible fixes are:
- change
path = str(stringify_path(path))topath = stringify_path(path)which will then cause anAttributeErrorto be raised due to the missing attribute".startswith". (This is asserted in the test) - check the instance type of
stringify_path(path)and raise a TypeError if notstr - raise a TypeError in the fallthrough case in
stringify_path: https://github.com/fsspec/filesystem_spec/blob/df65747ac92e3493fae6e82081026c8f6df8ba94/fsspec/utils.py#L357-L358
This is for the s3fs build, right? We could pin its requirement on setuptools_scm, wait for a new release, or move that build to hatch (as fsspec has been).
I think this happens when during the s3fs install fsspec gets installed. The traceback shows it failing in hatchling, which uses the hatch-vcs plugin to get the current version via setuptools_scm. setuptools_scm in turn has a compatibility issue with the newest git version which seems to be used on the Github Actions runner already via the conda test-env environment. I'll see if downgrading git will fix the issue.
I'll see if downgrading git will fix the issue.
A specific version of git could be pinned in the CI conda environment file
That made it work ☺️ The py3.11 test failed because the smb tests failed.
rerunning ... I have yet to come up with a way to get SMB to reliably pass every time.
I'm happy with this. @fleming79, any comments?
Awesome.
So we just need a decision regarding make_path_posix behavior:
-
make_path_posix(None)returns/cwd/None(fsspec>=2024.3.1and this PR) -
make_path_posix(None)raisesTypeError(fsspec<=2024.3.0) -
make_path_posix(None)raisesTypeError(might be most intuitive for users?)
Update: I double checked, and <=2024.3.0 raised TypeError because of if "~" in path.
2. seems fine to me, but happy to catch the error and convert to TypeError too (no strong feeling)
I'm happy with 2 .
In stringify_path I suggest changing from isinstance(filepath, pathlib.Path) to isinstance(filepath, pathlib.PurePath).Pathis a subclass ofPurePath. and would mean PurePosixPathandPureWindowsPath` are also converted to strings (which they should). Perhaps some tests involving PurePaths could be added?
I made the changes and added tests for stringify_path.
In
stringify_pathI suggest changing fromisinstance(filepath, pathlib.Path)toisinstance(filepath, pathlib.PurePath).Pathis a subclass ofPurePath. and would meanPurePosixPathandPureWindowsPath` are also converted to strings (which they should). Perhaps some tests involving PurePaths could be added?
PurePath already implements the __fspath__ method, which made the isinstance check after hasattr('fspath') redundant.