universal_pathlib

Document cross filesystem compatible definition of "/", ".", "", ".."

Open ap-- opened this issue 3 months ago • 3 comments

Originally posted by @cscutcher in #494

For me the big selling point of the library is accepting UPaths but being completely agnostic about what backend is in use. It's probably not possible to be 100% agnostic, but I think a good starting place would be if there were at least clear definitions documented for the meaning of UPath(""), UPath("/"), UPath(".") and UPath("..") .

All 4 examples you provide will return PosixUPath and WindowsUPath instances depending on your operating system. This is because the provided first argument is a non-uri-like path and the protocol keyword parameter is unset. Both PosixUPath and WindowsUPath can basically be thought of as a pathlib.Path subclass with the additional attributes/methods that UPath provides.

In all four cases, before I started thinking about it in the context of UPath, I would have naively said I had a clear understanding of what those paths mean. However, once I started thinking about specifics, especially in the case of UPath's filesystem agnostic approach, I realised I was a bit clueless!

For context, in case it's helpful in your design considerations, I was using the memory backend primarily for testing, so in my case I really wanted behaviour to be as similar as possible to the local filesystem. In the end I awkwardly subclassed MemoryFileSystem and MemoryPath to work around this issue, and also to implement symlink support, which I believe is missing in MemoryFileSystem. I imagine it's possible that my choice of using MemoryFileSystem as a mock local filesystem goes against the original intent for it, so maybe I was doomed from the start!

The fsspec MemoryFileSystem is indeed most commonly used as a testing filesystem. So your intuition was right here. In general I would avoid symlinks if you want cross-filesystem compatible interactions.

This is because symlinks don't exist on object stores and many other filesystems. On some, HTTP filesystems for example, you could interpret redirects as symlinks, but once you go into the details it's non-trivial.

Another small comment: how to handle relative paths in general seems an interesting challenge. I'm sure there are good reasons why this isn't the case, but it seems to me that relative paths shouldn't necessarily be tied to any protocol or backend. I can see why making UPath("foo/bar") implicitly a path relative to cwd on the local file system would be necessary to make .open etc work as a user might expect, but it would be nice to be able to have an explicitly relative path. To me, in a backend agnostic world, a path like foo/bar should only get tied to a specific backend when it's combined with some absolute path object. On its own it only states "the subdirectory bar, which is the subdirectory of foo", which should be possible to apply to any filesystem backend equally, if that makes sense.

Unfortunately, relative paths can't be fully decoupled from their filesystem implementations. This all stems from the fact that (1) fsspec paths are always absolute and (2) fsspec has no strict definition of what these paths can be. So a relative path like foo/../bar or foo//bar would mean something different on a local filesystem than in an s3 bucket.
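As a rough illustration of why the same relative string can mean different things, the sketch below contrasts POSIX-style normalization (using the standard library's `posixpath.normpath`) with a toy object store modelled as a plain dict whose keys are opaque strings. The dict is only a stand-in assumed for illustration; it is not how s3fs or any other fsspec backend is implemented.

```python
import posixpath

# POSIX-style interpretation: ".." navigates upwards and "//" collapses.
local_a = posixpath.normpath("foo/../bar")  # "bar"
local_b = posixpath.normpath("foo//bar")    # "foo/bar"

# Toy object store (hypothetical stand-in): keys are opaque strings, so
# nothing collapses and each spelling names a different object.
bucket = {
    "foo/bar": b"one",
    "foo//bar": b"two",
    "foo/../bar": b"three",
}
distinct_keys = len(bucket)  # three separate objects
```

Under the POSIX reading the three spellings reduce to two locations, while the toy store keeps three unrelated objects.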

All that being said, you have a few options to get what you want:

  1. make a relative UPath: (while not supported directly from the constructor, you can make one via relative_to)
    >>> from upath import UPath
    >>> UPath("s3://bucket/foo/bar").relative_to(UPath("s3://bucket/"))
    <relative S3Path 'foo/bar'>
    
  2. consistently use resolve() before file access to ensure . and .. are handled in a pathlib like interpretation:
    >>> from upath import UPath
    >>> UPath("bucket/foo/bar", protocol="s3").joinpath("../abc").resolve()
    S3Path('bucket/foo/abc', protocol='s3')
    
  3. in internal projects that require loads of path traversals, I usually define all relative locations as PurePosixPath instances, and allow a base UPath to be provided to determine the root of the absolute filesystem location.
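Option 3 could look roughly like the sketch below. The `locate` helper, the `CONFIG` constant, and the example paths are hypothetical names made up for illustration; the base only needs a pathlib-style `joinpath`, so a `PurePosixPath` stands in here for what would be a `UPath` in a real project.

```python
from pathlib import PurePosixPath


def locate(base, relative: PurePosixPath):
    """Join a backend-agnostic relative location onto an absolute base path."""
    if relative.is_absolute():
        raise ValueError(f"expected a relative location, got {relative}")
    if ".." in relative.parts:
        # Rejected here because ".." is not interpreted the same way everywhere.
        raise ValueError(f"'..' is not allowed in {relative}")
    return base.joinpath(*relative.parts)


# Relative locations are declared once, independent of any backend.
CONFIG = PurePosixPath("settings/config.toml")

# Stand-in base; in practice this could be e.g. UPath("s3://bucket/project").
base = PurePosixPath("/data/project")
target = locate(base, CONFIG)  # /data/project/settings/config.toml
```

Keeping the relative part as a pure path means it never touches a filesystem until it is combined with a base.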

ap-- Dec 03 '25 16:12

Thanks! I really appreciate you adding this context. A couple of additional comments, feel free to disregard if not helpful.

Regarding definitions

All 4 examples you provide will return PosixUPath and WindowsUPath instances depending on your operating system...

That's useful, but what I was also kinda getting at is the semantic meaning, e.g. "/ is always the root of a filesystem. The root is always absolute" etc.

According to the posix doc a single dot "...shall refer to the directory specified by its predecessor", but as you say this meaning may not be consistently applied to other backends.

An empty path (i.e. UPath("")) is valid, but what does it represent? With no protocol (i.e. passing through to Path) UPath("") is equivalent to UPath("."), but for memory UPath("", protocol="memory") == UPath(".", protocol="memory") . For memory UPath("", protocol="memory") seems to be equivalent to /, presumably as a consequence of . being a valid path for the memory backend. Either way, regardless of the implementation, I think there's still the question of what we mean with path UPath("", protocol=some_proto).
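For reference, the pathlib baseline for these four special paths can be checked with the standard library alone. The sketch uses `PurePosixPath` so that it behaves the same on every operating system; it says nothing about how other protocols treat the same strings.

```python
from pathlib import PurePosixPath

empty = PurePosixPath("")
dot = PurePosixPath(".")
root = PurePosixPath("/")
dotdot = PurePosixPath("..")

# pathlib treats "" and "." as the same path, rendered as "."
empty_equals_dot = empty == dot
empty_as_str = str(empty)

# "/" is the only one of the four that is absolute
absolute_flags = [p.is_absolute() for p in (empty, dot, root, dotdot)]

# a "." inside a path is dropped, but ".." is kept as a literal part
dot_collapsed = PurePosixPath("foo/./bar") == PurePosixPath("foo/bar")
dotdot_kept = PurePosixPath("foo/../bar").parts
```

Note that pure pathlib never collapses `..` lexically, since doing so could change the meaning of the path when symlinks are involved.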

I think if you can get somewhat concise definitions of these special paths documented, and their scope/universality, it'd definitely be useful as a user of the library, but also, I imagine, as a way to spot bugs/improve testing, e.g. I'm not sure if UPath("", protocol="memory") == UPath(".", protocol="memory") is a bug or intended behaviour.

I know there are other sources for some of this information. I'm aware of;

but none of these serve well as a concise definition of the pseudo-standard path "language", especially when considering the wider array of back-ends served by this project. That said, I can also see reasons you might choose to avoid getting into defining some of this stuff, especially if an appropriate answer might be "defer to pathlib" etc. I suppose it would depend on where you'd like to be on the scale between:

  • On one end, a pretty thorough abstraction that can be used without much understanding of the underlying backend libraries,
  • On the other end, a thin shim used to provide compatibility, which requires direct use of, and understanding of, the underlying fsspec implementations, and the libraries those implementations depend on.

Regarding symlinks on memory filesystem

In general I would avoid symlinks if you want cross-filesystem compatible interactions.

This is a useful point to consider when designing stuff around UPath and fsspec. However, for my project, one of the things it needs to do is scan and compare symlinks to each other, so it's a bit unavoidable. The reason that symlinks end up as a bit of an afterthought makes sense though, but it would still be great to have the option of a drop in mock filesystem that mimics a "real" linux-y filesystem. I suppose we might imagine a more featureful memory filesystem that has flags to enable symlinks, or whether "." is valid, etc, so that it can be used in place of other filesystem backends in testing.

cscutcher Dec 04 '25 16:12

I suppose we might imagine a more featureful memory filesystem that has flags to enable symlinks, or whether "." is valid, etc, so that it can be used in place of other filesystem backends in testing.

I opened an issue for discussing symlink support #500

ap-- Dec 04 '25 16:12

I think if you can get somewhat concise definitions of these special paths documented, and their scope/universality, it'd definitely be useful as a user of the library, but also, I imagine, as a way to spot bugs/improve testing, e.g. I'm not sure if UPath("", protocol="memory") == UPath(".", protocol="memory") is a bug or intended behaviour.

Formalizing the meaning of these special paths would be a great addition to the concepts section of the docs and to the base test cases, and would guarantee UPath behavior for future releases.

To add a bit of historical context:

I initially assumed that the minimal path parsing interface of fsspec filesystems (defined as an interface here: https://github.com/fsspec/universal_pathlib/blob/63b6f1c9df7bc0bfd773188f2ef72c88a279ed23/upath/_flavour_sources.py#L60-L77) would be able to provide all the functionality needed to implement path parsers for each filesystem type. Within filesystem_spec, though, these parsing methods don't have strong restrictions regarding path handling, for example trailing slash behaviour, protocol stripping, or character escaping for URIs. This usually makes it very easy to implement a new filesystem, but makes downstream consumption a bit harder. Early this year I was hoping to free up some time to work on standardized tests upstream (https://github.com/fsspec/filesystem_spec/pull/1567#issuecomment-2575901714), but it's been a busy year...

All that said, I think it should be possible to write a filesystem-independent path parser with a few configuration options, like allow_empty_parts (as in foo//bar), reserved_names for excluding . and .., trailing_slash behavior, etc. This is why the _flavour and _flavour_sources modules are still private. The plan is to tackle this for version 0.4.0+.
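To make the idea a bit more concrete, here is a minimal sketch of what such a configurable parser might look like. It is purely hypothetical and not the library's API: `ParserOptions` and `split_parts` are invented names, the field `keep_trailing_slash` is a guess at what "trailing_slash behavior" could mean, and collapsing `..` lexically is a simplification that pathlib itself avoids. The input is assumed to be the part of the path after any root or protocol has been removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParserOptions:
    """Hypothetical knobs, named after the options mentioned above."""
    allow_empty_parts: bool = False                     # keep "" in "foo//bar"?
    reserved_names: frozenset = frozenset({".", ".."})  # names with special meaning
    keep_trailing_slash: bool = False                   # assumed meaning of trailing_slash


def split_parts(path: str, options: ParserOptions = ParserOptions()) -> list[str]:
    """Split a root-less path string into parts according to `options`."""
    raw = path.split("/")
    has_trailing = len(raw) > 1 and raw[-1] == ""
    if has_trailing:
        raw = raw[:-1]
    parts: list[str] = []
    for part in raw:
        if part == "" and not options.allow_empty_parts:
            continue  # collapse the empty part in "foo//bar"
        if part == "." and "." in options.reserved_names:
            continue  # "." refers to the current position
        if part == ".." and ".." in options.reserved_names:
            if parts:
                parts.pop()  # simplification: resolve ".." lexically
            continue
        parts.append(part)
    if has_trailing and options.keep_trailing_slash:
        parts.append("")  # re-joining with "/" restores the trailing slash
    return parts


posix_like = split_parts("foo//bar/./baz/../qux")
object_store_like = split_parts(
    "foo//bar/../baz/",
    ParserOptions(allow_empty_parts=True, reserved_names=frozenset(),
                  keep_trailing_slash=True),
)
```

With the default options the result is `["foo", "bar", "qux"]`; with the object-store-like options every part is kept verbatim, so `"/".join(object_store_like)` gives back the original string.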

ap-- Dec 04 '25 17:12