filesystem_spec

pathlib.Path type

Open clbarnes opened this issue 5 years ago • 28 comments

pathlib is great for interacting with paths on the local file system. It's expressive, terse, ergonomic, and typed. I'd love to use fsspec going forward, but don't want to give up the power and convenience of pathlib.

Could AbstractFileSystems contain a factory method which produced a PurePath subclass with the correct separator, prefix etc.? This concrete class could also contain the methods implemented by pathlib.Path or it could just be passed back to the filesystem.

clbarnes avatar Oct 06 '20 11:10 clbarnes

> pathlib is great

(maybe my bias against pathlib keeps showing through)

> Could AbstractFileSystems contain

I am not opposed in principle, but I don't immediately see how this would work. It sounds like a pain to implement?

martindurant avatar Oct 06 '20 13:10 martindurant

Agreed that it may be a pain to implement all of the FS-touching operations present in Path. The instance would need to keep around a reference to the creating FileSystem which wouldn't be very tidy, and a lot of the methods wouldn't abstract well over other FSs.

However, the API of PurePath (which basically handles splitting on separators, relative paths and so on) is not too bad and IMO would add some ergonomics and type helpfulness. AbstractFileSystems could have a PurePath inner class which inherits from an fsspec.PurePath and basically just allows implementors to set their separators, disallowed characters and so on. You'd then instantiate it with my_fs.PurePath() and could write type annotations with MyFileSystem.PurePath. Downstream users could thus abstract further over fsspec implementations: they wouldn't need to know a filesystem's separators to split out path components and file extensions, they would just use the PurePath API as they're used to doing on their own filesystem.
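
To make the idea concrete, here is a minimal sketch of a pure path type whose separator is set per filesystem. It is only an illustration: fsspec has no such class, and `SketchPurePath` and `BackslashPurePath` are hypothetical names. It deliberately avoids subclassing pathlib.PurePath, and it does no I/O.

```python
class SketchPurePath:
    """Hypothetical pure path: string manipulation only, no I/O."""

    sep = "/"  # subclasses override this per filesystem

    def __init__(self, raw: str):
        self.raw = raw

    @property
    def parts(self) -> tuple:
        # Split on this filesystem's separator, dropping empty segments
        return tuple(p for p in self.raw.split(self.sep) if p)

    @property
    def name(self) -> str:
        parts = self.parts
        return parts[-1] if parts else ""

    @property
    def suffix(self) -> str:
        # Text from the last dot of the final component, unless the dot is leading
        dot = self.name.rfind(".")
        return self.name[dot:] if dot > 0 else ""

    def __truediv__(self, other: str) -> "SketchPurePath":
        joined = self.raw.rstrip(self.sep) + self.sep + other.strip(self.sep)
        return type(self)(joined)

    def __repr__(self) -> str:
        return f"{type(self).__name__}({self.raw!r})"


class BackslashPurePath(SketchPurePath):
    """Hypothetical variant for a filesystem that separates with backslashes."""

    sep = "\\"


p = SketchPurePath("bucket/key") / "file.txt"
q = BackslashPurePath("share\\dir") / "a.csv"
print(p.parts, p.suffix)
print(q.parts, q.name)
```

Downstream code can then ask for `parts`, `name` or `suffix` without knowing which separator the filesystem uses.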

clbarnes avatar Oct 06 '20 13:10 clbarnes

Is this something you are proposing to take on? :)

martindurant avatar Oct 06 '20 13:10 martindurant

If anyone else thinks it would be valuable and it would make it through review, I may be able to find time!

clbarnes avatar Oct 06 '20 13:10 clbarnes

> If anyone else thinks it would be valuable

Let's see if other people chime in. pathlib has certainly been mentioned before in the issues.

martindurant avatar Oct 06 '20 13:10 martindurant

Hello! At Data Revenue we have crafted a drfs package that does more or less what you described.

Example:

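```python
from drfs import DRPath

# Build a remote path and extend it with "/" as you would with pathlib
dir_path = DRPath("s3://bucket/key")
file_path = dir_path / "file.txt"
# Open the remote file for reading in binary mode
with file_path.open("rb") as f:
    f.read()
```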

Another example (presenting another functionality of the package) is here.

It is an internal package and was open-sourced just yesterday (actually because of this thread), so I realise that the documentation is not perfect and there are bugs here and there. The package will probably be refactored in the near future (possibly as you described, by integrating with fsspec), but maybe it can already be of use to you. :)

michcio1234 avatar Oct 07 '20 07:10 michcio1234

I stumbled across this yesterday: https://github.com/drivendataorg/cloudpathlib. It could be used as inspiration here.

raybellwaves avatar Oct 08 '20 19:10 raybellwaves

Hurray for the extra functionality, @michcio1234! We would gladly see something like this included directly in fsspec, or indeed as a separate package.

@raybellwaves, I will post an issue on that repo to see if they are interested in contributing, or as inspiration, as you say.

martindurant avatar Oct 08 '20 19:10 martindurant

@michcio1234, I'd also note that your second example starts to look something like an intake catalog spec.

martindurant avatar Oct 08 '20 19:10 martindurant

I would also love to have a pathlib.Path-like object for remote paths directly in fsspec. I would prefer not to rely on yet another of the libraries mentioned here for this. And pathlib.PurePath does not work because it collapses the // after the protocol name into a single /:

```python
import pathlib

# pathlib treats "s3:" as an ordinary directory name and drops the empty segment after it
my_path = pathlib.Path("s3://my-bucket/experiment-1") / "results.csv"
my_path
```

returns:

```
PosixPath('s3:/my-bucket/experiment-1/results.csv')
```

When used with fsspec, the filesystem selected is the local one:

```python
import fsspec

with fsspec.open(my_path, "w") as f:
    f.write("my results")
```

Locally a folder s3: is created.
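
One way to sidestep this with only the standard library, sketched under the assumption that the URL has the form `protocol://path`: split off the protocol first, apply the pathlib logic to the remainder, then put the protocol back. `join_url` is a hypothetical helper, not part of pathlib or fsspec.

```python
from pathlib import PurePosixPath


def join_url(url: str, *segments: str) -> str:
    """Hypothetical helper: join segments onto a URL without losing '://'."""
    protocol, sep, rest = url.partition("://")
    if not sep:
        # No protocol present: treat the whole string as a plain path
        return str(PurePosixPath(url, *segments))
    # Only the part after the protocol goes through pathlib
    return f"{protocol}://{PurePosixPath(rest, *segments)}"


# The naive approach collapses "//" into "/"
mangled = str(PurePosixPath("s3://my-bucket/experiment-1") / "results.csv")
# The helper keeps the protocol intact
kept = join_url("s3://my-bucket/experiment-1", "results.csv")
print(mangled)
print(kept)
```

The resulting string still starts with `s3://`, so code that picks a filesystem from the protocol sees it as a remote path rather than a local one.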

hadim avatar Nov 10 '20 20:11 hadim

> Locally a folder s3: is created.

Ooops...

Again, I'd be really happy for someone to contribute this, but it likely won't be me.

martindurant avatar Nov 10 '20 20:11 martindurant

This is something I've been thinking about and wanting too. The particular advantage is that it then allows you to use fsspec with (almost) any function that expects a pathlib.Path object.

I think that it could be quite a simple class that acts as an adapter between the pathlib.Path API and the fsspec.AbstractFileSystem API.
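
A rough sketch of what such an adapter might look like. All names here are hypothetical: `FsPath` is not an fsspec class, and `DictFS` is a tiny in-memory stand-in that exposes only `exists()` and `open()`, the two filesystem methods the adapter relies on, so that the example runs without any backend.

```python
import io
import posixpath


class FsPath:
    """Hypothetical adapter: pathlib-style methods delegated to a filesystem object."""

    def __init__(self, fs, path: str):
        self.fs = fs      # any object with exists(path) and open(path, mode)
        self.path = path  # path string as that filesystem understands it

    def __truediv__(self, other: str) -> "FsPath":
        return FsPath(self.fs, posixpath.join(self.path, other))

    @property
    def name(self) -> str:
        return posixpath.basename(self.path)

    def exists(self) -> bool:
        return self.fs.exists(self.path)

    def open(self, mode: str = "rb"):
        return self.fs.open(self.path, mode)

    def read_bytes(self) -> bytes:
        with self.open("rb") as f:
            return f.read()


class DictFS:
    """Stand-in filesystem backed by a dict, for demonstration only."""

    def __init__(self, files):
        self.files = dict(files)

    def exists(self, path):
        return path in self.files

    def open(self, path, mode="rb"):
        return io.BytesIO(self.files[path])


fs = DictFS({"bucket/key/file.txt": b"hello"})
file_path = FsPath(fs, "bucket/key") / "file.txt"
print(file_path.name, file_path.exists(), file_path.read_bytes())
```

The adapter has to keep a reference to the filesystem that created it, and every pathlib.Path method that touches the filesystem would need a similar delegation.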

jamesmyatt avatar Nov 16 '20 12:11 jamesmyatt

> The particular advantage is that it then allows you to use fsspec with (almost) any function that expects a pathlib.Path object.

Agree. I use pathlib everywhere but recently had to revert my code back because I also wanted to support fsspec paths...

hadim avatar Nov 16 '20 12:11 hadim

One more vote here; we have just started looking into this. One very simple approach we tried is just to implement the __fspath__ method, which already goes a long way.
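
As an illustration of the path protocol (a minimal sketch, with `RemotePath` being a hypothetical class name): any object that defines `__fspath__` is accepted by `os.fspath()` and by functions that call it internally.

```python
import os


class RemotePath:
    """Hypothetical wrapper that implements the os.PathLike protocol."""

    def __init__(self, url: str):
        self.url = url

    def __fspath__(self) -> str:
        # Return the string form unchanged, so the protocol prefix is preserved
        return self.url


p = RemotePath("s3://bucket-data/setup.py")
as_text = os.fspath(p)
print(as_text)
```

Note that the built-in `open()` also calls `os.fspath()` and then treats the result as a local path, so the protocol on its own does not make a remote file readable.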

mraspaud avatar Nov 16 '20 21:11 mraspaud

Can you elaborate on fspath?

hadim avatar Nov 16 '20 21:11 hadim

My reading of fspath (with or without underscores) is that it implicitly applies to local paths. Currently, the local and smb implementations support fspath.

martindurant avatar Nov 16 '20 21:11 martindurant

fspath does not seem to solve the issue that the protocol is stripped out of the original path:

```python
import pathlib

p = pathlib.PurePath('s3://bucket-data/setup.py')
print(str(p), p.__fspath__())
```

```
s3:/bucket-data/setup.py s3:/bucket-data/setup.py
```

s3:// is replaced by s3:/.

One way would be to monkey-patch pathlib.PurePath, but monkey-patching the stdlib is obviously not a good idea.

hadim avatar Nov 16 '20 21:11 hadim

OK, sorry I wasn't clear. I just mean that supporting the path protocol (not the full pathlib.PurePath) helps us a lot already.

mraspaud avatar Nov 16 '20 22:11 mraspaud

It looks like someone is building a lib for this: https://github.com/Quansight/universal_pathlib

Friendly ping to @tonyfast and @andrewfulton9: would you consider contributing directly here instead of in an external library?

@martindurant any thoughts on this?

hadim avatar Jan 26 '21 12:01 hadim

Sure! Since it needs no further dependencies, it might well be hosted within fsspec - but maybe it doesn't matter. I'd be happy to put something into fsspec's docs pointing to the examples of upath for those looking for Path support, if the package is to remain separate. As far as I can see, upath has only been up for six days, so we'll see how it goes.

martindurant avatar Jan 26 '21 14:01 martindurant

I think https://github.com/Quansight/universal_pathlib is pretty much the kind of adapter I had in mind. It looks like it's far too early to include in fsspec now, but I think it ought to be integrated when it's more mature.

jamesmyatt avatar Jan 26 '21 14:01 jamesmyatt

Even if it's an early project, it would make sense to add it to fsspec early, but I guess it depends on what the dev wants to do here.

hadim avatar Jan 26 '21 14:01 hadim

Let's wait for the quansight people to comment :)

martindurant avatar Jan 26 '21 14:01 martindurant

@martindurant, agreed.

@hadim, if it were me, this early in a project, I'd want to be able to iterate faster than a more established project like fsspec allows.

jamesmyatt avatar Jan 26 '21 15:01 jamesmyatt

Thanks for pinging me. I'm glad to see the interest in upath from you all. I would lean towards waiting to move it into fsspec, for the reason @jamesmyatt mentions above. I built this package to use in internal client projects, and I am still coming across bugs in it pretty frequently, so being able to push up releases with bug fixes as soon as they are fixed is pretty valuable to me. Once it is more stable, though, I am definitely open to merging it into fsspec, if that still looks like the right path forward.

andrewfulton9 avatar Jan 26 '21 20:01 andrewfulton9

Do you think it's worth mentioning somewhere in the fsspec documentation yet?

martindurant avatar Jan 26 '21 20:01 martindurant

I think it is worth mentioning, maybe with the caveat that it's still in the early stage of development. So far my test suite covers most of the relevant pathlib.Path methods using PyarrowHDFS, S3FS, and LocalFileSystem (via a mock) as back-ends. I have also tried GithubFileSystem and MemoryFileSystem and they both seem to work as well. Any filesystem that follows similar path paradigms should work too, as far as I can tell. The HTTPFileSystem backend doesn't currently work, since its methods expect a full URL rather than just the path part of a URL. I am about to push up a fix for that this afternoon, though. I think it should be usable for most users, and extra exposure could help me move it along faster by increasing issues/contributions. This evening I can add more examples to the examples notebook, and I'll add more details to the README as well.

andrewfulton9 avatar Jan 26 '21 21:01 andrewfulton9

What is the progress? @andrewfulton9, there is no support for the Hugging Face file system 'hf'.

pbk0 avatar May 24 '24 04:05 pbk0

A Hugging Face implementation using the fsspec Hugging Face filesystem just landed in universal-pathlib>=0.3.5.

ap-- avatar Nov 10 '25 13:11 ap--

Can this issue be closed?

martindurant avatar Nov 12 '25 18:11 martindurant