cloudpathlib icon indicating copy to clipboard operation
cloudpathlib copied to clipboard

Hello from fsspec!

Open martindurant opened this issue 3 years ago • 18 comments

I see that we have some goals in common. fsspec is a mature library in use by Dask, pandas, zarr and others. However, we don't have a good implementation of pathlib on our API. Would you like to join forces to implement pathlib-on-fsspec, using the code you have already developed?

xref: https://github.com/intake/filesystem_spec/issues/434

another attempt to do this, explicitly for fsspec: https://github.com/datarevenue-berlin/drfs

martindurant avatar Oct 08 '20 19:10 martindurant

👋 @martindurant sounds cool!

There are probably a few level of integration that could work well. We intend to make it pretty easy to add custom clients/backends by implementing a relatively small surface area. We're adding more docs on this, but here's the boto3 one as an example. Implementing each of the functions there should make it simple to enable all of the pathlib API that we support:

https://github.com/drivendataorg/cloudpathlib/blob/8692c596d98292c4ca44fd2e654effb595db1427/cloudpathlib/s3/s3client.py

Option (1): I could see a FsspecClient as part of cloudpathlib that if you pip install cloudpathlib[fsspec] then it uses the fsspec client for any of the schemes that fsspec supports. There's probably some details in the coupling between the specific kind of path object we return (e.g., S3Path) and the Client object that would need to be worked out, but that all seems pretty doable.

Option (2): You could that same extension work for a custom Client and CloudPath object within fsspec and the expose the factory methods as you see fit in your API.

Other thoughts on interoperability? There's definitely a big chunk of cool things in fsspec that we don't want to duplicate (chunks/buffering, compressing cache, async, etc.).

pjbull avatar Oct 09 '20 02:10 pjbull

I would say those two options sound pretty similar, except for where the code lives - which I am not too worried about at all. You are probably more likely to get to it, if you have time to experiment - but seeing that this was already requested at fsspec, maybe someone else in the community contributes some pathlib interface anyway.

Do you think there's an argument for S3FileSystem, GCSFileSystem and FTPFileSystem from fsspec-land to supplant the implementations you have or are developing? They may be more complete and reduce the amount of work you need to do.

martindurant avatar Oct 09 '20 20:10 martindurant

@martindurant @pjbull

I took a quick stab at making FsspecClient and FsspecPath classes that is backed by an fsspec FileSystem in #109.

from cloudpathlib.fsspec import FsspecClient, FsspecPath
from s3fs import S3FileSystem

FsspecClient(filesystem=S3FileSystem()).set_as_default_client()

cp = FsspecPath("s3://drivendata-public-assets/")
list(cp.iterdir())[0:3]
#> [FsspecPath('s3://drivendata-public-assets/CIP-from-brochure.jpg'), FsspecPath('s3://drivendata-public-assets/EPD-report.png'), FsspecPath('s3://drivendata-public-assets/Ending-Extreme-Poverty_A-Focus-on-Children.png')]

I think the tricky part that I don't have a great sense of is how to design the right interface to let users choose to use fsspec and to initialize the correct fsspec FileSystem class with minimal friction. In cloudpathlib currently, we already have a way of specifying dependencies with extras_require and with class dispatching based on the prefix/protocol in the bath. Choosing to use fsspec and dispatching to the right fsspec filesystem, and having dependencies specified is like two more layers of user choices.

The question of moving our backend logic entirely to fsspec is a pretty big one and would involve a lot of changes. This is my first time using fsspec, so I still don't have a great feel for the fsspec design or ecosystem.

jayqi avatar Nov 16 '20 04:11 jayqi

Typical usage in fsspec allows access to the set of backend implementations using a protocol string and arguments, e.g., fsspec.filesystem("s3", anon=True), or by passing URLs including a prefix fsspec.open("s3://bucket/path/file", anon=True). (anon is just an example here, specific to s3)

Many packages that use fsspec, and some internal logic too, uses storage_options as a dict of kwargs to pass to the main entry functions, e.g., pd.read_csv("s3://bucket/path/file", storage_options={"anon": True}).

The current set of optional deps for fsspec are:

    extras_require={
        "abfs": ["adlfs"],
        "adl": ["adlfs"],
        "dask": ["dask", "distributed"],
        "dropbox": ["dropboxdrivefs", "requests", "dropbox"],
        "gcs": ["gcsfs"],
        "git": ["pygit2"],
        "github": ["requests"],
        "gs": ["gcsfs"],
        "hdfs": ["pyarrow"],
        "http": ["requests", "aiohttp"],
        "sftp": ["paramiko"],
        "s3": ["s3fs"],
        "smb": ["smbprotocol"],
        "ssh": ["paramiko"],
    },

but we don't depend on anything just to import the package, and some of the backends like file and memory. Sounds like we came up with a similar set of solutions!

Note that there are some implementations that are meant/able to be used with a further target filesystem, such as the local caches.

I'd be happy to talk about how fsspec works and describe our design/ecosystem.

martindurant avatar Nov 16 '20 14:11 martindurant

Hello there !

I would like to point out that the support of fsspec would be incredibly useful for me, as for now I need to cache my pandas and geopandas files before trying to open them.

Thanks a lot :)

remi-braun avatar Jun 30 '21 07:06 remi-braun

@remi-braun Thanks for the vote for fsspec! So that I understand your use case, is the reason that you want an fsspec implementation that you want to be able to do streaming read/write rather than downloading the whole file to a local cache first?

FYI, @jayqi did a good implementation of fsspec as a backend in #109. However, it's different from just a pathlib API for fsspec since it uses the cloudpathlib caching logic and business logic by implementing our *Client and *Path classes. To really take advantage of their approach to caching and reading/writing we'd likely need to re-implement a lot of the reading/writing (and even things like listing which can be cached in fsspec).

(Here's the issue just on streaming: https://github.com/drivendataorg/cloudpathlib/issues/9 )

pjbull avatar Jun 30 '21 15:06 pjbull

Arf sadly it is much more prosaic: my prime concern is to unify my code using agnostically files in the cloud or on disk and for now it is not possible :

path = AnyPath(path)

# Load vector in cache if needed (geopandas and cloudpathlib are not compatible for now)
if isinstance(path, CloudPath):
    path = AnyPath(path.fspath)

vect = geopandas.read_file(vect_path)

However I do not use files big enough to see any slowdown caching files in my usecase (apart from rasterio rasters, but they do not use fsspec if I'm correct)

remi-braun avatar Jun 30 '21 15:06 remi-braun

Ah, interesting. That looks to me like a geopandas feature request since they don't support os.PathLike. They do say they support filelike objects. You could try the following and see if it works. If they support filelike objects then I think it should be fine with CloudPath or Path objects:

path = AnyPath(path)

with path.open() as f:
    vect = geopandas.read_file(f)

Recent versions of pandas should work fine with CloudPath. For example:

In [1]: import pandas as pd

In [2]: from cloudpathlib import CloudPath

In [3]: df = pd.read_csv(CloudPath('s3://drivendata-public-assets/odsc-west-2019/california-tracts.csv'))

In [4]: df.head()
Out[4]:
        GEOID  year    name             parent-location  population  ...  eviction-rate  eviction-filing-rate  low-flag  imputed  subbed
0  6001400100  2000  4001.0  Alameda County, California     2497.87  ...           0.00                  0.00         1        0       0
1  6001400100  2001  4001.0  Alameda County, California     2497.87  ...           0.00                  0.86         1        0       0
2  6001400100  2002  4001.0  Alameda County, California     2497.87  ...           0.80                  0.80         1        0       0
3  6001400100  2003  4001.0  Alameda County, California     2497.87  ...           1.49                  1.49         1        0       0
4  6001400100  2004  4001.0  Alameda County, California     2497.87  ...           0.00                  0.00         1        0       0

[5 rows x 27 columns]

pjbull avatar Jun 30 '21 16:06 pjbull

I think the point of the discussion is that we could use cloudpath to have fsspec support generic pathlike objects without having to ask third-party libraries such as geopandas to do extra work.

martindurant avatar Jun 30 '21 16:06 martindurant

Ah, interesting. That looks to me like a geopandas feature request since they don't support os.PathLike. They do say they support filelike objects. You could try the following and see if it works. If they support filelike objects then I think it should be fine with CloudPath or Path objects:

path = AnyPath(path)

with path.open() as f:
    vect = geopandas.read_file(f)

Recent versions of pandas should work fine with CloudPath. For example:

In [1]: import pandas as pd

In [2]: from cloudpathlib import CloudPath

In [3]: df = pd.read_csv(CloudPath('s3://drivendata-public-assets/odsc-west-2019/california-tracts.csv'))

In [4]: df.head()
Out[4]:
        GEOID  year    name             parent-location  population  ...  eviction-rate  eviction-filing-rate  low-flag  imputed  subbed
0  6001400100  2000  4001.0  Alameda County, California     2497.87  ...           0.00                  0.00         1        0       0
1  6001400100  2001  4001.0  Alameda County, California     2497.87  ...           0.00                  0.86         1        0       0
2  6001400100  2002  4001.0  Alameda County, California     2497.87  ...           0.80                  0.80         1        0       0
3  6001400100  2003  4001.0  Alameda County, California     2497.87  ...           1.49                  1.49         1        0       0
4  6001400100  2004  4001.0  Alameda County, California     2497.87  ...           0.00                  0.00         1        0       0

[5 rows x 27 columns]

Sadly it is not the case, and I don't know why... I haven't investigated though, but my workaround is sufficient for now 😉

remi-braun avatar Jun 30 '21 16:06 remi-braun

Thanks, @martindurant. How would you use fsspec when opening a file with geopandas? I'm curious what fsspec implements that would make this work without having to explicitly download to a local file first and point to that local path.

(That said, I would encourage library maintainers to consider supporting os.PathLike since it is the official Python way to try to go from an object to a local file path.)

(Also, would be curious about your thoughts on testing #109 to see if it is viable! The fact it doesn't take advantage of all of the hard work you've done on caching, streaming, etc. does make it seem less appealing.)

pjbull avatar Jun 30 '21 16:06 pjbull

what fsspec implements that would make this work without having to explicitly download to a local file first and point to that local path

fsspec can do this job implicitly; but yes, there are libraries which, typically because of some inner C code, can only work with real OS file handles on the local filesystem.

it is the official Python way to try to go from an object to a local file path

This is an exaggeration! :)

On #109, I think it's totally the way forward. For configuring the relevant backends, you could rely on the fsspec config system, or maybe allow extra kwargs when instantiating a Path, which get passed to other downstream Paths.

martindurant avatar Jun 30 '21 16:06 martindurant

This is an exaggeration!

😆 certainly an exaggeration in terms of it actually getting implemented by libraries, but it is a PEP that was accepted for exactly this purpose! https://www.python.org/dev/peps/pep-0519/

pjbull avatar Jun 30 '21 16:06 pjbull

fsspec can do this job implicitly; but yes, there are libraries which, typically because of some inner C code, can only work with real OS file handles on the local filesystem.

Ok, I'll do some digging to check out the mechanisms you use for this. When we visited this question, implementing os.PathLike seemed to us the "right" answer to make this seamless, but maybe there are other approaches worth considering.

pjbull avatar Jun 30 '21 21:06 pjbull

implementing os.PathLike seemed to us the "right" answer

I'm sure both are useful! fsspec wants to provide full filesystem classes with all the functionality that implies, and several high-level features that you might find interesting, but not necessary for your core offering. To have those and a nice pathlib interface would be awesome!

(( For instance (I happen to be working on this today), ReferenceFileSystem lets you view blocks of bytes at arbitrary offsets within arbitrary URLs as files, so that a library like zarr (which uses file naming conventions to locate blocks of array data) can load data from third party HDF5, tiff and grib2 files whether or not the libraries made for those formats know how to handle remote files or not. ))

By the way, in my experience, most third party libraries work with file-like objects and paths. If a path-like is passed, they do str() at the earliest opportunity and end up treating it like a (local) path. Of course, I may well be biased in the set of packages I am exposed to - many of which know to defer opening paths to fsspec.

martindurant avatar Jul 01 '21 01:07 martindurant

Hi there! I just wanted to mention that I would benefit from support for pathlib with fsspec: https://github.com/dask/dask/issues/8006 Many thanks for any help! :)

asmith26 avatar Aug 16 '21 11:08 asmith26

Hi! Ditto @asmith26 - I'd love to see more compatibility between fsspec and cloudpathlib's wonderful pathlib.Path interface.

My use case is when using pandas. I mention this in https://github.com/drivendataorg/cloudpathlib/issues/128.

Related issues:

  • this project
    • using pd.HDFStore(cloud_path) https://github.com/drivendataorg/cloudpathlib/issues/240
    • another issue when calling pd.DataFrame.to_csv(cloud_path) https://github.com/drivendataorg/cloudpathlib/issues/242
  • other projects
    • zarr, and support for pathlib.Path https://github.com/dask/dask/issues/8006

grisaitis avatar Sep 15 '22 20:09 grisaitis

See also universal_pathlib, which I think is more recent than this discussion - although I am not in a position to compare the two projects.

martindurant avatar Sep 15 '22 20:09 martindurant

universal_pathlib seems nice, but I didn't find any documentation and the project seems (too) young.

As per @grisaitis, I would also love those two approaches to mix!

Other well-known projects supporting fsspec are rasterio and xarray. With geopandas, those libs are pillars to the remote-sensing opensource community .

remi-braun avatar Jan 19 '23 16:01 remi-braun

@remi-braun Are there specific issues that you're seeing? If there are bugs, it would be great to get them addressed. From an API perspective, both rasterio (as of this PR) and rioxarray seem to work with CloudPath objects already without going through fsspec.

from cloudpathlib import CloudPath

from matplotlib import pyplot as plt

import rasterio as rio
import rioxarray

cp = CloudPath("s3://drivendata-competition-biomassters-public-us/train_agbm/0003d2eb_agbm.tif")

# test rasterio directly
dataset = rio.open(cp)
plt.imshow(dataset.read(1), cmap='pink')
#> <matplotlib.image.AxesImage object at 0x1a2a333d0>
dataset.close()
#> None


# test rasterio + xarray
rds = rioxarray.open_rasterio(cp)
rds
#> <xarray.DataArray (band: 1, y: 256, x: 256)>
#> [65536 values with dtype=float32]
#> Coordinates:
#>   * band         (band) int64 1
#>   * x            (x) float64 0.5 1.5 2.5 3.5 4.5 ... 252.5 253.5 254.5 255.5
#>   * y            (y) float64 0.5 1.5 2.5 3.5 4.5 ... 252.5 253.5 254.5 255.5
#>     spatial_ref  int64 0
#> Attributes:
#>     scale_factor:  1.0
#>     add_offset:    0.0
image

I looked at the XArray docs, and there are a done of complex IO options. It looks like a number of the writing methods won't work since they don't support file-like objects. In this case, passing the str(CloudPath) off should hand it to their fsspec implementation. The method that does support file-like objects is engine="scipy" for netcdf. This has a bug in their implementation where close gets called multiple times, so we upload multiple times in quick succession. We have a small bug there to make our close idempotent. Are there other key scenarios in XArray to test/support?


That said, there still seems to be some confusion here about what we can implement within cloudpathlib. Unfortunately, the bulk of interoperability issues are outside of our control. I'll do my best to clear that up here and then close this issue.

First, cloudpathlib is already generally compatible with fsspec. Doing fs.open(str(cloudpath_object)) should work in most scenarios as long as the fsspec filesystem is configured properly. This lets you manipulate paths with cloudpathlib and read/write with fsspec. I wish you didn't have to call str for this, but to change that fsspec would need a simple shim to make the calling of str unncessary and do the dispatch/validation themselves.

Second, the general case for other libraries. cloudpathlib provides standard PathLike and file-like (through .open) objects for working with the contents of files. How other libraries decide to treat PathLike and file-like objects is out of our control. Third party libraries that support fsspec have their own code that they wrote to special case being passed strings like s3://.... There is no way for cloudpathlib to know a library has these special cases, and even if we did there would be no way for us to pretend to be a normal string to trigger this code. If you want to trigger other libraries' special-case code like this either calling str on the CloudPath or CloudPath.as_uri() is the most likely way to make this work. Calling str on a Path object should also work for most libraries, so you can still write CloudPath/Path agnostic code that should work with these libraries and take advantage of how they work with fsspec.

Third, developers of other libraries shouldn't need any special case code for CloudPaths to "just work" (as long as they handle PathLike and file-like objects normally). The one caveat is that per this PEP fspath can only return a string. Because of this, in writing scenarios code that uses CloudPaths developers should use CloudPath.open and hand off the file-like object to other libraries. Unfortunately, there is no feasible way to effectively detect if libraries use the built-in open to operate on a CloudPath.

Another separate question is if we will develop and maintain an fsspec provider like the PoC in #109. We don't have any plans to do so. Doing so would not address any of the considerations above. It would just mean we do provider communication through fsspec, but not that any third party library would magically do the right thing with the file-like/PathLike interfaces we provide. You would still need the str/.as_uri calls to get that library to run their fsspec code which is almost always based on interpreting strings. In addition to not addressing these concerns, implementing that backend will add significant maintenance burden that doesn't seem worth it until there is sufficient user demand for providers/functionality that we don't have implemented with the official SDKs. Maybe the best option is for an enterprising developor to take #109 and create a seprate project that supports an fsspec client. We already support and easy way to make extesions by creating and registering custom CloudPath implementations, that is a relatively small codebase. It's not well documented, but looking at any of the existing implementations contains everything necessary to build an extension to support a new provider. More docs coming soon. Again, worth reiterating that doing so would not mean that libraries that know about fsspec would then magically know what to do.

So with all of that said, there's nothing we can implement in cloudpathlib to make substantive progress here. The only thing that we can do is encourage developers of other libraries where things are broken to support best practices, which allow them to support CloudPaths. These include:

  1. Checking if objects passed to read/write functions are PathLike, and if so using what it gets from fspath to do the reading/writing.

  2. Ensure that functions that do reading and writing support file-like objects, not just strings as paths.

The other best practice to ensure everything works well is for consumers of cloudpathlib to use CloudPath.open in write scenarios and pass the file-like object to other libraries.

One last word here: there many be real bugs on our side that prevent third party libraries from working with CloudPaths. All of this discussion is about the design and intent, and we're happy to fix bugs that are discovered.

pjbull avatar Jan 22 '23 02:01 pjbull

Hello,

Just to answer, since this issue is closed. The main problem is the libs based on opening files with fsspec won't work properly with CloudPaths. They will call fspath to resolve it (instead of str()) and therefore the file called will be cached (this is not exaclty what we want, and this will be done silently)

For single files like tif this will be not optimal but will work (your example) But for nested files likes VRT, this will crash as only the header is cached and not the relative files. This will break the paths inside the VRT. Note that GDAL (and therefore rasterio) can read VRT stored on the cloud, and this is the expected behavior.

A little example:


vrt = AnyPath("https://s3.unistra.fr/merge_32-31.vrt")
rasterio.open(str(vrt)) # will work
rasterio.open(vrt) # will fail for the wrong reason, saying that the files linked inside the VRT don't exist

remi-braun avatar Jan 24 '23 14:01 remi-braun

Thanks @remi-braun, I opened a new issue to track that kind of scenario. fsspec integration isn't the root issue, so it merits its own conversation in #315.

pjbull avatar Jan 24 '23 17:01 pjbull