datajoint-python icon indicating copy to clipboard operation
datajoint-python copied to clipboard

Add fsspec as alternate file protocol

Open chrisroat opened this issue 4 years ago • 15 comments
trafficstars

In principle, fsspec can completely subsume the current protocols (file and s3) and can provide many more. However, since it is a new integration and getting credentialling correct can be difficult, I implemented it as an alternate protocol which people can migrate to.

Some s3fs and gcsfs dependencies (installed via fsspec[s3,gcs]) do not have wheels for alpine, so they need to be built during startup. To correctly run the tests, the following extra install of dev packages is needed:

apk add musl-dev

Addresses #911

chrisroat avatar Aug 22 '21 22:08 chrisroat

The CI failures are due to problems building wheels for dependencies of s3fs and gcsfs. I believe this will require updates to the datajoint/pydev image to include additional dev packages.

chrisroat avatar Aug 22 '21 22:08 chrisroat

As per my comments above, I believe these failures need apk add musl-dev in datajoint/pydev in order to succeed. If you'd like to move forward with this, we can work on making that update and resolving conflicts.

chrisroat avatar Sep 14 '21 19:09 chrisroat

@chrisroat Thanks for the PR and good stuff here man! :rocket:

We are currently playing a bit of some catch-up on other regression issues and new team members (they've recently patched 2 issues you opened!) but we haven't forgotten about this PR. We may be a bit slow in our response/review while adjusting to this new cycle so we appreciate your patience with this.

Regarding the issues of the docker images here, we are also currently in the process of upgrading it to a new image stack so we will likely not continue using datajoint/pydev in favor of datajoint/djtest. A PR for that update will soon be coming addressing that along with fixing the broken datajoint/datajoint image. Regarding the external storage support for fsspec, is this purely a Python dependency or are other system dependencies involved?

guzman-raphael avatar Sep 15 '21 15:09 guzman-raphael

Yes, fsspec is python only (AFAIK). It actually doesn't seem to have any dependencies, as it is really just a specification.

There can be extra packages, depending on which underlying filesystem is being used (i.e. gcsfs for GCS, s3fs for S3) -- and those could have additional dependencies. Datajoint could include some of those packages/dependencies, or leave it up to the user to install what they need.

I'd personally lean toward having the user install what they need, but the fact that datajoint currently brings in S3 dependencies may have already set a different precedent.

chrisroat avatar Sep 15 '21 17:09 chrisroat

Is there still interest in this?

chrisroat avatar Jan 02 '22 04:01 chrisroat

I want to use my dropbox business account as external drive location to my datajoint schema. Since dropbox is not S3, this option is currently not available - but, as I understand, this PR + some additions (https://github.com/fsspec/dropboxdrivefs), would enable it? I'd be grateful for any tips.

horsto avatar Dec 27 '22 20:12 horsto

Ok, let's test and merge this and test with Dropbox. Thank you for finding this @horsto and thank you for your patience @chrisroat.

dimitri-yatsenko avatar Dec 27 '22 21:12 dimitri-yatsenko

@chrisroat DataJoint core team has opened some bandwidth and we are finally coming back to review the outstanding PRs that are in the queue. Thinking back on your contribution, I really like the idea of conforming to a single interface to simplify how we add support for new external storage options.

On that note, why did you choose fsspec as opposed to FUSE-based approaches? Is there an additional benefit to using the specification as you proposed?

Some additional context, we've now had much more experience using s3fs-fuse and it is demonstrating amazing performance! We've been mounting S3 BLOBs (>150GB) and accessing the files as if they were local from Python with read/write speeds consistent with online speedtests (900 Mbps in our HQ).

FUSE projects have been around for quite some time and the above in particular has: caching features, strong community acceptance, compatibility for STS, and more. Perhaps the common interface is simply treating objects like a normal file and letting FUSE do the rest? :shrug:

Curious on your thoughts.

guzman-raphael avatar Feb 12 '23 05:02 guzman-raphael

rclone seems like a very promising one as well.

guzman-raphael avatar Feb 12 '23 15:02 guzman-raphael

You can use fsspec with fuse-mounted filesystems, so I don't think they are competitors in that way. See the docs: https://filesystem-spec.readthedocs.io/en/latest/features.html#mount-anything-with-fuse

I think fsspec is heavily used in the pydata ecosystem (i.e. pandas, dask), so it is a natural to support it. You'll benefit from improvements and interoperability.

Regarding rclone, I note that someone even tried to make an fsspec implementation for it (it doesn't look supported, though): https://pypi.org/project/fsspec-rclone/

chrisroat avatar Feb 12 '23 16:02 chrisroat