zarr-python

Refactor storage around abstract file system?

Open rabernat opened this issue 7 years ago • 5 comments

We are recently seeing a lot of proposals for new storage classes in zarr (e.g. #299, #294, #293, #252). These are all great ideas. At the same time, we have several working storage layers (s3fs, gcsfs) that don't live inside zarr because they already provide a MutableMapping interface that zarr can talk to. The situation is fragmented, and we don't seem to have a clear roadmap for how to handle all these different scenarios. There is some relevant discussion in #290.

I recently learned about pyfilesystem: "PyFilesystem is a Python module that provides a common interface to any filesystem." The index of supported filesystems provides analogs for nearly all of the built-in zarr storage options, plus filesystem classes for cloud storage, FTP, Dropbox, etc.

Perhaps one path forward would be to refactor zarr's storage to use pyfilesystem objects. We would only really need a single storage class, which wraps a pyfilesystem object and provides the MutableMapping interface that zarr uses internally. Then we could remove the roughly 80% of storage.py that deals with listing directories, zip files, etc., since this would be handled by pyfilesystem.
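
To make the idea concrete, here is a minimal sketch of such a wrapper. The class name PyFilesystemStore is hypothetical, and the sketch assumes the PyFilesystem2 package (`fs`) is installed; it is an illustration of the approach, not an existing zarr class.

```python
from collections.abc import MutableMapping

import fs          # PyFilesystem2
import fs.errors
import fs.path


class PyFilesystemStore(MutableMapping):
    """Hypothetical wrapper exposing any PyFilesystem2 filesystem as a zarr store."""

    def __init__(self, fs_url):
        # e.g. "osfs://~/data", "mem://", "ftp://host/path"
        self._fs = fs.open_fs(fs_url)

    def __getitem__(self, key):
        try:
            return self._fs.readbytes(key)
        except fs.errors.ResourceNotFound:
            raise KeyError(key)

    def __setitem__(self, key, value):
        parent = fs.path.dirname(key)
        if parent:
            self._fs.makedirs(parent, recreate=True)
        self._fs.writebytes(key, bytes(value))

    def __delitem__(self, key):
        try:
            self._fs.remove(key)
        except fs.errors.ResourceNotFound:
            raise KeyError(key)

    def __iter__(self):
        for path in self._fs.walk.files():
            yield path.lstrip("/")

    def __len__(self):
        return sum(1 for _ in self)
```

Because zarr only needs the mapping protocol, such a store could be passed straight to zarr (e.g. `zarr.open(store)`) regardless of which PyFilesystem backend sits underneath.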

Once we had a generic filesystem, we could then create a Layout layer, which describes how the zarr objects are laid out within the filesystem. For example, today we already have two de facto layouts: DirectoryStore and NestedDirectoryStore. We could consider others, such as one with all the metadata in a single file (e.g. #294). The Layout and the Filesystem could be independent of one another.
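
As a rough illustration of what a Layout might look like, here is a hedged sketch: the class names and the key_to_path method are invented for this example and are not an existing zarr API. It shows how the current flat and nested conventions could become interchangeable policies.

```python
class FlatLayout:
    """DirectoryStore-style: chunk key 'foo/bar/0.0' maps to path 'foo/bar/0.0'."""

    def key_to_path(self, key):
        return key


class NestedLayout:
    """NestedDirectoryStore-style: chunk key 'foo/bar/0.0' becomes 'foo/bar/0/0'."""

    def key_to_path(self, key):
        prefix, _, last = key.rpartition("/")
        # Only chunk coordinates (e.g. "0.0") are nested; metadata keys
        # such as ".zarray" are left untouched.
        if "." in last and last.replace(".", "").isdigit():
            last = last.replace(".", "/")
        return f"{prefix}/{last}" if prefix else last
```

A store would then compose a Layout with a Filesystem: the Layout decides the path for a key, and the Filesystem reads or writes the bytes at that path.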

For new storage layers like MongoDB, Redis, etc., we would basically just say, "go implement a pyfilesystem for that". This has the advantages of

  • reducing the maintenance burden in zarr
  • providing more general filesystem objects (that can also be used outside of zarr)

The only con I can think of is performance: it is possible that the pyfilesystem implementations could have worse performance than the zarr built-in ones. But this cuts both ways: they could also perform better!

I know this implies a fairly big refactor of zarr. But it could save us lots of headaches in the long run.

rabernat avatar Sep 23 '18 18:09 rabernat

Thanks Ryan, pyfilesystem certainly looks like something we should investigate. I am open to this general approach, provided the underlying libraries are well maintained and performant, and we can still optimise for zarr usage patterns if/where needed.

Just to note here that this approach is basically what @martindurant has been arguing for, albeit with the underlying filesystem abstraction and implementations being different. @martindurant what's your view of this?

FWIW I think it will take some time to get enough experience to make a firm decision in this direction, so I think we should be prepared to live with a mixture of approaches and some duplication of effort for a while. Obviously in the long run we should aim to consolidate efforts and remove redundancy as much as possible.

Also, various people (including me) have found it pleasantly straightforward to implement the MutableMapping interface directly for a new storage backend, so we shouldn't ignore those positive experiences. Maybe implementing the pyfilesystem API is similarly straightforward; I don't have the experience to say.
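
For a sense of how little is involved in the direct approach, here is a hedged sketch of a Redis-backed store. It assumes the third-party `redis` package and is illustrative only; it is not zarr's own Redis support.

```python
from collections.abc import MutableMapping

import redis  # third-party redis client, assumed installed


class SimpleRedisStore(MutableMapping):
    """Illustrative store keeping zarr keys and chunk bytes under a prefix in Redis."""

    def __init__(self, prefix="zarr:", **redis_kwargs):
        self._r = redis.Redis(**redis_kwargs)
        self._prefix = prefix

    def _full_key(self, key):
        return self._prefix + key

    def __getitem__(self, key):
        value = self._r.get(self._full_key(key))
        if value is None:
            raise KeyError(key)
        return value

    def __setitem__(self, key, value):
        self._r.set(self._full_key(key), bytes(value))

    def __delitem__(self, key):
        if not self._r.delete(self._full_key(key)):
            raise KeyError(key)

    def __iter__(self):
        for raw in self._r.scan_iter(match=self._prefix + "*"):
            yield raw.decode()[len(self._prefix):]

    def __len__(self):
        return sum(1 for _ in self)
```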

alimanfoo avatar Sep 23 '18 19:09 alimanfoo

Yes, @martindurant's filesystem_spec is what turned me on to pyfilesystem! (They are discussing the similarities here: https://github.com/martindurant/filesystem_spec/issues/5)

I don't particularly care which abstract filesystem we pick--it's the principle of outsourcing this functionality to some other, more general software layer. pyfilesystem appears to be pretty mature. But of course I defer to @martindurant's recommendations--he is the real expert on this stuff!

rabernat avatar Sep 23 '18 19:09 rabernat

I am, naturally, not an unbiased observer here. Firstly, let me say that my fsspec is an aspirational project without any users as things stand, whereas pyfilesystem is established and used by some people. However, I did consider building within the pyfilesystem framework and discussed some of what I see as its shortfalls with its developers, but did not arrive at a satisfactory solution. Note that, although their interfaces to things like Dropbox are interesting, I would say the only "cloud" interface they have is S3, and it works by downloading whole files rather than giving you random access to just the chunks you need (please, someone correct me if I am wrong).

The motivation for fsspec came out of the similarity between the projects I had been involved in (s3fs, gcsfs, adlfs, hdfs3) and the need for an abstraction across them in the context of dask. It is important, for instance, that file-system objects be serialisable, so that they can be passed between client and workers; I also wrote MutableMapping interfaces and FUSE backends. These projects had similar, but not identical, APIs, and a certain amount of shim code was required, which ended up within dask, as well as code interfacing to arrow's file-systems. For the latter, only JNI hdfs is of note here, although arrow has its own concept of a file-system class hierarchy and a local-files implementation. In any case, such code doesn't really belong in dask and is generally useful: for example, the laziness of an OpenFile, which can give a text interface to remote compressed data. It would also be in dask's interest not to have to write and maintain such code.
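
For readers arriving later: these ideas are visible in the fsspec API as it eventually shipped. A brief hedged sketch, where the bucket and object names are made up and the `s3fs` package is assumed to be installed for the s3 protocol:

```python
import fsspec

# A lazy OpenFile: nothing is fetched until the context is entered, and the
# object itself is picklable, so it can be shipped between a dask client and workers.
# ("example-bucket" and the object path are hypothetical.)
of = fsspec.open("s3://example-bucket/logs.csv.gz", "rt", compression="gzip")
with of as f:
    header = f.readline()

# A key/value view over a prefix: the MutableMapping interface zarr can consume directly.
mapper = fsspec.get_mapper("s3://example-bucket/data.zarr")
```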

All that being said, we are all, I think, pragmatists with limited resources. I can also try to help leverage pyfilesystem, if people prefer that route.

I, of course, like my design and am prepared to defend certain decisions, but it is not useful unless all the backends of interest conform to it and, ideally, it absorbs the non-dask-specific file-handling code from dask. As you can see in the code, I have made an effort to meet multiple standards such as the Python stdlib and POSIX naming schemes, to provide walk and glob for any backend that can do ls, and to support "transactional" operations (files are all moved to their destination or made concrete only when the transaction is done, or discarded if it is cancelled).
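
A short hedged sketch of those last two points using present-day fsspec (the paths are illustrative; the local filesystem is used so the snippet runs without credentials):

```python
import fsspec

fs = fsspec.filesystem("file")               # any backend that can do ls
fs.makedirs("/tmp/data", exist_ok=True)      # illustrative directory
matches = fs.glob("/tmp/data/**/*.json")     # glob and walk are built on top of ls

# Transactional writes: the file only becomes visible at its final path
# if the with-block completes; otherwise the pending write is discarded.
with fs.transaction:
    with fs.open("/tmp/data/output.json", "wb") as f:
        f.write(b"{}")
```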

martindurant avatar Sep 23 '18 23:09 martindurant

@martindurant -- thanks for the clarifications. I misunderstood the thread over in filesystem_spec discussing the relationship with pyfilesystem. I thought the two were more similar than they really are, and that compatibility of APIs was on the horizon. (I only just discovered pyfilesystem and clearly do not understand it well.) I have changed the name of this issue to reflect the fact that we are talking generically about some sort of filesystem abstraction.

I appreciate all the work you have put into your cloud storage classes. They are excellent and very useful for zarr. It would be great to build on that success and factor more of the filesystem "details" out of zarr itself.

rabernat avatar Sep 24 '18 00:09 rabernat

So looking back: the goal of this issue would be roughly equivalent to making FSStore the default?
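
For concreteness, this is roughly what that looks like with zarr 2.x and fsspec installed (the bucket URL is hypothetical):

```python
import zarr
from zarr.storage import FSStore

# FSStore accepts any fsspec URL: local paths, s3://, gcs://, etc.
# ("example-bucket" is a made-up name.)
store = FSStore("s3://example-bucket/data.zarr", mode="r")
group = zarr.open_group(store, mode="r")

# Equivalently, zarr can build the FSStore itself from a URL:
group = zarr.open("s3://example-bucket/data.zarr", mode="r")
```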

joshmoore avatar Sep 22 '21 15:09 joshmoore