huggingface_hub icon indicating copy to clipboard operation
huggingface_hub copied to clipboard

Add interface to `fsspec`

Open mariosasko opened this issue 3 years ago • 15 comments

Adds an interface to fsspec to enable access to the Hub as if it were a local file system.

This implementation uses the following scheme to infer parameters required for the protocol class initialization from a remote URL: hf://{repo_id}[@{revision}]:/path_in_repo

For instance, this means one can simply push a Pandas dataframe to the Hub as follows:

df.to_csv("hf://repo_id:/path_in_repo", storage_options={"repo_type": "dataset"})

Things to discuss

  • if open's mode is "w", the current implementation doesn't create a repo to store the file if it doesn't already exist and throws an error instead. Let me know if you think it should create a repo.
  • also, feel free to suggest a better protocol scheme

Notes

  • this PR builds upon datasets' HfFileSystem (that version doesn't expose methods that can modify a Hub repo)
  • (for local testing) currently requires calling fsspec.register_implementation(HfFileSystem.protocol, HfFileSystem) to register the hf protocol in the local registry -> as soon as this PR is merged I'll add the protocol to fsspec's "official" registry to make this step unnecesarry

TODOs

  • add docs in a subsequent PR
  • add the hf protocol to fsspec's "official" registry

mariosasko avatar Jul 07 '22 13:07 mariosasko

The documentation is not available anymore as the PR was closed or merged.

@lhoestq Yes, I think having a dedicated package is also OK. huggingface_hub usually stores Hub integrations (and is well-established), so this is the reason why I put the code here. But as long as we register this implementation in the "official" registry, discoverability shouldn't be a problem anyways.

Still, I would like to hear what others think.

mariosasko avatar Jul 07 '22 15:07 mariosasko

cc @adrinjalali @osanseviero do you have an opinion about whether this class should be in huggingface_hub or in a separate package hffs ?

lhoestq avatar Jul 20 '22 13:07 lhoestq

If it's going to depend on huggingface_hub then I have no preference. But as a user, I'd expect a package like hffs to be or to depend on a package much smaller than hfh. So no strong preference here on my side.

adrinjalali avatar Jul 20 '22 14:07 adrinjalali

Given that it requires an extra dependency fsspec and for consistency with the other filesystems contributed by the community, I'd be down to create the hffs package

lhoestq avatar Jul 20 '22 15:07 lhoestq

no strong opinion either

Given that it's not a lot of code I think having it here would be fine, for simplicity purposes.

Does fsspec bring other transitive dependencies, or is it dependency-less?

julien-c avatar Jul 20 '22 15:07 julien-c

Does fsspec bring other transitive dependencies, or is it dependency-less?

It has no dependencies

lhoestq avatar Jul 20 '22 16:07 lhoestq

what do you think @osanseviero @Pierrci @SBrandeis?

julien-c avatar Jul 20 '22 16:07 julien-c

Can we publish a "subpart" of huggingface_hub as a package on PyPI? Maybe it makes sense to have two separate packages but to maintain them in the same repo (huggingface_hub)? I don't know if this is doable or desirable.

SBrandeis avatar Jul 21 '22 08:07 SBrandeis

Can we publish a "subpart" of huggingface_hub as a package on PyPI?

I've also argued in favor of this in the past, but folks have made the argument that it might not get enough attention if it's a small library. I like the idea of a huggingface_hub-core which only includes the REST wrappers and probably the git interfaces. If that happens, we can also re-work the git interface part.

adrinjalali avatar Jul 21 '22 08:07 adrinjalali

Do you expect users to use this feature (or hffs) directly, or for it to be used under the hood in libraries such as datasets?

We already have a similar class in datasets (read-only though) that we use to run glob patterns in dataset repositories.

I think the main usage is going to be

  1. for pandas/dask users to read/write to HF Hub (this also includes the datasets-server project, which is likely to use dask for distributed processing)
  2. for HF users training models and streaming data from the Hub (via datasets or others like torch datapipes)
  3. for HF power users who need to do advanced manipulations in repos

lhoestq avatar Jul 21 '22 10:07 lhoestq

Even though the implementation is pretty simple at the moment, it can quickly become more complex if we decide to override some more fsspec methods, add support for async IO, etc., so I think it's OK to have a separate repo to avoid unnecessary clutter in this repo (I'm also not a fan of having multiple packages/submodules in this repo tbh) and to follow fsspec's convention.

mariosasko avatar Jul 21 '22 12:07 mariosasko

Maybe it makes sense to have two separate packages but to maintain them in the same repo (huggingface_hub)? I don't know if this is doable or desirable.

I would avoid two libraries in this same repo as this has led to confusion in the past + maintenance becomes a bit trickier.

I don't have a strong opinion between having the code within huggingface_hub library or if it should be consistent with rest of ecosystem hffs, but hffs makes a bit more sense I think.

osanseviero avatar Jul 21 '22 13:07 osanseviero

@lhoestq I've made some changes (namely the use of self.dircache for caching) to be more consistent with the existing implementations. Let me know what you think.

mariosasko avatar Jul 25 '22 15:07 mariosasko

created the repo here: https://github.com/huggingface/hffs

lhoestq avatar Jul 27 '22 16:07 lhoestq

Hey everyone 👋 It seems you all came to the conclusion to continue the work in the separate repo https://github.com/huggingface/hffs. Does that mean we can close this PR then ?

Btw I really like the df.to_csv("hf://repo_id:/path_in_repo") syntax ! It looks super cool 🔥

Wauplin avatar Aug 17 '22 13:08 Wauplin