huggingface_hub
huggingface_hub copied to clipboard
Add interface to `fsspec`
Adds an interface to fsspec to enable access to the Hub as if it were a local file system.
This implementation uses the following scheme to infer parameters required for the protocol class initialization from a remote URL:
hf://{repo_id}[@{revision}]:/path_in_repo
For instance, this means one can simply push a Pandas dataframe to the Hub as follows:
df.to_csv("hf://repo_id:/path_in_repo", storage_options={"repo_type": "dataset"})
Things to discuss
- if
open's mode is"w", the current implementation doesn't create a repo to store the file if it doesn't already exist and throws an error instead. Let me know if you think it should create a repo. - also, feel free to suggest a better protocol scheme
Notes
- this PR builds upon
datasets'HfFileSystem(that version doesn't expose methods that can modify a Hub repo) - (for local testing) currently requires calling
fsspec.register_implementation(HfFileSystem.protocol, HfFileSystem)to register thehfprotocol in the local registry -> as soon as this PR is merged I'll add the protocol tofsspec's "official" registry to make this step unnecesarry
TODOs
- add docs in a subsequent PR
- add the
hfprotocol tofsspec's "official" registry
The documentation is not available anymore as the PR was closed or merged.
@lhoestq Yes, I think having a dedicated package is also OK. huggingface_hub usually stores Hub integrations (and is well-established), so this is the reason why I put the code here. But as long as we register this implementation in the "official" registry, discoverability shouldn't be a problem anyways.
Still, I would like to hear what others think.
cc @adrinjalali @osanseviero do you have an opinion about whether this class should be in huggingface_hub or in a separate package hffs ?
If it's going to depend on huggingface_hub then I have no preference. But as a user, I'd expect a package like hffs to be or to depend on a package much smaller than hfh. So no strong preference here on my side.
Given that it requires an extra dependency fsspec and for consistency with the other filesystems contributed by the community, I'd be down to create the hffs package
no strong opinion either
Given that it's not a lot of code I think having it here would be fine, for simplicity purposes.
Does fsspec bring other transitive dependencies, or is it dependency-less?
Does fsspec bring other transitive dependencies, or is it dependency-less?
It has no dependencies
what do you think @osanseviero @Pierrci @SBrandeis?
Can we publish a "subpart" of huggingface_hub as a package on PyPI?
Maybe it makes sense to have two separate packages but to maintain them in the same repo (huggingface_hub)?
I don't know if this is doable or desirable.
Can we publish a "subpart" of
huggingface_hubas a package on PyPI?
I've also argued in favor of this in the past, but folks have made the argument that it might not get enough attention if it's a small library. I like the idea of a huggingface_hub-core which only includes the REST wrappers and probably the git interfaces. If that happens, we can also re-work the git interface part.
Do you expect users to use this feature (or hffs) directly, or for it to be used under the hood in libraries such as datasets?
We already have a similar class in datasets (read-only though) that we use to run glob patterns in dataset repositories.
I think the main usage is going to be
- for pandas/dask users to read/write to HF Hub (this also includes the datasets-server project, which is likely to use dask for distributed processing)
- for HF users training models and streaming data from the Hub (via
datasetsor others like torch datapipes) - for HF power users who need to do advanced manipulations in repos
Even though the implementation is pretty simple at the moment, it can quickly become more complex if we decide to override some more fsspec methods, add support for async IO, etc., so I think it's OK to have a separate repo to avoid unnecessary clutter in this repo (I'm also not a fan of having multiple packages/submodules in this repo tbh) and to follow fsspec's convention.
Maybe it makes sense to have two separate packages but to maintain them in the same repo (huggingface_hub)? I don't know if this is doable or desirable.
I would avoid two libraries in this same repo as this has led to confusion in the past + maintenance becomes a bit trickier.
I don't have a strong opinion between having the code within huggingface_hub library or if it should be consistent with rest of ecosystem hffs, but hffs makes a bit more sense I think.
@lhoestq I've made some changes (namely the use of self.dircache for caching) to be more consistent with the existing implementations. Let me know what you think.
created the repo here: https://github.com/huggingface/hffs
Hey everyone 👋 It seems you all came to the conclusion to continue the work in the separate repo https://github.com/huggingface/hffs. Does that mean we can close this PR then ?
Btw I really like the df.to_csv("hf://repo_id:/path_in_repo") syntax ! It looks super cool 🔥