huggingface_hub icon indicating copy to clipboard operation
huggingface_hub copied to clipboard

[RFC] Proposal for a way to cache files in downstream libraries

Open Wauplin opened this issue 3 years ago • 7 comments

This is a proposal following discussions started with the datasets team (@lhoestq @albertvillanova).

The goal is to have a proper way to cache any kind of files from a downstream library and manage them (e.g.: scan and delete) from huggingface_hub. From hfh's perspective, there is not much work to do. We should have a canonical procedure to generate cache paths for a library. Then within a cache folder, the downstream library handles its files as it wants. Once this helper starts to be used, we can adapt the scan-cache and delete-cache commands.

I tried to document the extra_cache_folder() helper to describe the way I see it. Any feedback is welcomed, this is really just a proposal. All the examples are very datasets-focused but I think this could benefit to other libraries as transformers (@sgugger @LysandreJik ), diffusers (@apolinario @patrickvonplaten) or skops (@adrinjalali @merveenoyan) to store any kind of intermediate files. IMO the difficulty mainly resides in making the feature used :smile:.

EDIT: see generated documentation here.

WDYT ?

(cc @julien-c @osanseviero as well)

Example:

>>> from huggingface_hub import extra_cache_folder

>>> extra_cache_folder(library_name="datasets", namespace="SQuAD", subfolder="download")
PosixPath('/home/wauplin/.cache/huggingface/extra/datasets/SQuAD/download')

>>> extra_cache_folder(library_name="datasets", namespace="SQuAD", subfolder="extracted")
PosixPath('/home/wauplin/.cache/huggingface/extra/datasets/SQuAD/extracted')

>>> extra_cache_folder(library_name="datasets", namespace="SQuAD")
PosixPath('/home/wauplin/.cache/huggingface/extra/datasets/SQuAD/default')

>>> extra_cache_folder(library_name="datasets", subfolder="modules")
PosixPath('/home/wauplin/.cache/huggingface/extra/datasets/default/modules')

>>> extra_cache_folder(library_name="datasets", cache_dir="/tmp/tmp123456")
PosixPath('/tmp/tmp123456/datasets/default/default')

And the generated tree:

    extra/
    ├── datasets/
    │   ├── default/
    │   │   ├── modules/
    │   ├── SQuAD/
    │   │   ├── downloaded/
    │   │   ├── extracted/
    │   │   └── processed/
    │   ├── Helsinki-NLP--tatoeba_mt/
    │       ├── downloaded/
    │       ├── extracted/
    │       └── processed/
    └── transformers/
        ├── default/
        │   ├── something/
        ├── bert-base-cased/
        │   ├── default/
        │   └── training/
    hub/
    └── models--julien-c--EsperBERTo-small/
        ├── blobs/
        │   ├── (...)
        │   ├── (...)
        ├── refs/
        │   └── (...)
        └── [ 128]  snapshots/
            ├── 2439f60ef33a0d46d85da5001d52aeda5b00ce9f/
            │   ├── (...)
            └── bbc77c8132af1cc5cf678da3f1ddf2de43606d48/
                └── (...)

Wauplin avatar Sep 27 '22 14:09 Wauplin

The documentation is not available anymore as the PR was closed or merged.

Codecov Report

Base: 84.91% // Head: 84.99% // Increases project coverage by +0.07% :tada:

Coverage data is based on head (1185dbe) compared to base (4bd9ebf). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1088      +/-   ##
==========================================
+ Coverage   84.91%   84.99%   +0.07%     
==========================================
  Files          38       39       +1     
  Lines        3931     3951      +20     
==========================================
+ Hits         3338     3358      +20     
  Misses        593      593              
Impacted Files Coverage Δ
src/huggingface_hub/constants.py 91.48% <100.00%> (+0.37%) :arrow_up:
src/huggingface_hub/utils/__init__.py 100.00% <100.00%> (ø)
src/huggingface_hub/utils/_cache_assets.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Sep 27 '22 14:09 codecov[bot]

Cool thanks !

Can it even be possible to have the "datasets" folder one level higher ?

    datasets/
    └── squad/
        ├── downloaded/
        ├── extracted/
        └── arrow/
    hub/
    └── models--julien-c--EsperBERTo-small/
        ├── blobs/
        ├── refs/
        └── [ 128]  snapshots/

Then there's "modules" (top level as well) which is the directory where we copy python scripts. IMO this one doesn't need to be part of this change (for now ?)

For context:

The "modules" directory is added to the python path in datasets and evaluate at run-time once. Inside "modules" we define "datasets_modules" and "evaluate_modules" that can be imported with importlib.import_module("datasets_modules") for example. And finally inside "datasets_modules" we have the datasets scripts that are imported as submodules of "datasets_modules".

I think we keep using one directory "modules" at the top level, this way any HF lib only need to add one directory in the python path. For now it can be excluded from the HFH cache since it only contains small files

lhoestq avatar Sep 27 '22 14:09 lhoestq

We also use the modules folder in the cache to put a transformers_module containing the code of dynamic models/tokenizers/pipelines (that have their code on the Hub).

sgugger avatar Sep 27 '22 15:09 sgugger

Thanks for the feedback :) To answer your questions 1 by 1:

Can it even be possible to have the "datasets" folder one level higher ?

What would be the goal of having it at the root of the cache ? The main advantage when I thought about it was that since it is a new folder in the cache, it would not conflict with any of the existing libraries (datasets, transformers and others). In any case, the downstream library shouldn't care too much where the cache it is only a path returned from hugginface_hub. But of course, I can be convinced here if it makes sense.

Then there's "modules" (top level as well) which is the directory where we copy python scripts.

I am fine with keeping the modules folder at root level. This folder will in any case be quite small, right ? (only python script so no data to delete to save space). But in next iterations, I don't see huge difference with having them under cache/extra/datasets/default/modules/ and cache/extra/transformers/default/modules/ (except not changing the existing code).

IMO this one doesn't need to be part of this change (for now ?)

In any case, as I see it, the huggingface_hub update would just propose a canonical path where to cache stuff. Since stuff is already implemented in datasets, transformers (and cie), we don't have to change everything at once. Sections of the cache can be iteratively moved to the extra/ cache but there is no rush in that.

The "modules" directory is added to the python path in datasets and evaluate at run-time

Does datasets and evaluate share the same python module folder ? If (and only if) we move them to the extra/ cache, we could point evaluate to import for the extra/datasets/modules/ folder since we maintain both librairies. In general I think it's best if each library uses only its own cache but modules is a quite specific use case.

I think we keep using one directory "modules" at the top level, this way any HF lib only need to add one directory in the python path We also use the modules folder in the cache to put a transformers_module containing the code of dynamic models/tokenizers/pipelines

Yep, very specific use case (and already existing) so no need to change IMO.

Wauplin avatar Sep 27 '22 15:09 Wauplin

What would be the goal of having it at the root of the cache ?

Users of datasets that played with the cache before would expect their data to be in .cache/huggingface/datasets as before I think - or at least in a directory at the same level with a clear name ("datasets-cache" or something like that).

Therefore if the main concern is to avoid conflicts I think we can just rename it "datasets-cache"

I am fine with keeping the modules folder at root level. This folder will in any case be quite small, right ?

Yup it's quite small (usually a few KB per dataset)

Does datasets and evaluate share the same python module folder ?

Both add .cache/huggingface/modules in the python path. But in practice they use different folders inside "modules": "datasets_modules" for datasets and "evaluate_modules" for evaluate. So their caches are different, and each lib handles its cache its own way. And we only need to add one directory to python path.

lhoestq avatar Sep 27 '22 16:09 lhoestq

Therefore if the main concern is to avoid conflicts I think we can just rename it "datasets-cache"

I would also like to avoid conflicts with existing folders for the future scan-cache and delete-cache tools. If everything lives under the extra/ directory, I would not have to care about special folders that would not use the same structure like modules or doc_builder. For now I only have those 2 examples on my machine but idk if there are more. Having everything in a separate folder extra/ on which huggingface_hub is the only lib to deal with the structure (at least for the first 3 levels of depths, then under cache/extra/datasets/SQuAD/extracted you can put anything you want) would avoid edge cases.

Also, we it would allow to have a clear distinction between folders that can be scanned via huggingface-cli and the others. If we have datasets/ and datasets-cache/ together next to each other I'd find it more confusing actually. WDYT ?

Yup it's quite small (usually a few KB per dataset) Both add .cache/huggingface/modules in the python path

Then let's not focus too much on that. It light-weight and already working so no need to touch it.

Wauplin avatar Sep 28 '22 09:09 Wauplin

(nit) I just thought about it but renaming extra/ to assets/ sounds better I'd say (I would expect users/devs to understand it better). And cached_assets_folder as helper name.

Wauplin avatar Sep 29 '22 05:09 Wauplin

Ok I see ! I like assets more as well. I'm ok with going for assets then (I just remembered we had a quite explicit message printed every time a dataset is loaded to show its cache location anyway, so it should do the job)

EDIT: hmm actually let me think more about this

lhoestq avatar Sep 29 '22 08:09 lhoestq

Here is another idea we had with @polinaeterna

    datasets-cache/
    └── squad/
        ├── repository/
        │   ├── blobs/
        │   ├── refs/
        │   └── [ 128]  snapshots/
        ├── downloaded/
        ├── extracted/
        ├── python/
        └── arrow/

In my opinion this is ideal for a datasets users: you can delete all your cache in one command rm -rf datasets-cache and delete everything related to one specific dataset in one command rm -rf datasets-cache/<name>

This is a bit different from what's currently done in huggingface_hub but I think it would be awesome

lhoestq avatar Oct 06 '22 14:10 lhoestq

yes this structure is good for us (datasets) and it means that scripts are stored not in modules/datasets_modules but in a specific dataset folder (like squad/python)

but it would require either adding both datasets-cache and evaluate's modules cache folders to python path which might lead to collisions, or higher level ~/.cache/huggingface directory, but this way it might lead to collision with transformers, though the latter seems unlikely because it shouldn't contain __init__.py, as far as I understand, so tell me if I'm wrong.

polinaeterna avatar Oct 06 '22 16:10 polinaeterna

@lhoestq @polinaeterna I'm sorry but I think this will be unlikely to happen exactly like this. Repos from the Hub are now cached under ~/.cache/huggingface/hub no matter which library is using it. Even though I understand to goal of having everything at the same place, including the repo clone (blobs, refs and snapshots), I think this would be too datasets-specific.

I see at least 2 reasons to keep the repository/ folder in the hub/ folder:

  1. It is consistent with what is done with models and spaces (and maybe more in the future ?). This is less confusing for users and implies less case-specific code to maintain. Also benefit from the scan/delete-cache CLI commands to delete only specific revisions of a dataset (is it as common for a dataset than for a model to have multiple revisions with big changes ?).
  2. Others scripts/libraries could also cache datasets from the Hub without using the datasets library. If so, should we download twice the repo (once in datasets-cache/ and once in hub/) ? Or use datasets-cache/ even though the script is not specific from the datasets library ? Both feels wrong to me. I think one of the main reason to start having a unified hub/ folder for any repo was to specifically avoid that (no more transformers-specific cache for example).

That been said, it is still possible to move the python/ folder inside the datasets cache. I also found it cleaner but I thought that was not possible because it requires to add the folder to the python path.

In the end, you can would end up with something like this:

.cache/huggingface/
├─ assets/
│  └─ datasets/
│     └── squad/
│          ├── downloaded/
│          ├── extracted/
│          ├── python/
│          └── arrow/
└─ hub/
   └── datasets--squad/
       ├── blobs/
       ├── refs/
       └── snapshots/

rm -rf ./cache/assets/datasets/squad is still possible but would require to delete the squad cached repo separately.

Would that be ok ?

Wauplin avatar Oct 11 '22 10:10 Wauplin

Hi @Wauplin,

Indeed yesterday we had a meeting of the datasets team and we discussed about this topic. IMO, I totally understand your point: so that we keep current behavior of the hub cache and just add the required specificity by datasets with as less impact as possible to current implementation (in a different directory).

The request by @lhoestq and @polinaeterna of putting everything in a per-dataset directory, has an underlying need: to be able to easily find/delete all cached files of a specific dataset:

  • with their approach, this can easily be done by listing/removing the corresponding dataset directory
  • with your approach, that will be possible by using the scan-cache and delete-cache functions:

Therefore, if we agree on your proposed directory structure:

  • can we list/delete all cached files of a specific datasets, regardless of whether the files are in one directory (hub) or another (extra or assets)?

Additionally, for the "modules" directory, I think you were right (@lhoestq could you confirm this?): they need to be in a separate parent directory (that will be added to the import path), for all datasets: datasets_modules/. Therefore, not possible to have it in a per-dataset directory (squad/python/). Also note that these files are a copy of the previously cached files in your "hub/" directory (scripts are downloaded from the Hub).

Another remark: please note the specificity of the "downloaded" directory:

  • until now, we put there whatever file we downloaded, either from the Hub or from 3rd party servers
  • with the new implementation, the "downloaded" should only be used for 3rd party servers; files downloaded from the Hub will be in your "hub/" directory

I think this difference must be implemented downstream by datasets team.

albertvillanova avatar Oct 11 '22 16:10 albertvillanova

I see, thanks @Wauplin and @albertvillanova :) sounds good to me

super excited about this !

Additionally, for the "modules" directory, I think you were right (@lhoestq could you confirm this?): they need to be in a separate parent directory (that will be added to the import path), for all datasets: datasets_modules/. Therefore, not possible to have it in a per-dataset directory (squad/python/). Also note that these files are a copy of the previously cached files in your "hub/" directory (scripts are downloaded from the Hub).

We could also add the assets directory to sys.path, and name its subdirectory datasets-cache to avoid a module name collision with datasets, but I don't have a strong opinion on this - we can keep the modules dir if we want

lhoestq avatar Oct 12 '22 10:10 lhoestq

We could also add the assets directory to sys.path, and name its subdirectory datasets-cache to avoid a module name collision with datasets, but I don't have a strong opinion on this - we can keep the modules dir if we want

Not so much opinionated about this but if you do so I think you should align with the transformers team as well. Worth noticing that assets/ can be managed by any downstream library (e.g. any lib will be able to have its own directory under assets/) so maybe it's not ideal to add everything to the python path.

Wauplin avatar Oct 12 '22 10:10 Wauplin

I see - let's keep modules as it is then

lhoestq avatar Oct 12 '22 10:10 lhoestq

Ok everyone, thanks for the feedback and discussions. I think we reached a common agreement for this PR then. When it comes to the tools to scan/delete those assets, we can discuss them in a separate issue. They will come once we start to integrate this assets folder with datasets and other downstream libraries :)

@LysandreJik @lhoestq I've ping you for the review but other contributors are welcomed to review if wanted. Thanks in advance !

Wauplin avatar Oct 12 '22 15:10 Wauplin

Thanks for the review @LysandreJik , I'm merging :)

Wauplin avatar Oct 13 '22 08:10 Wauplin

Awesome thanks :)

lhoestq avatar Oct 13 '22 08:10 lhoestq