# Hugging Face integration
The :hugs: Hugging Face Hub facilitates hosting and sharing AI models and datasets (as well as demo applications), and NatLibFi now also has an organization account on the Hugging Face Hub.
The data (models and datasets) in the HF Hub live in git repositories, and git can be used to handle the data (commit, push, pull...). However, direct integration of applications with the HF Hub is also supported via the `huggingface_hub` Python library, which is usable as a CLI tool too.
Annif could have the functionality to push (and pull) projects or project sets to (and from) the HF Hub. It should be able to operate on project sets, both because an ensemble project requires its base projects to be available and for convenience.
There could be the following CLI command to push a set of projects to the HF Hub:

```
annif upload-projects <glob-pattern> <username/reponame> [--options]
```
For example

```
annif upload-projects yso-*fi NatLibFi/FintoAI-data-YSO
```

would upload the specified projects to the NatLibFi/FintoAI-data-YSO repository.
The files and dirs needed to be uploaded are

- `data/projects/<project-id>`: the project directories
- `data/vocabs/<vocab-id>`: the vocabularies of the projects
- `projects.{cfg,toml,d}`: the configurations of the projects
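The selection of files to upload could be sketched roughly as follows. This is only an illustration under assumptions: the function name `resolve_upload_paths` is hypothetical, and for simplicity it lists all vocabularies instead of only those the selected projects use.

```python
from fnmatch import fnmatch
from pathlib import Path


def resolve_upload_paths(pattern, base="."):
    """Collect the project dirs, vocab dirs and config files to upload.

    Assumes the standard Annif data layout: data/projects/<project-id>,
    data/vocabs/<vocab-id>, and projects.* config files in `base`.
    """
    base = Path(base)
    projects = sorted(p for p in (base / "data" / "projects").iterdir()
                      if p.is_dir() and fnmatch(p.name, pattern))
    # All vocabularies are listed for simplicity; a real implementation
    # would include only the vocabs the selected projects actually use.
    vocabs = sorted(v for v in (base / "data" / "vocabs").iterdir()
                    if v.is_dir())
    configs = sorted(base.glob("projects.*"))
    return projects, vocabs, configs
```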
## Options for bundling and uploading
### 1. Single file
Bundle all files into one zip named e.g. `yso-fi.zip` (possibly including only the configs of the selected projects) and upload it to the root of the repo.
The filename could be derived from the glob pattern of the projects, or it could be a required argument of the upload command (as a 2nd argument, to be added to the example above).
This option would be the easiest for downloads: just wget one file and unzip it.
### 2. One file for projects and vocabs, and one for project configs
Bundle the project and vocabulary directories into one zip and leave the projects config file uncompressed.
### 3. One file for projects, one for vocabs, and one for project configs
Bundle the selected projects into one zip (`yso-fi.zip`) and the vocabularies into another (`yso.zip`), and leave the projects config file uncompressed. Upload the projects zip to the `data/projects` directory and the vocab zip to `data/vocabs`.
### 4. Separate files for each project, vocab, and project configs
Compress each project directory into its own zip (`<project-id>.zip`).
For downloads with this option one should use e.g. `wget --accept "yso*-fi.zip"` for the projects.
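The per-project variant could be sketched as below; the function name `zip_each_project` and the choice to store files under a `<project-id>/` prefix inside each archive are assumptions.

```python
import zipfile
from pathlib import Path


def zip_each_project(projects_dir, out_dir):
    """Create one <project-id>.zip per project directory (a sketch of
    option 4)."""
    zips = []
    for proj in sorted(Path(projects_dir).iterdir()):
        if not proj.is_dir():
            continue
        zip_path = Path(out_dir) / f"{proj.name}.zip"
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in sorted(proj.rglob("*")):
                if f.is_file():
                    # Store as <project-id>/... so unzipping recreates the dir.
                    zf.write(f, f.relative_to(proj.parent))
        zips.append(zip_path)
    return zips
```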
Some details and ideas:

- There exists the `upload_file` method in the Python client library that could be used for this.
- There exists also a `ModelHubMixin` class which could help the integration.
- Authentication to the HF Hub could be performed with a user access token, which can be saved locally using the `huggingface-cli login` command.
- The options could support a subset of the options of the `huggingface-cli upload` CLI command (for commit message etc.).
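The `upload_file` call mentioned above could be wrapped roughly like this. A sketch only: the helper names and the default commit message are assumptions, and the token stored by `huggingface-cli login` is picked up automatically by the library.

```python
from pathlib import Path


def upload_kwargs(local_path, repo_id, path_in_repo=None, commit_message=None):
    """Build the arguments for huggingface_hub.upload_file (a sketch;
    the default commit message is an assumption)."""
    name = Path(local_path).name
    return {
        "path_or_fileobj": str(local_path),
        "path_in_repo": path_in_repo or name,
        "repo_id": repo_id,
        "commit_message": commit_message or f"Upload {name}",
    }


def push_to_hub(local_path, repo_id, **kwargs):
    # Deferred import so the sketch can be read (and the helper tested)
    # without huggingface_hub installed.
    from huggingface_hub import upload_file
    return upload_file(**upload_kwargs(local_path, repo_id, **kwargs))
```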
## Downloading projects
We could also implement a feature to fetch projects from the HF Hub, for example:
```
annif download-project <username/reponame> <projects-set-file> [--options]
```
But implementing this is probably best done only after the upload functionality, since downloading from the HF Hub can also be done simply with wget or curl. However, if the download function is known to be coming, the hierarchy and structure of the data files in the repo should be designed with it in mind.
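For plain wget/curl-style downloads, files in a HF Hub repo are reachable via `resolve` URLs of the form `https://huggingface.co/<repo_id>/resolve/<revision>/<filename>`. A minimal sketch, assuming the bundles were built with repo-relative paths as in the upload options above (the helper names are hypothetical):

```python
import zipfile
from urllib.request import urlretrieve


def hub_file_url(repo_id, filename, revision="main"):
    """Direct-download URL for a file in a HF Hub repo; the same URL
    works with wget or curl."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_and_unzip(repo_id, filename, dest="."):
    """Fetch a project bundle and extract it in place (a sketch)."""
    local, _ = urlretrieve(hub_file_url(repo_id, filename))
    with zipfile.ZipFile(local) as zf:
        zf.extractall(dest)
```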
Very excited to see this! Feel free to ping me if you need any support with anything on the HF side :)