Annif icon indicating copy to clipboard operation
Annif copied to clipboard

Implement `annif upload-projects` and `download-projects` commands for Hugging Face Hub integration

Open juhoinkinen opened this issue 1 year ago • 10 comments

The organization of the uploaded files follows the 4. option of issue #760:

  1. Separate files for each project, vocab, and projects configs

Compress each project directory into its own zip file (in projects/<project-id>.zip in the repo).

Each vocabularies zip is placed in vocabs/<vocab-id>.zip, and each project configuration is placed in <project-id>.cfg in the repo root.

This option is good for caching by preventing unnecessary uploads/downloads of the projects bundle when only one is changed, and for visibility of the repo contents.

The unzip after download places the data directories directly to data/projects/ and data/vocabs/, and the project configurations as separate <project-id>.cfg files in projects.d/ directory, so after download the projects are directly usable (if projects.{cfg,toml} does not exists or by using the --projects option).

Upload

Push a set of selected projects and vocabularies to a Hugging Face Hub repository as zip files and the configurations of the projects as cfg/ini files. For example, the following command uploads all projects with ids matching *-fi to juhoinkinen/Annif-models-upload-testing:

annif upload-projects "*-fi" juhoinkinen/Annif-models-upload-testing --commit-message "[OptMsg] Upload all Finnish projects"

Download

Downloads the project and vocabulary archives and the configuration files of the projects that match the given and unzip the archives to data/ directory and places the configuration files to projects.d/ directory:

annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing

Note that currently the download will overwrite the project and vocab dirs if they already exist. A git revision (commit hash, branch etc.) can be specified with --revision option.

The downloaded files remain in the huggingface_hub client cache dir.


For all uploads and downloads in the case of private repos the user needs to have logged in with huggingface-cli login, or the HF Hub token can be given also with the --token option of this Annif command.

juhoinkinen avatar Feb 01 '24 13:02 juhoinkinen

The current approach follows the option 3 of issue #760:

3. One file for projects, one for vocab, and one for projects configs

Bundle the selected projects into one zip (yso-fi.zip) and vocabularies into another (yso.zip) and leave projects config file uncompressed. Upload the projects zip to data/projects directory and the vocab zip to data/vocabs.

But the files are uploaded just to the root of the repo with names projects.zip, vocabs.zip and projects.cfg.

One option for naming the files in the repo would be that the current (above names) is the default, but the name-in-repo-id could be given as an optional (3rd) parameter, -yso-fi in the following:

annif upload-projects *-fi juhoinkinen/Annif-models-upload-testing -yso-fi

This would create projects-yso-fi.zip, vocabs-yso-fi.zip, and projects-yso-fi.cfg in the repo. This (optionally renaming uploaded file and parameters order) follows the behaviour of huggingface-cli upload (although there the repoid is the 1st and local_path is 2nd argument; should that order be used in Annif command too...?).

Edit: Or make a directory named upload in the repo, optionally with a custom name given as the 3rd parameter, and upload projects.zip, vocabs.zip and projects.cfg there? Or restrict each repo to be used only for one set of projects, i.e. create repos FintoAI-data-YSO-fi, FintoAI-data-YSO-sv, FintoAI-data-YSO-en?


Now the project configs of the selected projects are written to a cfg file in INI format, whatever the source format is (toml, ini, directory). This is because detecting the original format (or even the path to the projects config file!) did not seem very easy, and also because the standard library has no functionality for writing TOML.

juhoinkinen avatar Feb 01 '24 13:02 juhoinkinen

Codecov Report

Attention: Patch coverage is 99.69697% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 99.64%. Comparing base (df20982) to head (6f35fff). Report is 47 commits behind head on main.

Files Patch % Lines
annif/hfh_util.py 99.30% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #762      +/-   ##
==========================================
- Coverage   99.65%   99.64%   -0.02%     
==========================================
  Files          89       91       +2     
  Lines        6404     6768     +364     
==========================================
+ Hits         6382     6744     +362     
- Misses         22       24       +2     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Feb 06 '24 16:02 codecov[bot]

TODO:

  • [x] Try to find a way to avoid the need to use quotes around projects-id pattern (without them shell expansion kicks in) Edit: Not possible, or at least worth it, see this SO answer.
  • [x] Add --force option to allow overwrite of existing files when downloading; prevent overwrite otherwise
  • [x] Add tests: could use mocking of huggingface_hub methods to assert their calls, e.g. @mock.patch("huggingface_hub.commands.upload.HfApi.upload_file")
  • [x] Speed up CLI startup by moving huggingface_hub imports to functions
  • [x] Update to huggingface-hub 0.21.*
  • [x] Add a page about the HFH integration to Wiki
  • [x] Add upload and download commands to CLI documentation at readthedocs (or check they get automatically included)

@osma could you take a look at the working principles of the commands before I spend more time with this?

The downloaded files now remain in cache for huggingface client, e.g. (/home/local/jmminkin/.cache/huggingface/hub/models--juhoinkinen--Annif-models-upload-testing), should the cache be cleared when successfully unzipped the datadirs and moved the configs?

Could or should these commands be flagged as experimental for now, to allow changing them?

juhoinkinen avatar Feb 08 '24 11:02 juhoinkinen

I took a look at this (sorry for the delay!) and I think this is a really good feature. Some thoughts:

  1. Agree that these commands should be marked as experimental for now. We may want to change the command options, functionality etc. Maybe in the 1.2 release the experimental status could be dropped.
  2. The commands are uploading/downloading not just the project(s), but also the vocabulary(/ies). Would it make sense to shorten the commands to just upload and download?
  3. Regarding overwriting the vocabulary on download, I think this is a problematic feature and needs some thought. For example if I already have trained models with vocabularies in my data directory, and I download another model from HF Hub, it might come with another version of the same vocabulary and thus rewrite my existing vocabulary and break existing projects. So I agree that there should be some kind of check for this and perhaps an --overwrite or --force option before blindly overwriting existing vocabularies.

I tested this on my local install by running the command annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing; I wanted to see what happens when I already have a local yso vocabulary. The overwrite failed with an exception:

$ annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing
Downding project(s): yake-fi
yake-fi.zip: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.63M/2.63M [00:00<00:00, 7.24MB/s]
yake-fi.cfg: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 119/119 [00:00<00:00, 470kB/s]
yso.zip: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 54.6M/54.6M [00:04<00:00, 11.0MB/s]
Traceback (most recent call last):
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/bin/annif", line 6, in <module>
    sys.exit(cli())
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/flask/cli.py", line 357, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/xxx/git/Annif/annif/cli.py", line 690, in run_download_projects
    cli_util.unzip(vocab_zip_local_cache_path)
  File "/home/xxx/git/Annif/annif/cli_util.py", line 316, in unzip
    zfile.extractall()  # TODO Disallow overwrite
  File "/usr/lib/python3.10/zipfile.py", line 1647, in extractall
    self._extract_member(zipinfo, path, pwd)
  File "/usr/lib/python3.10/zipfile.py", line 1701, in _extract_member
    open(targetpath, "wb") as target:
PermissionError: [Errno 13] Permission denied: '/home/xxx/git/Annif/data/vocabs/yso/subjects.dump.gz'

The downloaded files now remain in cache for huggingface client, e.g. (/home/local/jmminkin/.cache/huggingface/hub/models--juhoinkinen--Annif-models-upload-testing), should the cache be cleared when successfully unzipped the datadirs and moved the configs?

Isn't the caching standard behaviour of HF Hub operations? I think just keeping the cache would be the least surprising thing to do here. Maybe there could be a --no-cache or --no-keep option to control caching? What do other tools with HF Hub integration (e.g. Ludwig) do?

osma avatar Feb 29 '24 08:02 osma

About overwriting vocabularies - I think it would be great if Annif would notice, that the downloaded vocabulary differs from what exists locally. It should be no problem to download two different models that use the same vocabulary:

annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing
annif download-projects "yso-fasttext-fi" juhoinkinen/Annif-models-upload-testing
annif download-projects "yso-tfidf-fi" juhoinkinen/Annif-models-upload-testing

But when you have a different version of the vocabulary already present, Annif could show a message such as

Your local version of the "yso" vocabulary differs from the version you are downloading. If you are sure you want to overwrite it, run the command again with the `--overwrite` option. This may break projects that use your current vocabulary.

Even better would be to list the projects that may break.

But how to detect that the local vocabulary is different than the downloaded one? Calculate checksums on the files and compare them?

osma avatar Feb 29 '24 08:02 osma

But how to detect that the local vocabulary is different than the downloaded one? Calculate checksums on the files and compare them?

Actually the ZipFile objects allow getting ZipInfo of each member of the archive, which in turn contain a CRC-32 checksum (and the name and modification timestamp) of the uncomprossed file. And ZipInfo.from_file() class method allows to get the CRC from the existing, uncompressed files, so the checksum comparisons should be quite easy to implement.

Edit: But CRC of the objects constructed by ZipInfo.from_file() throw AttributeError: 'ZipInfo' object has no attribute 'CRC'. In the compressed archive CRC attribute exists. :confused:

This bug?

juhoinkinen avatar Mar 01 '24 07:03 juhoinkinen

If it's the above bug in zipfile, it's claimed to be

Fixed in 3.9.0

Annif just dropped 3.8 support so maybe we're good? Just need to rebase this branch.

osma avatar Mar 01 '24 08:03 osma

The Python versions 3.9.18, 3.10.13 and 3.11.8 all behaved the same.

This seems strange: the code

zinfo = zipfile.ZipInfo.from_file("data/projects/yake-fi/yake-index")
print(dir(zinfo))
print(zinfo.file_size)
print(zinfo.CRC)

outputs

['CRC', 'FileHeader', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_compresslevel', '_decodeExtra', '_encodeFilenameFlags', '_end_offset', '_raw_time', 'comment', 'compress_size', 'compress_type', 'create_system', 'create_version', 'date_time', 'external_attr', 'extra', 'extract_version', 'file_size', 'filename', 'flag_bits', 'from_file', 'header_offset', 'internal_attr', 'is_dir', 'orig_filename', 'reserved', 'volume']
2627360
Traceback (most recent call last):
  File "/home/local/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/bin/annif", line 6, in <module>
    sys.exit(cli())
             ^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/flask/cli.py", line 357, in decorator
    return __ctx.invoke(f, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/myusername/git/Annif/annif/cli.py", line 686, in run_download
    cli_util.unzip(project_zip_local_cache_path, force)
  File "/home/local/myusername/git/Annif/annif/cli_util.py", line 325, in unzip
    _is_existing(member)
  File "/home/local/myusername/git/Annif/annif/cli_util.py", line 351, in _is_existing
    print(zinfo.CRC)
          ^^^^^^^^^
AttributeError: 'ZipInfo' object has no attribute 'CRC'

The CRC attribute is created using ZipInfo.__slots__(), but then it is not initialized when running ZipInfo.from_file()?

Also trying to access zinfo.FileHeader() fails similarly, with Traceback ending

...
File "/usr/lib/python3.11/zipfile.py", line 453, in FileHeader
    CRC = self.CRC
          ^^^^^^^^
AttributeError: 'ZipInfo' object has no attribute 'CRC'

Anyway, the CRC-32 checksums for the existing, uncompressed files could be calculated by binascii.crc32().

juhoinkinen avatar Mar 01 '24 09:03 juhoinkinen

I added --force option to overwrite existing local files (project and vocabulary data files and configs), and modified the unzip/config-copy logic so that when the contents are identical (downloaded vs local), there the unzip/config-copy is skipped.

I did not see a reason to restrict the content comparison to only vocabularies(?) (and it was easier to implement both projects and vocabs :slightly_smiling_face:).

Demo with

#!/bin/bash

rm -r data/projects/{fasttext-fi,tfidf-fi,yake-fi}

echo "# Initial download"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing
echo "# Second download with identical content, no-op"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing

echo $RANDOM > data/projects/tfidf-fi/file.txt
echo $RANDOM >> projects.d/tfidf-fi.cfg
echo "# Third download over changed content, skips file overwrites with complains"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing

echo "# Forced third download over changed content"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing --force
tree data

The output:

# Initial download
Downloading project(s): tfidf-fi
# Second download with identical content, no-op
Downloading project(s): tfidf-fi
# Third download over changed content, skips file overwrites with complains
Downloading project(s): tfidf-fi
Not overwriting data/projects/tfidf-fi/file.txt (use --force to override)
Not overwriting projects.d/tfidf-fi.cfg (use --force to override)
# Forced third download over changed content
Downloading project(s): tfidf-fi
data
├── projects
│   └── tfidf-fi
│       └── file.txt
└── vocabs
    └── yso
        ├── subjects.csv
        ├── subjects.dump.gz
        └── subjects.ttl

4 directories, 4 files

juhoinkinen avatar Mar 01 '24 14:03 juhoinkinen

Force-pushed to squash multiple refactoring and renaming commits.

juhoinkinen avatar Mar 12 '24 06:03 juhoinkinen

I noticed that after

annif download yso-bonsai-en juhoinkinen/Annif-models-upload-testing

the modification time of omikuji projects as detected by Annif is wrong; it is shown to be the extraction time:

annif show-project yso-bonsai-en 
...
Modification time: 2024-03-13 11:36:22

This is because of the timetamp of the data/projects/yso-bonsai-en/omikuji-model which is set to the time when the files it contains get extracted.

This PR implements restoration of the timestamps of the extracted files and directories, but this happens zipmember-by-zipmember, and typically the project subdirectories are extracted first (yso-bonsai-en/omikuji-model/).

I think the modification time detection should be changed to disregard subdirectories, and not to try handle this issue in this PR.

juhoinkinen avatar Mar 13 '24 11:03 juhoinkinen

Now I see that the --revision option of the download command is not so useful: each file gets uploaded separately, and so they are included in separate commits with different commit hashes (e.g. for the project config file /yso-mllm-en.cfg vs. the related data zip file projects/yso-mllm-en.zip).

So this happens if I try to use a revision with the commit hash of projects/yso-mllm-en.zip:

annif download yso-mllm-en NatLibFi/FintoAI-data-YSO --revision 0ff6f970347ff671470f7e33ed4ea1a949e02597
Downloading project(s): yso-mllm-en
Error: Operation failed: 404 Client Error. (Request ID: Root=1-65f19f92-22b20ed73b53cacb547eda8c;1c4d5379-92cf-41b6-baa0-51e2b11c837f)

Entry Not Found for url: https://huggingface.co/NatLibFi/FintoAI-data-YSO/resolve/0ff6f970347ff671470f7e33ed4ea1a949e02597/yso-mllm-en.cfg.

Using a branch name as the revision works (--revision main), but now there is no way to set it in the upload command. It is not possible to create a new branch when using HFApi.upload_file.

So at this time I'm in favor of dropping the --revision option from download command. OFC there is a way to create a branch with HFApi.create_branch, but I don't think it is not so needed at this time.

Edit: I already removed --revision option.

Any thoughts @osma?

juhoinkinen avatar Mar 13 '24 13:03 juhoinkinen

I created a Wiki page about the HFH integration: https://github.com/NatLibFi/Annif/wiki/Hugging-Face-Hub-integration.

juhoinkinen avatar Mar 13 '24 13:03 juhoinkinen

I think this is ready for review.

Some points to note and consider:

  • The branch is based on PR https://github.com/NatLibFi/Annif/pull/771 branch to have some dependency conflicts fixed.
  • CodeClimate still complains about complexity. I don't see good ways to resolve the function complexities further. Some code could be moved from cli_util.py to util.py to shorten the file, but what...
  • Test coverage is 100%, but I don't think all functionalities get properly tested.
  • @monalehtinen checked downloading models uploaded by me worked without permission issues.
  • Performance wise, uploading Finto AI model yso-bonsai-en (1.6 GBs) took ~3 min, and yso-mllm-en, yso-bonsai-en and yso-fasttext-en took 7 min (at the office).
  • How in practice the commands should me indicated to be experimental? CLI messages when using them seem too much, so maybe just state it in release notes?
  • Is the syntax annif download <PROJECT_IDS_PATTERN> <REPO_ID> actually good?
    • The order of repository and projects could be the opposite; huggingface-cli download expects first repo-id and then the file(s). However, Annif commands generally use the pattern annif <command> <project-id>, so I chose the current order to be in line with other commands.
    • Instead of project-ids pattern the commands could expect just the project-ids of the projects to operate on.
  • We could somehow help or at least instruct to create a model card for the uploaded models, maybe with a template (there is an official template, but as such it might be a bit too complex)

juhoinkinen avatar Mar 13 '24 13:03 juhoinkinen

Now I see that the --revision option of the download command is not so useful: each file gets uploaded separately, and so they are included in separate commits with different commit hashes (e.g. for the project config file /yso-mllm-en.cfg vs. the related data zip file projects/yso-mllm-en.zip).

So this happens if I try to use a revision with the commit hash of projects/yso-mllm-en.zip:

annif download yso-mllm-en NatLibFi/FintoAI-data-YSO --revision 0ff6f970347ff671470f7e33ed4ea1a949e02597
Downloading project(s): yso-mllm-en
Error: Operation failed: 404 Client Error. (Request ID: Root=1-65f19f92-22b20ed73b53cacb547eda8c;1c4d5379-92cf-41b6-baa0-51e2b11c837f)

Entry Not Found for url: https://huggingface.co/NatLibFi/FintoAI-data-YSO/resolve/0ff6f970347ff671470f7e33ed4ea1a949e02597/yso-mllm-en.cfg.

Using a branch name as the revision works (--revision main), but now there is no way to set it in the upload command. It is not possible to create a new branch when using HFApi.upload_file.

So at this time I'm in favor of dropping the --revision option from download command. OFC there is a way to create a branch with HFApi.create_branch, but I don't think it is not so needed at this time.

Edit: I already removed --revision option.

Any thoughts @osma?

Actually multiple files could be included in one commit using CommitOperationAdd() and create_commit(), but it seems that for this all the included files should exist until running create_commit(), and the zips can be quite large.

Edit: Now when thinking this, the possibility to download a specific "published version" of the models would be very valuable and it is not possible with the current code. The --revision option should be restored, and maybe it would be enough to instruct to use it to target just branches or tags (which I assume works with the current upload_file() approach) and not to target commits.

Edit2: But there is still the question of to how to tag the files. There are methods for tagging, huggingface_hub.Repository.add_tag() and huggingface_hub.HfApi.create_tag (), but tagging is not possible with the huggingface-cli, which would have been the easiest option. And I don't see a way to create tags in Hugging Face Hub website either. So tagging should be implemented in Annif, if it is wanted to be possible to use tags.

juhoinkinen avatar Mar 19 '24 12:03 juhoinkinen

I commented on huggingface-hub repository about git tags, and they promptly created an issue for creating, listing and deleting tags with the huggingface-cli (also wrote a question on HF forum).

If/when it gets implemented, I think it would be most convenient for the case of Annif models if the tagging would be performed after annif upload <projects-pattern> <repo-id> using huggingface-cli, e.g. like huggingface-cli tag <repo-id> <tag-name>.

juhoinkinen avatar Mar 22 '24 12:03 juhoinkinen

Actually multiple files could be included in one commit using CommitOperationAdd() and create_commit(), but it seems that for this all the included files should exist until running create_commit(), and the zips can be quite large.

The preupload_lfs_files() function helped to put all files in one commit but to upload them separately.

When testing with yso-*fi projects of FintoAI (10.6 GBs uncompressed, 8.8 GBs compressed) /usr/bin/time -v reported Maximum resident set size (kbytes): 49616 (and I did not notice much increase in disk space usage); upload took about 20 min at the office.

juhoinkinen avatar Mar 27 '24 09:03 juhoinkinen

While uploading projects to FintoAI-data-YSO with annif upload "*" NatLibFi/FintoAI-data-YSO , this error occurred on two tries:

...
  File "/home/local/jmminkin/git/Annif/annif/cli_util.py", line 259, in prepare_datadir_commit
    fobj = _archive_dir(data_dir)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/jmminkin/git/Annif/annif/cli_util.py", line 285, in _archive_dir
    with zipfile.ZipFile(fp, mode="w") as zfile:
  File "/usr/lib/python3.11/zipfile.py", line 1355, in __exit__
    self.close()
  File "/usr/lib/python3.11/zipfile.py", line 1911, in close
    self.fp.seek(self.start_dir)
OSError: [Errno 28] No space left on device

However I had about 60 GBs of free disk space. And preuploading should free memory of the uploaded objects, so it should not be memory issue either... Anyway, this could be circumvented by uploading only projects of one language by one command.

juhoinkinen avatar Apr 10 '24 12:04 juhoinkinen

:tada:

osma avatar Apr 23 '24 07:04 osma