Automatically add metadata to Hugging Face Hub repos when uploading projects
With this PR, when running `annif upload`:
- if `README.md` (Model Card) does not exist in the destination repository, then `README.md` is created with default contents and some metadata of the uploaded projects,
- if `README.md` exists, its metadata are updated as necessary.
Closes #790.
The metadata includes these:
language:
- <language-code tags automatically obtained from the uploaded projects>
tags:
- annif # custom tag
pipeline_tag: text-classification # HFH tag
The Model Card text content is very minimal: it has just the repo name as the heading and information on how to download projects from the repo. See an example at https://huggingface.co/juhoinkinen/Annif-models-upload-testing.
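The metadata update described above can be illustrated with a minimal sketch. This is not the code from this PR: the function name `upsert_metadata`, the helper `_as_list`, and the merge rules (union of languages, keep existing tags, keep an existing `pipeline_tag`) are assumptions made for illustration only.

```python
from __future__ import annotations

from typing import Any


def _as_list(value: Any) -> list[str]:
    """Normalize a metadata value (missing, single string, or list) to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def upsert_metadata(
    existing: dict[str, Any], project_languages: list[str]
) -> dict[str, Any]:
    """Hypothetical helper: return updated Model Card metadata.

    Assumed merge rules (not confirmed by the PR):
    - languages already on the card are kept, project languages are added
    - existing tags are kept and the "annif" tag is added once
    - an existing pipeline_tag is left untouched
    """
    updated = dict(existing)  # unrelated keys are preserved
    languages = set(_as_list(existing.get("language"))) | set(project_languages)
    updated["language"] = sorted(languages)

    tags = _as_list(existing.get("tags"))
    if "annif" not in tags:
        tags.append("annif")
    updated["tags"] = tags

    updated.setdefault("pipeline_tag", "text-classification")
    return updated
```

Calling the helper with an empty dict corresponds to the case where no `README.md` exists yet; calling it with the metadata of an existing card corresponds to the update case.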
About @osma's suggestions in https://github.com/NatLibFi/Annif/issues/790#issuecomment-2137376118:
> For example it could include the Annif version used for training, the backend, vocabulary name and size, possibly some of the hyperparameters / configuration settings as well.
- Annif version:
  - The Annif version used for training is not stored anywhere at the moment; the version performing the upload is not necessarily the same. This kind of metadata should first be stored somewhere, for which there is the issue https://github.com/NatLibFi/Annif/issues/329
- Backend, vocabulary name and other project configuration:
  - These are available in the `<project-id>.cfg` files, accessible from the Files and versions tab, e.g. https://huggingface.co/NatLibFi/FintoAI-data-YSO/blob/main/yso-en.cfg, so I think they are not worth putting in the Model Card.
Quality Gate passed

Issues:
- 6 new issues
- 0 accepted issues

Measures:
- 0 security hotspots
- No data about coverage
- 0.0% duplication on new code
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 99.65%. Comparing base (3b5f7a1) to head (e4febab). Report is 51 commits behind head on main.
Additional details and impacted files
| Coverage Diff | main | #793 | +/- |
|---|---|---|---|
| Coverage | 99.64% | 99.65% | +0.01% |
| Files | 91 | 93 | +2 |
| Lines | 6817 | 7058 | +241 |
| Hits | 6793 | 7034 | +241 |
| Misses | 24 | 24 | |
@CodiumAI-Agent /review
PR Reviewer Guide 🔍
(Review updated until commit https://github.com/NatLibFi/Annif/commit/845f53d74fee07c94b7f97be5dbd73550eb4ef58)
| ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪ |
| 🧪 PR contains tests |
| 🔒 No security concerns identified |
| ⚡ Key issues to review Error Handling Configuration Error Handling |
> Possible Bug: Ensure that the `upsert_modelcard` function handles cases where project language data might be missing or malformed. The current implementation assumes that `proj.vocab_lang` is always available and valid.
Good point by the AI, but I think the project language is always set if this point is reached...?
Persistent review updated to latest commit https://github.com/NatLibFi/Annif/commit/845f53d74fee07c94b7f97be5dbd73550eb4ef58
I added an automatically updating Projects section to the Model Card, like this: https://huggingface.co/juhoinkinen/Annif-models-upload-testing#projects
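One way such an automatically updating section could work is to delimit it with marker comments and rewrite only the text between them on each upload. The sketch below is an assumption about the general technique, not the implementation in this PR; the marker strings and the function name `update_projects_section` are hypothetical.

```python
import re

# Hypothetical markers delimiting the auto-generated part of the card.
START = "<!-- projects:start -->"
END = "<!-- projects:end -->"


def update_projects_section(card_text: str, project_ids: list[str]) -> str:
    """Insert or replace the Projects section, leaving other text untouched."""
    listing = "\n".join(f"- {pid}" for pid in sorted(project_ids))
    section = f"{START}\n## Projects\n{listing}\n{END}"

    pattern = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)
    if pattern.search(card_text):
        # A function replacement avoids interpreting backslashes in `section`.
        return pattern.sub(lambda _match: section, card_text, count=1)
    # No section yet: append it to the end of the card.
    return card_text.rstrip("\n") + "\n\n" + section + "\n"
```

Because only the marked span is rewritten, any text a maintainer adds by hand elsewhere in `README.md` survives later uploads.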
Quality Gate passed

Issues:
- 4 new issues
- 0 accepted issues

Measures:
- 0 security hotspots
- 0.0% coverage on new code
- 0.0% duplication on new code