
Add Xtransformer to backend

Open Lakshmi-bashyam opened this issue 1 year ago • 26 comments

This PR adds XTransformer as an optional dependency, incorporates minor changes, and updates the backend implementation to align with the latest Annif version, building on the previous XTransformer PR #540.

Lakshmi-bashyam avatar Sep 16 '24 15:09 Lakshmi-bashyam

Codecov Report

Attention: Patch coverage is 30.68182% with 183 lines in your changes missing coverage. Please review.

Project coverage is 97.25%. Comparing base (6bae2e5) to head (0e9ad2c).

Files with missing lines Patch % Lines
tests/test_backend_xtransformer.py 9.27% 88 Missing :warning:
annif/backend/xtransformer.py 8.42% 87 Missing :warning:
annif/backend/__init__.py 16.66% 5 Missing :warning:
tests/test_backend.py 40.00% 3 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #798      +/-   ##
==========================================
- Coverage   99.64%   97.25%   -2.40%     
==========================================
  Files          99      101       +2     
  Lines        7349     7606     +257     
==========================================
+ Hits         7323     7397      +74     
- Misses         26      209     +183     

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar Sep 17 '24 06:09 codecov[bot]

Thanks a lot for this new PR @Lakshmi-bashyam ! It really helps to have a clean starting point based on the current code.

We've now tested this briefly. We used the PLC (YKL) classification task, because it seemed simpler than predicting YSO subjects and the current classification quality (mainly using Omikuji Parabel and Bonsai) is not that good, so it seemed likely that a new algorithm could achieve better results. (And it did!)

I set this up in the University of Helsinki HPC environment. We got access to an A100 GPU (which is way overkill for this...) so it was possible to train and evaluate models in a reasonable time.

Here are some notes, comments and observations:

Default BERT model missing

Training a model without setting model_shortcut didn't work for me. Apparently the model distilbert-base-multilingual-uncased cannot be found on HuggingFace Hub (maybe it has been deleted?). I set model_shortcut="distilbert-base-multilingual-cased" and it started working. (Later I changed to another BERT model, see below)

Documentation and advice

There was some advice and a suggested config in this comment from Moritz. I think we would need something like this to guide users (including us at NLF!) on how to use the backend and what configuration settings to use. Eventually this could be a wiki page for the backend like the others we have already, but for now just a comment in this PR would be helpful for testing.

Here is the config I currently use for the YKL classification task in Finnish:

[ykl-xtransformer-fi]
name="YKL XTransformer Finnish"
language="fi"
backend="xtransformer"
analyzer="simplemma(fi)"
vocab="ykl"
batch_size=16
truncate_length=256
learning_rate=0.0001
num_train_epochs=3
max_leaf_size=18000
model_shortcut="TurkuNLP/bert-base-finnish-cased-v1"

Using the Finnish BERT model improved results a bit compared to the multilingual BERT model. It's a little slower and takes slightly more VRAM (7GB instead of 6GB in this task), probably because it's not a DistilBERT model.

This configuration achieves a Precision@1 score of 0.59 on the Finnish YKL classification task, which is slightly higher than what we get with Parabel and Bonsai (0.56-0.57).

If you have any insight into how to choose appropriate configuration settings based on e.g. the training data size, vocabulary size, task type, available hardware etc., that would be very valuable to include in the documentation. Pecos has tons of hyperparameters!

Example questions that I wonder about:

  1. Does the analyzer setting affect what the BERT model sees? I don't think so?
  2. How to select the number of epochs? (so far I've tried 1, 2 and 3 and got the best results with 3 epochs)
  3. How to set truncate_length and what is the maximum value? Can I increase it from 256 if my documents are longer than this?
  4. How to set max_leaf_size?
  5. How to set batch_size?
  6. Are there other important settings/hyperparameters that could be tuned for better results?

Pecos FutureWarning

I saw this warning a lot:

/home/xxx/.cache/pypoetry/virtualenvs/annif-fDHejL2r-py3.10/lib/python3.10/site-packages/pecos/xmc/xtransformer/matcher.py:411: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.

However, I think this is a problem in Pecos and probably not something we can easily fix ourselves. Maybe it will be fixed in a later release of Pecos. (I used libpecos 1.25 which is currently the most recent release on PyPI)
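
If the repeated warning gets too noisy in logs, it can be silenced locally while waiting for an upstream fix. A minimal sketch, purely cosmetic (it does not change how Pecos calls torch.load):

# Silence the Pecos torch.load FutureWarning; the real fix belongs upstream.
import warnings
warnings.filterwarnings(
    "ignore",
    message=r"You are using `torch\.load` with `weights_only=False`",
    category=FutureWarning,
)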

Not working under Python 3.11

I first tried Python 3.11, but it seemed that there was no libpecos wheel for this Python version available on PyPI (and it couldn't be built automatically for some reason). So I switched to Python 3.10 for my tests. Again, this is really a problem with libpecos and not with the backend itself.

Unit tests not run under CI

The current tests seem to do a lot of mocking to avoid actually training models. This is probably sensible since actually training a model could require lots of resources. However, the end result is that test coverage is quite low, with less than 10% of lines covered.

Looking more closely, it seems like most of the tests aren't currently executed at all under GitHub Actions CI. I suspect this is because this is an optional dependency that isn't installed in the CI environment, so the tests are skipped. Fixing this in the CI config (.github/workflows/cicd.yml) should substantially improve the test coverage.
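
For reference, the skipping most likely happens through the usual optional-dependency guard at the top of the test module; a sketch of that common pattern (the actual guard in this PR may differ):

# If the optional libpecos dependency isn't installed in the CI environment,
# pytest.importorskip() marks every test in this module as skipped.
import pytest
pecos = pytest.importorskip("pecos")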

Code style and QA issues

There are some complaints from QA tools about the current code. These should be easy to fix. Not super urgent, but they should be fixed before we can consider merging this. (If some things are hard to fix we can reconsider them case by case)

  • Lint with Black fails in the CI run. The code doesn't follow Black style. Easy to fix by running black
  • SonarCloud complains about a few variable names and return types
  • github-advanced-security complains about imports (see previous comment above)

Dependency on PyTorch

Installing this optional dependency brings in a lot of dependencies, including PyTorch and CUDA. The virtualenv in my case (using poetry install --all-extras) is 5.7GB, while another one for the main branch (without pecos) is 2.6GB, an increase of over 3GB. I wonder if there is any way to reduce this? Especially if we want to include this in the Docker images, the huge size could become a problem.

Also, the NN ensemble backend is implemented using TensorFlow. It seems a bit wasteful to depend on both TensorFlow and PyTorch. Do you think it would make sense to try to reimplement the NN ensemble in PyTorch? This way we could at least drop the dependency on TensorFlow.
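
To make the idea concrete, here is a very rough sketch of what a PyTorch port of the ensemble network could look like; the layer sizes and structure are placeholders, not the current TensorFlow architecture:

import torch
from torch import nn

class EnsembleNet(nn.Module):
    """Combine per-source suggestion scores into one merged score vector."""
    def __init__(self, n_sources: int, n_subjects: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                               # (batch, n_sources * n_subjects)
            nn.Linear(n_sources * n_subjects, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, n_subjects),
            nn.Sigmoid(),                               # merged scores in [0, 1]
        )

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, n_sources, n_subjects)
        return self.net(scores)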


Again, thanks a lot for this and apologies for the long silence and the long comments! We can of course do some of the remaining work to get this integrated and merged on our side, because this seems like a very useful addition to the Annif backends. Even if you don't have any time to work on the code, just providing some advice on the configuration side would help a lot! For example, example configurations you've used at ZBW would be nice to see.

osma avatar Sep 25 '24 10:09 osma

Quality Gate failed

Failed conditions
11.5% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

sonarqubecloud[bot] avatar Sep 25 '24 10:09 sonarqubecloud[bot]

Especially if we want to include this in the Docker images, the huge size could become a problem.

I built a Docker image from this branch, and its size is 7.21 GB, which is quite a lot bigger than the Annif 1.1 image at 2.07 GB.

Not all users and use cases will need XTransformer or other optional dependencies, so we could build different variants of the image and push them to quay.io (just by setting different build args in the GitHub Actions build step and tagging the images appropriately). But that can be done in a separate PR; I'll create an issue for this now.

juhoinkinen avatar Sep 26 '24 07:09 juhoinkinen

Hello, thank you for your work on this PR! At the German National Library, we are also experimenting with XR-Transformer. We would be glad to contribute, especially with regard to documentation and training advice.

A good starting point might be the hyperparameters used in the original paper. They can be found here. Different settings were used for different datasets.

We also observed that the choice of Transformer model can have an impact on the results. In the original paper and in our experiments, the RoBERTa model performed well. We used xlm-roberta-base, a multilingual model trained on 100 languages.

Are there other important settings/hyperparameters that could be tuned for better results?

We found that tuning the hyperparameters associated with the Partitioned Label Tree (known as Indexer in XR-Transformer) and the hyperparameters of the OVA classifiers (known as Ranker in XR-Transformer) led to notable improvements in our results. In particular:

  • nr_splits (& min_codes): Number of child nodes. This hyperparameter can be compared to cluster_k in Omikuji. For us, bigger values like 256 led to better results.
  • max_leaf_size: We observed that bigger values perform better. We currently use 400.
  • Cp & Cn are the costs for wrongly classified labels used in the OVA classifiers: Cp is the cost for wrongly classified positive labels, Cn the cost for negative labels. Using different penalties for positive and negative labels is especially helpful when labels are imbalanced, which is probably the case for OVA classifiers. These hyperparameters had a huge influence on our results. Further reading
  • threshold: A regularisation method. Model weights in the OVA classifiers that fall below the threshold are set to zero. Choosing a high value here will reduce model size, but might lead to a model that is underfitting. Choosing a very low value might lead to overfitting. We achieve good performance with 0.015.

As far as I can tell, some of these are not currently integrated in the PR here.
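
For quick reference, here are the starting values suggested above collected in one place (plain Python for readability; the exact parameter names and nesting expected by Pecos or a future Annif config may differ):

# Starting points gathered from the discussion above, following XR-Transformer
# terminology; the Cp/Cn values of 1.0 are my own placeholders to be tuned.
xr_transformer_starting_points = {
    "nr_splits": 256,      # child nodes per PLT level (cf. cluster_k in Omikuji)
    "min_codes": 256,      # often kept in the same range as nr_splits
    "max_leaf_size": 400,  # larger leaves worked better in our experiments
    "Cp": 1.0,             # cost of misclassified positive labels (tune for imbalance)
    "Cn": 1.0,             # cost of misclassified negative labels
    "threshold": 0.015,    # zero out OVA weights below this value to shrink the model
}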

How to set truncate_length and what is the maximum value? Can I increase it from 256 if my documents are longer than this?

The maximum length of the transformer model limits this. For instance, for BERT this is 512. The authors noted that there was no significant performance increase when using 512, and we observed the same thing.

How to set batch_size?

This also depends on how big a batch fits into the memory of the GPU/CPU being used. Generally, starting out with a value like 32 or 64 works well, then increasing it (if possible) to see if this leads to improvements. I also found this forum exchange where it's stated that:

Batch size is a slider on the learning process. Small values give a learning process that converges quickly at the cost of noise in the training process. Large values give a learning process that converges slowly with accurate estimates of the error gradient.

I have attached the hyperparameter configuration file that we currently use. Even though we don't use Annif in our experiments, I hope this can still provide some helpful insights. params.txt

I am happy to answer any questions and contribute to the Wiki if needed!

katjakon avatar Oct 01 '24 12:10 katjakon

Validation Data during Training

I've been testing this Annif version with XTransformer and so far it's working pretty well. Thanks again! However, I noticed that no validation data is used during training. I think validation is crucial for XTransformer to avoid overfitting and to save only the best performing model checkpoints. Is there any way to include a validation file, especially when using the annif train command? I would appreciate any comments or hints!

katjakon avatar Nov 20 '24 12:11 katjakon

Thank you very much @katjakon for your very insightful comments!

osma avatar Nov 21 '24 08:11 osma

I have just discussed the options for integrating validation data into the backend with @katjakon. I agree with Katja that avoiding overfitting in the training process is crucial. We see two options:

  • a) add another argument to annif train, to allow the user to pass a separate validation dataset
  • b) implement a splitting procedure as part of the backend

My colleagues who operate Annif in practice at DNB usually have a validation split in their data management, so I think option a) would be feasible at DNB. Any opinions on this?

mfakaehler avatar Nov 27 '24 12:11 mfakaehler

Thanks for your insight @mfakaehler and @katjakon ! I agree that making it possible to provide a separate validation data set during XTransformer training makes sense. But the CLI would have to accommodate this.

Already the annif train command can take any number of paths (so you can pass multiple train files/directories), so adding another positional argument isn't easy. However, there could be a new option such as --validate that could be given a path to a validation data set (it could even be repeated, I think, if there's a need for passing multiple paths). So the train command could look like this:

annif train my-xtransformer --validate validate.tsv.gz train1.tsv.gz train2.tsv.gz train3.tsv.gz

Then the question becomes: should --validate be a required parameter when training XTransformer? Or would the backend in that case perform the split on its own? (perhaps defaulting to e.g. 10% for validation, with another option to override the fraction).
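
To illustrate, a hypothetical sketch of how such an option could look on the CLI side, assuming Annif's click-based commands; the option names and defaults are placeholders, not a final design:

import click

@click.command("train")
@click.argument("project_id")
@click.argument("paths", type=click.Path(exists=True), nargs=-1)
@click.option("--validate", "validate_paths", multiple=True,
              type=click.Path(exists=True),
              help="Path(s) to validation documents; may be repeated.")
@click.option("--validation-split", default=0.1, show_default=True,
              help="Fraction of training data held out when no --validate is given.")
def run_train(project_id, paths, validate_paths, validation_split):
    # Pass explicit validation data to the backend if given; otherwise the
    # backend would split off validation_split of the training documents.
    ...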

osma avatar Nov 27 '24 13:11 osma

A default logic like

  • the user provides a --validate argument
  • if not: the user provides a splitting fraction
  • if not: the splitting fraction defaults to 10%

as you suggested seems plausible to me! Let's wait until @Lakshmi-bashyam returns and see if there is an argument for the use case of no validation data. Maybe there is a need for that, too.

mfakaehler avatar Nov 28 '24 06:11 mfakaehler

We used XTransformer for the LLMs4Subjects task, and it worked quite well.

For English documents XTransformer gave an F1@5 score of 0.3091, while Omikuji Bonsai and MLLM gave 0.3234 and 0.2281, respectively; when XTransformer was added to an ensemble that already included Omikuji Bonsai and MLLM, the ensemble's F1@5 score increased by 1.0 percentage points to 0.3412.

For German documents the results were similar.


I made minor modifications to the code in the branch xtransformer-natlibfi, including the ability to set more hyperparameters via Annif's project configuration.

Some observations:

  • Running the models without a GPU works for inference; it seems the GPU is not used for it (although providing a GPU for the job on Turso does speed up suggest requests, see below):
    • The nvidia-smi tool shows Annif's Python interpreter using some GPU memory, but GPU utilization is 0%.
    • When timing suggest API requests on different platforms for 50 documents (JYX theses) I got these average times per document (four runs over the documents):
      • On the regular server used for training Annif projects (with 6 CPUs):
        • Run averages 1.59, 1.46, 1.48, 1.47 seconds -> total average 1.50 seconds
      • On Turso GPU partition with 6 CPUs (but not provisioning a GPU):
        • Run averages 1.82, 1.90, 1.89, 1.83 seconds -> total average 1.86 seconds
      • On Turso GPU partition with 6 CPUs and A100 GPU:
        • Run averages 1.61, 1.52, 1.56, 1.56 seconds -> total average 1.56 seconds
  • Adding is_predict_only=True when loading the XTransformer model seemed to consistently give about a 5% improvement in performance.
  • About using Transformer models:
    • Sadly ModernBERT models cannot be used with Pecos: the supported models are hardcoded here.
    • When I try to use the HPLT/hplt_bert_base_fi model I get an error message instructing to set trust_remote_code=True. That parameter would need to be passed in Pecos to the various from_pretrained() methods, e.g. here.
    • When loading a model Torch gives a warning about a deprecation: FutureWarning: You are using torch.load with weights_only=False (the current default value)....
  • Logging:
    • When training a project, every log message is duplicated.
    • When Annif is run for serving its API on a platform without a GPU, a warning is logged for every suggest request: CUDA is not available, will fall back to CPU. I think the warning could be turned off by using the use_gpu=False parameter.
  • I noticed Pecos includes a built-in vectorizer, which their material says is quite fast (faster than Scikit-learn's vectorizer); maybe that could be added to Annif.

juhoinkinen avatar Feb 20 '25 12:02 juhoinkinen

Thank you for the constructive discussion and your input on this PR. I appreciate your engagement and collaboration.

I will be going on maternity leave soon and will be back in July. I plan to continue this work upon my return. I apologize for the delay and appreciate your understanding regarding the timeline. I'm looking forward to picking this up and moving it forward once I'm back.

Lakshmi-bashyam avatar Apr 28 '25 06:04 Lakshmi-bashyam

Then the question becomes: should --validate be a required parameter when training XTransformer? Or would the backend in that case perform the split on its own? (perhaps defaulting to e.g. 10% for validation, with another option to override the fraction).

I agree with the idea. I'm already using validation data, and I can incorporate this change as well.

Currently, I perform the train/validation split outside of Annif. We could move the splitting inside Annif, but we would need to ensure that the label distribution remains intact during the split. As far as I know, the train/test split function provided by scikit-learn does not support this natively for multi-label data.
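
For illustration, one way to get a label-distribution-aware split for multi-label data is iterative stratification from the scikit-multilearn package (an extra dependency; this is only a sketch with toy data):

import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

X = np.arange(8).reshape(-1, 1)            # document indices as "features"
Y = np.random.randint(0, 2, size=(8, 3))   # toy multi-label indicator matrix
X_train, Y_train, X_val, Y_val = iterative_train_test_split(X, Y, test_size=0.25)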

Additionally, should we consider isolating the validation feature to the xtransformer backend for now?

Lakshmi-bashyam avatar Apr 28 '25 06:04 Lakshmi-bashyam

A good starting point might be the hyperparameters used in the original paper. They can be found here. Different settings were used for different datasets.

Regarding the parameter configuration: as @katjakon mentioned, a good starting point is to identify the dataset or vocabulary from the examples folder that is closest to the one we are working with. For STW, the EURLex dataset has a similar label set size. I reused the parameters from that dataset where appropriate, depending on the model.

Additionally, I found that adjusting the C_n, C_p, and clustering parameters helped improve model performance. I plan to provide a detailed report on the hyperparameter optimization process and the interaction between parameters at a later stage.

At the moment, our GPU is unfortunately down, which may cause some delay.

Lakshmi-bashyam avatar Apr 28 '25 06:04 Lakshmi-bashyam

Quality Gate failed

Failed conditions
1 Security Hotspot
13.2% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

sonarqubecloud[bot] avatar Jul 10 '25 12:07 sonarqubecloud[bot]

Draft for X-Transformer Wiki Page

I wrote a first draft of a potential Wiki page about X-Transformer, which includes hyperparameters and notes about optimization. This can definitely be extended and modified as this PR evolves. Let me know if you have any notes! Backend-X-Transformer.md

katjakon avatar Aug 06 '25 09:08 katjakon

There were a few changes made just before the Annif 1.4 release that unfortunately caused some conflicts with this PR in annif/util.py (just some import statements) and the dependencies declared in pyproject.toml. These need to be resolved.

In addition, as part of PR #864 the API for some AnnifBackend methods changed a little; documents are no longer passed as text strings but as Document objects. Those changes need to be applied to this backend as well; see the commit b0bb1632936f674d5400e9f94a33de6a048451e1 where the changes were made for other backends.

We would very much like to include this backend in the next minor release Annif 1.5. However, that will take some work elsewhere in the codebase; I think it would make sense to reimplement the NN ensemble to use Pytorch instead of TensorFlow so that we don't have to depend on two very similar and possibly conflicting ML libraries.

osma avatar Sep 16 '25 13:09 osma

The PECOS TF-IDF vectorizer has a significant limitation: it does not allow the use of custom tokenizers. Instead, it relies exclusively on its built-in tokenizer.

Technical Details

The core of this limitation lies in the PecosTfidfVectorizerMixin, which utilizes the PECOS.Vectorizer.train() method.
This method is restricted to a predefined set of parameters such as ngram_range, max_df_ratio, analyzer, and min_df_cnt.

It lacks a mechanism to accept custom tokenizer functions.

  • Despite this limitation, performance tests on the ZBW dataset showed that vectorization is 5× faster compared to the standard TF-IDF method.
  • This performance improvement becomes more pronounced with larger datasets, making it an attractive option for large-scale applications.

In short, we gain a substantial performance improvement (a 5× speedup) at the cost of losing the flexibility to customize tokenization.
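
For context, a rough sketch of how the built-in vectorizer is driven, based on the parameters listed above (the exact import path, config layout and method names may differ between Pecos versions):

from pecos.utils.featurization.text.vectorizers import Vectorizer

corpus = ["first training document", "second training document"]
# Train a TF-IDF model with the built-in tokenizer; only predefined options
# such as ngram_range, max_df_ratio, min_df_cnt and analyzer can be set.
vectorizer = Vectorizer.train(
    corpus,
    config={"type": "tfidf", "kwargs": {"ngram_range": [1, 2], "analyzer": "word"}},
)
features = vectorizer.predict(corpus)      # sparse TF-IDF feature matrix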

Current Implementation

  • I have implemented the PECOS TF-IDF vectorizer and included it for XTransformer only for now.
  • Additionally, I have addressed the document object handling in the suggest method.

Lakshmi-bashyam avatar Sep 16 '25 14:09 Lakshmi-bashyam

@Lakshmi-bashyam Thanks a lot for the changes, and for the information about the vectorizer. I guess we will have to live with its limitations at least for now. At least it is fast!

I see that you fixed some of the recent merge conflicts, but apparently pyproject.toml is still in a conflict state according to GitHub. Can you take a look?

I opened a new issue about reimplementing the NN ensemble backend using Pytorch: #895

osma avatar Sep 16 '25 14:09 osma

Dear @Lakshmi-bashyam and @osma, we came across this issue with the TF-IDF vectorizer in PECOS, too. We can confirm that at least for German it worked reasonably well, in the sense that X-Transformer gives good overall results. So I agree that this is a limitation that one could probably live with.

mfakaehler avatar Sep 17 '25 12:09 mfakaehler

Another topic that I would like to raise is that of dependencies (and I hate to bring it up!). At DNB we are currently developing an Annif backend that uses embedding-based matching; @RietdorfC will give an update on this soon. I have included the dependency specs that we currently aim for here [1]. In particular, this involves transformers v4.52.4. Pecos comes with transformers>=4.31.0 if I have spotted that correctly, so in theory that should be compatible. However, one should probably make sure pecos actually works with an up-to-date version of transformers; I am not sure whether PECOS has kept track of all breaking changes. See also this pull request to pecos for our attempts to update pecos to support more modern model architectures. This suggests that pecos might not be compatible with newer versions of transformers... So we may be running into a problem here. Could you confirm this @Lakshmi-bashyam?

[1] ebm_packages.txt

mfakaehler avatar Sep 17 '25 13:09 mfakaehler

@mfakaehler You’re correct — PECOS hasn’t yet been updated to work with the latest Transformers versions. I’ve also opened an issue with the PECOS team about this: https://github.com/amzn/pecos/issues/311

Currently, PECOS ≥ 1.2.7 can only be used with the constraint transformers<=4.49.0.

There’s also another dependency conflict: Python 3.11 is supported starting from PECOS ≥ 1.2.7, but those versions require scipy<1.14.0, while Annif requires scipy>=1.15.3.

Lakshmi-bashyam avatar Sep 17 '25 13:09 Lakshmi-bashyam

Thanks for the clarification @Lakshmi-bashyam. I am sorry to say that it's not obvious to me what to do about this :(

mfakaehler avatar Sep 18 '25 05:09 mfakaehler

What this all boils down to is that it looks like PECOS is not being very actively maintained and relies on versions of libraries that are about to become obsolete. This is a problem if we want to integrate it with Annif (as in this PR), even as an optional dependency, because of the way PECOS sets upper limits on the versions of important library dependencies. While we could try to adjust every other component to accommodate PECOS, if it's even possible to do so, this would only work for a limited time if PECOS stays as it is. The ecosystem always moves on: new versions of libraries are released (possibly with security fixes!), new Python releases will come with new demands on libraries etc.

So unfortunately I don't see any other way of moving forward than trying to work with the PECOS project on bringing the dependencies up to date on their side. In the worst case, this might mean forking it (or at least important parts of it) and taking over maintenance.

osma avatar Sep 18 '25 07:09 osma

@osma Yeah, I’m with you on this. For now I’ll see if I can update the dependencies without conflicts and send a PR over to the PECOS team.

On the Annif side, at least for this PR, I'll just downgrade the dependencies temporarily until the PECOS team sorts this out.

Lakshmi-bashyam avatar Sep 18 '25 11:09 Lakshmi-bashyam

Quality Gate failed

Failed conditions
11.1% Duplication on New Code (required ≤ 3%)
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud


sonarqubecloud[bot] avatar Oct 07 '25 15:10 sonarqubecloud[bot]