
Development of an Embedding-based matching backend

Open mfakaehler opened this issue 7 months ago • 7 comments

Dear Annif-Team,

at the DNB, we (@RietdorfC and I) are working on transferring the findings of our AI project into a more production-ready state. In the project we developed a prototype for an embedding-based matching procedure, and we are now looking into implementing it as a backend for Annif. We have therefore created a fork where we will do the development work. Once we have a matured development version, we will open a merge request and kindly ask for your feedback.

So far this is just an announcement to make you aware of our activity. Stay tuned!

Best, Maximilian

mfakaehler avatar May 20 '25 11:05 mfakaehler

Wow, this is great news, and your code repository looks really promising!

Looking forward to hearing more once you get started with the Annif integration.

A few quick questions:

  1. I assume you've benchmarked this approach - how does it compare for example to MLLM, or how much does it contribute when included in an ensemble with other backends such as Omikuji?
  2. Does the method require (in practical use) a GPU during training? At inference time? I assume that the GPU is used by the external Weaviate service and not the Annif-included code directly?
  3. Are you planning to create a new independent Python library (similar to stwfsapy) that Annif could (perhaps optionally) depend on, or are you going to include the full algorithm within the Annif codebase (like MLLM)?

osma avatar May 20 '25 12:05 osma

Hi Osma,

thanks for engaging in this discussion so promptly. We have thought about these questions ourselves:

  1. Yes, we have performed some benchmarking. In our ensemble of Omikuji and MLLM, it contributed roughly a 2% gain in F1 score. Its benefits lie primarily in the long tail, so on its own it will not outcompete Omikuji. Due to the embedding approach the results differ somewhat from those produced by MLLM, but they are not completely disjoint. We think of the EBM approach as a complement to the other approaches, not as something designed to function well on its own. If time permits, I will try to back up these claims with some data later on.
  2. We are about to reconfigure our prototype to work with an in-file vector database (DuckDB), which will get rid of Weaviate; see the sketch after this list. The backend benefits hugely from GPU support, but a GPU is not strictly mandatory for generating embeddings. For training I would definitely recommend using GPUs, but at inference time, at least for short-text applications, I hope a GPU will not be necessary.
  3. Currently we aim to include the full algorithm in the Annif codebase, like MLLM, closely following MLLM's implementation.
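
To make the DuckDB idea in point 2 a bit more concrete, here is a minimal sketch of the kind of in-file vector store we have in mind, using DuckDB's vss extension. The table layout, column names, embedding dimension and distance metric are illustrative assumptions, not our actual schema:

```python
import duckdb

# Single-file vector store; no external service (such as Weaviate) needed.
con = duckdb.connect("vectors.duckdb")
con.execute("INSTALL vss")
con.execute("LOAD vss")

# Embedding dimension 384 is just a placeholder.
con.execute("""
    CREATE TABLE IF NOT EXISTS label_embeddings (
        subject_id VARCHAR,
        embedding  FLOAT[384]
    )
""")

# HNSW index for approximate nearest-neighbour search; persisting it in
# a database file currently requires this experimental setting.
con.execute("SET hnsw_enable_experimental_persistence = true")
con.execute("""
    CREATE INDEX IF NOT EXISTS emb_idx ON label_embeddings
    USING HNSW (embedding) WITH (metric = 'cosine')
""")

def nearest_subjects(query_vec: list[float], k: int = 10):
    # array_cosine_distance is provided by the vss extension
    return con.execute(
        """
        SELECT subject_id,
               array_cosine_distance(embedding, ?::FLOAT[384]) AS dist
        FROM label_embeddings
        ORDER BY dist
        LIMIT ?
        """,
        [query_vec, k],
    ).fetchall()
```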

Best, Max

mfakaehler avatar May 20 '25 13:05 mfakaehler

Hi Osma, I think we could also create an independent library that Annif could use as an (optional) dependency. Is there an option you prefer? Best regards, Clemens

RietdorfC avatar May 20 '25 13:05 RietdorfC

Hi @RietdorfC, now that I think about it, we would prefer an independent library that Annif could use as an optional dependency, especially since, as is likely, your backend will bring several new-to-Annif dependencies, even large ones such as PyTorch. This way, we wouldn't have to review and take responsibility for all the new code, as most of it would live in a library maintained and published by you (DNB); only the backend "glue" code would be in the Annif codebase. This kind of arrangement has worked quite well with the stwfsa backend: ZBW maintains the stwfsapy library, and Annif uses it (through the stwfsa backend) as an optional dependency.

Probably even MLLM should have been done this way, so that it could be used outside Annif. But at the time I implemented it, including it in the main Annif codebase seemed like the simplest option. It's quite a lot of code compared to all the other Annif backends, and ideally that code would live in its own repository.
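
To illustrate what I mean by "glue" code, here is a rough sketch of what such a wrapper backend could look like, loosely modeled on Annif's backend API (details differ between Annif versions, and the ebm4subjects names below are placeholders, not a real API):

```python
from annif.backend import backend
from annif.exception import NotInitializedException
from annif.suggestion import SubjectSuggestion


class EbmBackend(backend.AnnifBackend):
    """Thin wrapper around a hypothetical external ebm4subjects library."""

    name = "ebm"
    _model = None

    def initialize(self, parallel=False):
        if self._model is None:
            # Imported lazily so that Annif still works when the
            # optional dependency is not installed.
            from ebm4subjects import EbmModel  # hypothetical import
            self._model = EbmModel.load(self.datadir)

    def _suggest(self, text, params):
        if self._model is None:
            raise NotInitializedException("EBM model not initialized")
        # match() and its return shape are assumptions for this sketch
        hits = self._model.match(text, limit=int(params["limit"]))
        return [SubjectSuggestion(subject_id=sid, score=score)
                for sid, score in hits]
```

On the packaging side, the library would then be declared as an optional extra in Annif's pyproject.toml, so users could pull it in with something like pip install annif[ebm] (again, the extra name is hypothetical).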

osma avatar May 20 '25 16:05 osma

Dear Annif-Team, Hi @osma,

I would like to give you a brief report on the current status of our project. We have now developed a fully functional version of our embedding-based matching process as an independent library. You can find our Git repository here and the PyPI page of our project here.

Furthermore, we have developed an ebm backend for Annif, which you can find in our fork of Annif here. This version is a first prototype and some changes may still be necessary; e.g., the default parameter values are not yet final, and tests for the backend are still missing. We have performed some small tests, and annif train and annif suggest seem to be running fine. We are about to test the entire package in the Annif context and calculate some metrics, and we will report back once the tests are complete.
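
For anyone who wants to try it, a projects.cfg entry could look roughly like the following. The project IDs, vocabulary, parameter names and ensemble weights are placeholders for illustration; the actual parameters and defaults are still being worked out in the fork:

```ini
[ebm-de]
name=EBM German
language=de
backend=ebm
vocab=gnd
limit=100

[ensemble-de]
name=Omikuji + MLLM + EBM ensemble
language=de
backend=ensemble
vocab=gnd
sources=omikuji-de:2,mllm-de:1,ebm-de:1
```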

If you have any remarks about our project or the backend, please let us know.

Best regards

RietdorfC avatar Oct 08 '25 11:10 RietdorfC

Thanks @RietdorfC , this is great!

Please open a pull request (maybe a draft PR, since you don't have tests yet) from your Annif fork as soon as you feel ready to do so. That would also make it easier for us to comment on implementation details on the Annif backend side.

I'm also curious about the evaluation results, eager to hear more on that :)

Does using the backend require a GPU or is it possible to run it without one (even if it's slower)?

I'm also wondering about the dependencies of ebm4subjects: are they all strictly necessary? For example, pandas is a pretty large library IIRC, and I wonder if it's really required; usually it's used for interactive data exploration.

osma avatar Oct 08 '25 13:10 osma

Hi Osma, thanks for your comments. Regarding the GPU question: for short texts it should be feasible to use the backend without a GPU, so a GPU is not a strict requirement.

The process has to compute embeddings, sentence-wise, for your entire training corpus. For example, with 10,000 training docs of 100 sentences each, you end up with about 1M embeddings to compute when training the backend, which puts us in "GPU recommended" territory. Still, it would be possible to do without a GPU, if running the process for about 24 hours is acceptable.
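
As a rough illustration of that workload, assuming a sentence-transformers-style model (the actual model, batch size and throughput in ebm4subjects may differ):

```python
from sentence_transformers import SentenceTransformer

# device="cuda" if a GPU is available; "cpu" also works, just much slower.
model = SentenceTransformer(
    "paraphrase-multilingual-MiniLM-L12-v2", device="cpu"
)

# In practice this list would hold every sentence of the training corpus.
sentences = ["Ein Beispielsatz.", "Noch ein Satz."]
embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)
print(embeddings.shape)  # (number of sentences, embedding dimension)
```

At an assumed CPU throughput of 10-50 sentences per second, a million sentences would take roughly 6-28 hours, which is consistent with the 24-hour ballpark above; a GPU typically cuts this by one to two orders of magnitude.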

We will look into the dependency question. We make heavy use of Polars in essential parts of the package; maybe we can drop pandas and switch to Polars entirely.
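
For illustration, this is the kind of switch we have in mind (the column names and the aggregation are invented for the example):

```python
import polars as pl

df = pl.DataFrame({
    "subject_id": ["a", "a", "b"],
    "score": [0.9, 0.4, 0.7],
})

# pandas equivalent: df.groupby("subject_id")["score"].max().reset_index()
top = df.group_by("subject_id").agg(pl.col("score").max())
print(top)
```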

mfakaehler avatar Oct 08 '25 14:10 mfakaehler