
LLM ranking/scoring backend

Open juhoinkinen opened this issue 7 months ago • 4 comments

Following the success of the DNB-AI-Project (@mfakaehler & Lisa Kluge) at LLMs4Subjects, where an LLM was used to rank candidate subjects, we have been experimenting with a similar approach, incorporated as an Annif backend. ~The code currently is in the branch experiment-llm-ensemble-backend.~ Edit: See PR #859.
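
For context, the core idea is roughly: take the candidate subjects suggested by the source project(s) and ask an LLM to grade each candidate against the document text. A minimal sketch, assuming an OpenAI-style chat API; the prompt, model choice and helper names here are illustrative and not the actual code in the PR:

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def llm_score_candidates(text: str, labels: list[str]) -> dict[str, float]:
    """Ask an LLM to grade each candidate subject label for the given text."""
    prompt = (
        "Score each candidate subject from 0.0 to 1.0 for how well it "
        "describes the document. Answer with one 'label<TAB>score' per line.\n\n"
        f"Document:\n{text[:2000]}\n\nCandidates:\n" + "\n".join(labels)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    scores = {}
    for line in resp.choices[0].message.content.splitlines():
        label, _, score = line.rpartition("\t")
        if label:
            try:
                scores[label.strip()] = float(score)
            except ValueError:
                pass  # skip malformed lines from the model
    return scores
```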

The name should maybe not be "LLM ensemble", as it is now, because the functionality does not require multiple source projects: the LLM could just as well score the subjects given by a single source project (which itself could be a simple ensemble). Possible name alternatives: "LLM ranker", "LLM scorer", "LLM grader".

The chosen name should also leave room for implementing other backends with similar ranking/scoring functionality.

juhoinkinen · May 21 '25 07:05

For Finto AI there was an idea to somehow tweak the suggestions to be more creative. With an LLM, this tweaking could be implemented by tuning the prompt, maybe even in an arbitrary direction chosen by the user ("emphasize keywords that are {USER_GIVEN_EMPHASIS}").
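
A minimal sketch of what such a user-tweakable prompt could look like; the template text and placeholder names are made up for illustration:

```python
# Hypothetical prompt template with a user-supplied emphasis slot.
PROMPT_TEMPLATE = (
    "Score each candidate subject for how well it describes the document. "
    "Emphasize keywords that are {user_given_emphasis}.\n\n"
    "Document:\n{text}\n\nCandidates:\n{candidates}"
)

doc_text = "Report on the effects of emissions trading on heavy industry..."
candidate_labels = ["emissions trading", "climate policy", "steel industry"]

prompt = PROMPT_TEMPLATE.format(
    user_given_emphasis="related to climate policy",
    text=doc_text,
    candidates="\n".join(candidate_labels),
)
```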

juhoinkinen · May 21 '25 11:05

Hi Juho,

great to see that our work has inspired you to implement this new ensemble backend. Since you asked about the appropriate name: we got the idea for the reranker at the end of our pipeline from Karel D'Oosterlinck's work on Infer-Retrieve-Rank. They saw this module as a ranker, so maybe that's an argument for the name LLM ranker. However, I understand that in the logic of Annif the backend would have the role of an ensemble backend.

It will take me some time, but I'm looking forward to looking at your code in detail and giving suggestions (if any). Thanks

mfakaehler · May 21 '25 12:05

My two cents on naming:

Technically, and in the current Annif logic, the LLM ranking backend is an ensemble, like Max said. But as Juho said, the name implicitly suggests that it should be used with multiple source projects, even though it is actually fine to apply it with just one source.

Even the currently implemented Annif ensembles don't actually require more than one source. Using the plain ensemble with one source would be a bit silly (it won't change anything), but the PAV and NN ensembles can actually be useful with just one source, for example by correcting some of the bias of a lexical model such as MLLM.

That said, the ranking term appears to be established (thanks Max for the references!), so I think a simple name such as llm_rank would be most appropriate.

Another similar ranking backend idea would be to use BERT-style cross encoders for (re)ranking, as was done by the TartuNLP team in the first LLMs4Subjects challenge. That would be worth creating its own issue for; but if we limit the discussion here to naming, I think a name such as bert_rank, encoder_rank or xenc_rank could work.
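
For reference, a minimal sketch of the cross-encoder idea using the sentence-transformers library; the model choice is just an example, and this is not the TartuNLP implementation:

```python
from sentence_transformers import CrossEncoder

# A pretrained reranking cross-encoder; any comparable model would do.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(text: str, labels: list[str]) -> list[tuple[str, float]]:
    """Score each candidate subject label jointly with the document text."""
    pairs = [(text, label) for label in labels]
    scores = model.predict(pairs)  # one relevance score per (text, label) pair
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
```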

osma · May 22 '25 08:05

Currently the prompt describing the LLM's task is hard-coded in the backend, but it could be made overridable with a prompt given as a text file, e.g. in the data directory of the project.

This would allow users to provide a prompt in a language matching the documents' language (to try to improve results, and possibly to use a smaller language-specific LLM), and also to tweak the prompt at will.
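
A minimal sketch of the override logic, assuming the file is named prompt.txt and lives in the project's data directory (both the file name and location are assumptions, not a settled design):

```python
from pathlib import Path

DEFAULT_PROMPT = "Score each candidate subject for the document ..."

def load_prompt(datadir: str) -> str:
    """Use prompt.txt from the project data directory if present,
    otherwise fall back to the built-in default prompt."""
    override = Path(datadir) / "prompt.txt"
    if override.exists():
        return override.read_text(encoding="utf-8")
    return DEFAULT_PROMPT
```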

juhoinkinen · May 30 '25 08:05