
BERT Backend

Open lunactic opened this issue 2 years ago • 2 comments

Hello

I am currently working at the Swiss National Library, experimenting with Annif for the automatic generation of Dewey numbers. In the process I have started experimenting with BERT-based approaches as explained here: https://www.sbert.net/examples/applications/semantic-search/README.html#semantic-search

First tests indicate that this approach could work very well. Would this be interesting to the whole Annif community? If so, I could check whether I'll find the time to create a PR that implements this as a backend for Annif.

The approach I would follow is to create the embeddings for the training corpus when annif train is used and store them as a pickle file for later use. The methodology would also allow for "retraining", meaning embeddings of new documents could simply be appended to the existing training corpus.
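For illustration, here is a minimal sketch of what that train-time step could look like, assuming the sentence-transformers library; the model name, file layout and helper functions are placeholders, not a finished Annif backend:

```python
import pickle

import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: any pretrained SBERT model could be used; this name is just an example.
model = SentenceTransformer("all-MiniLM-L6-v2")


def train(corpus_texts, corpus_subject_ids, embeddings_path="embeddings.pkl"):
    """Encode the training documents and persist embeddings + subject ids for suggest time."""
    embeddings = model.encode(corpus_texts, convert_to_numpy=True, show_progress_bar=True)
    with open(embeddings_path, "wb") as f:
        pickle.dump({"embeddings": embeddings, "subject_ids": list(corpus_subject_ids)}, f)


def learn(new_texts, new_subject_ids, embeddings_path="embeddings.pkl"):
    """'Retraining': append embeddings of new documents to the stored corpus."""
    with open(embeddings_path, "rb") as f:
        data = pickle.load(f)
    new_embeddings = model.encode(new_texts, convert_to_numpy=True)
    data["embeddings"] = np.vstack([data["embeddings"], new_embeddings])
    data["subject_ids"].extend(new_subject_ids)
    with open(embeddings_path, "wb") as f:
        pickle.dump(data, f)
```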

lunactic avatar Sep 26 '22 08:09 lunactic

Hello @lunactic , thank you for the suggestion!

There is already some work being done to integrate Annif with language models, mainly by integrating the XTransformer model from PECOS in PR #540 by @mo-fu . But what you propose seems somewhat different.

The idea of semantic search is not new; in fact it is already implemented in the simplest Annif backend, tfidf. Of course it doesn't use a language model: it converts the text from the training documents (aggregated by subject, so e.g. all text related to the subject "cars" is concatenated into a single virtual "document" representing that subject) into tf-idf vector space, and at suggest time the input is converted into a similar vector and the nearest neighbors (subjects) are returned. Conceptually, what you propose seems similar, except that instead of simple tf-idf vectors you would use embeddings from BERT or some other language model.
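As a rough sketch of that conceptual parallel, the suggest step with embeddings instead of tf-idf vectors might look like this (assuming the pickled embeddings and subject ids from the training sketch above and sentence-transformers' built-in semantic search utility; this is not the tfidf backend's actual code):

```python
import pickle

from sentence_transformers import SentenceTransformer, util

# Must be the same model that was used to encode the training corpus.
model = SentenceTransformer("all-MiniLM-L6-v2")


def suggest(text, embeddings_path="embeddings.pkl", limit=10):
    """Embed the input text and return the nearest training subjects with similarity scores."""
    with open(embeddings_path, "rb") as f:
        data = pickle.load(f)
    query_embedding = model.encode(text, convert_to_numpy=True)
    hits = util.semantic_search(query_embedding, data["embeddings"], top_k=limit)[0]
    return [(data["subject_ids"][hit["corpus_id"]], hit["score"]) for hit in hits]
```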

Do you have any idea how accurate this kind of model could be, for example for Dewey classification? Did you compare it with other approaches? I've had quite good results on DDC classification with SVC and Omikuji Bonsai, which both achieve pretty similar accuracies. If your approach (which would undoubtedly be far more resource-intensive) turned out to be more accurate than this "baseline", that would be interesting and would support the idea of integrating it into Annif.

As I understand it, XTransformer is specifically tailored for extreme multi-label classification problems, which are typically very challenging because of large vocabularies (many classes/labels), big training corpora with skewed distributions etc. You may want to look at that as well - the PR is already usable and the documentation for how to use it can be found in the comments on GitHub.

osma avatar Sep 26 '22 09:09 osma

Just wanted to add some reading material for semantic search on dense word vectors:

  • https://arxiv.org/abs/1908.10084 General approach for similarity on word vectors.
  • https://arxiv.org/abs/2109.04404 Investigates which layer of the network to use as the representation and how to diminish the curse of dimensionality.
  • https://few-shot-text-classification.fastforwardlabs.com/ Hands-on description of aligning different embedding techniques (e.g. one for vocabulary terms and one for documents). The details of the technique can be retrieved from the notebooks provided.

As mentioned by @osma, this does not yet handle the label distribution issue of XML problems, but it could probably be combined with the clustering techniques in Parabel/Bonsai. The Omikuji library even has the option to learn only the label tree.

mo-fu avatar Oct 07 '22 09:10 mo-fu