Flexible fusion backend
With Annif, it is possible to use several specialised models for prediction in an ensemble. However, all models in an Annif ensemble can only be given one single kind of text for prediction; it is not possible to pass a different kind of text to each individual model. Currently, the only way to adapt the text for prediction is the transform parameter, which lets us read either a limited number of characters from the beginning or all of the text. A parameter that allowed us to set a specific range (from character x to character y) to be read from the text would give us an additional way to specify/cut down the text for specific models.
Could the Annif ensemble functionality be extended in such a way that the individual models of an ensemble could be given different kinds of text (expressions of a document) for processing?
Another way to make the ensemble functionality more flexible would be the use of subsets of vocabularies for individual models in ensembles, as discussed in issue #596.
In any case, the interface for the predictions needs enhancements. In the following we describe some ideas we already discussed a few weeks ago:
1. Allow submitting text as structured data, either as JSON:

   `-d '{"headline": "Wonderful", "fulltext": "Oh, what a wonderful world"}'`

   or using XML tags:

   `-d '<headline>Wonderful</headline><fulltext>Oh, what a wonderful world</fulltext>'`

   ... with the possibility to define the tags at the right places in projects.cfg, like:

   `submitted_text=headline,' ',fulltext`

   The space between "headline" and "fulltext" defines the character(s) used to glue together the parts of the text to submit.

   `submitted_text=headline,'.',toc`

   Here a headline and a TOC are submitted; they are joined with a ".".
2. An approach on the way to allow fusion is an enhancement of the limit parameter, which would let us define the part of the submitted text to use. In projects.cfg we only need a small enhancement that defines the starting point and the number of characters to process (see the sketch below):

   `transform=limit(500,2000)`
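A minimal sketch of how such a two-argument limit transform could behave (hypothetical; Annif's existing limit transform takes a single length argument):

```python
# Hypothetical two-argument limit transform: return `length` characters
# of the input text, starting at character `start`.
def limit(text: str, start: int, length: int) -> str:
    return text[start:start + length]

print(limit("Oh, what a wonderful world", 4, 6))  # -> 'what a'
```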
We (and, we think, the whole community) would really benefit from the implementation of fusion with freely configurable structured data as in (1). We have to admit that the use of structured data would be our favourite and the cleanest implementation.
Best regards, Christoph, Frank, Jan-Helge and Sandro from the German National Library
A third possible approach to passing different variants/subsets of a full text to different backends, a kind of fusion of approaches (1) and (2):

- Add a new `select(<tag>)` transform, which would retain only the input text between the given tags. The tags themselves should be removed when transforming the text.
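A minimal sketch of how such a `select(<tag>)` transform could behave (an illustration assuming simple, non-nested tags, not actual Annif code):

```python
import re

# Hypothetical select transform: keep only the text between the given
# tags and drop the tags themselves.
def select(text: str, tag: str) -> str:
    matches = re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return " ".join(matches)

doc = "<headline>Wonderful</headline><fulltext>Oh, what a wonderful world</fulltext>"
print(select(doc, "headline"))  # -> 'Wonderful'
```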
@c-poley et al., how would you identify the different parts of the text? Would you use some existing metadata of a document to get e.g. the headline, or would a user manually input it separately in your workflow when inputting the text/file?
I started to wonder whether there could be a transform that also detects and tags particular parts of texts, e.g. title, abstract, TOC, authors, publishers, etc. (it could be advantageous to deselect and remove the authors and publishers from the text when performing subject indexing, so there could be a `deselect(<tag>)` transform too).
Well, the third possibility can also fulfil our requirements. What may be important is how the text parts we use are connected; we call this "text glue". Otherwise, it becomes our own job to append one more character to the end of the headline, or similar. For our purposes, we have information like "headline", "toc", "blurb" or "fulltext" available separately.
But the idea mentioned at the end of your answer could become very interesting. With such a feature, Annif would move from a toolbox for automatic suggestions to a toolbox that makes it possible to identify structures in plain text with the help of algorithms and perhaps AI magic. Maybe it will become a research feature, because low-error structure extraction requires a lot of knowledge about the plain text (or about what we think is text). One of my colleagues is currently looking more closely at text quality. Maybe it will help to get better suggestions; maybe we will get some side effects.
Maybe I got carried away with the identification of text parts; there are dedicated packages for this, and an Annif workflow could just use such software via its API to analyse the input documents.
Annif itself could then use either the approach (1) or (3) to pass different parts of the texts to different backends (or do some cleaning of the text, like for the mentioned authors).
There is OmniDocBench, which benchmarks different document parsing software.
Some possibly useful software for such PDF layout analysis are:
- MinerU:
  - Can detect e.g. title and main text.
  - An online demo here.
  - Uses PDF-Extract-Kit ~~and DocLayout-YOLO~~ (processes images only), which could be directly useful too.
- PaperMage:
  - Aimed at scientific publications.
  - Can detect many parts of the text, importantly the abstract.
  - A demo here.
- Docling:
  - Supports many formats (PDF, DOCX, PPTX, images, HTML, AsciiDoc, Markdown).
  - Detects quite many parts, like title, headers and footers, etc.
- Surya:
  - An OCR toolkit with layout analysis.
  - Cannot identify as many parts as PaperMage, but still title, section and page headers.
  - A demo here.
  - marker is a pipeline of deep learning models that includes Surya.
- Parsr:
  - Detects headings, tables, lists, table of contents, page numbers, headers/footers, links.
- LayoutParser:
  - Apparently meant for documents in image format, not for PDFs?
  - The detected text parts apparently depend on the model used.
  - A demo in the documentation.
- PDF-Extract-Kit:
  - (Seems quite promising?)
Anyway, the benefit of passing different variants of a document text to different backends should be evaluated before committing too much time to the implementation. It would be good if someone could experiment with this!
I've been thinking about this proposed feature and how it could be implemented. Unlike the last few comments that mainly discuss extraction tools, my view is that Annif itself should not try to magically infer titles, descriptions, body texts etc. from documents - that is better left to more specialized tools. Nor should Annif enable complicated string operations for text that includes some ad-hoc custom tags or separators - it's possible, but quickly gets quite ugly IMHO.
Instead, I think that Annif should broaden its data model of a document so that it's not seen just as a single piece of text, but something a bit more structured - text plus optional, arbitrary key/value metadata that could be titles, descriptions, authors or whatever. Where those come from should be up to the user - maybe the metadata already exists in metadata records, maybe it can be extracted using AI tools from PDFs, or whatever. But Annif should be able to read and process this extra information and projects should be able to be selective in what part(s) of the input document they use as their input.
The challenge then becomes how to extend Annif to broaden its representation of a document, both in internal data structures and in interactions with the file system, CLI, REST API, web UI etc.
Here are my thoughts on what would be necessary:
Internal data structures
- the namedtuple `Document` should be extended: currently it defines the fields `text` and `subject_set`; there should be an additional field `metadata` which is a dictionary, e.g. `{'title': 'As We May Think', 'author': 'Bush, Vannevar'}` (see the sketch after this list)
  - NOTE: Implemented in PR #864
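For illustration, a minimal sketch of the extended structure (the actual definition in the Annif codebase may differ in its details):

```python
from collections import namedtuple

# Sketch: Document extended with an optional metadata dictionary.
Document = namedtuple("Document", "text subject_set metadata")

doc = Document(
    text="Consider a future device for individual use ...",
    subject_set=None,  # in Annif this would be a SubjectSet
    metadata={"title": "As We May Think", "author": "Bush, Vannevar"},
)
print(doc.metadata["title"])  # -> 'As We May Think'
```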
Project configuration
- by default, projects would operate (in train, suggest and learn operations) on only the text and ignore the metadata
- NOTE: Implemented in PR #864
- this could be changed per project using a new transform, e.g. `select(title)`, `select(title,description)` or `select(title,text)`, that would select the part(s) of the document to use: text, metadata fields, or combinations of them. If several fields are given, the text in those fields would be concatenated (see the sketch after this list).
  - NOTE: Implemented in PR #864
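A rough sketch of what such a select transform could do with a document (an illustration based on the description above, not the actual PR #864 implementation):

```python
# Hypothetical field selection: concatenate the chosen parts of a
# document (text and/or metadata fields) into one input string for the
# backend.
def select_fields(text: str, metadata: dict, fields: list) -> str:
    parts = []
    for field in fields:
        if field == "text":
            parts.append(text)
        elif field in metadata:
            parts.append(metadata[field])
    return " ".join(parts)

print(select_fields(
    "Oh, what a wonderful world",
    {"title": "Wonderful"},
    ["title", "text"],
))  # -> 'Wonderful Oh, what a wonderful world'
```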
Corpus formats
- The fulltext corpus format (TXT + TSV) does not have a natural place for metadata, because the txt file is just raw unstructured text. But we could either add a third file with metadata (in a format such as YAML, JSON or CSV?) or add support for YAML front matter, which is often used with Markdown files, for example on GitHub. This way, a full text document could also include metadata within the same txt file (see the first sketch after this list).
- The short text TSV corpus format also isn't naturally extensible with arbitrary metadata. But we could define a new CSV corpus format that includes a header row, similar to the multilingual CSV vocabulary format that was introduced in Annif 0.59. The header row could define arbitrary metadata fields (columns) in addition to `text` and `subject_uris`, for example `title` or `description` columns (see the second sketch after this list).
  - NOTE: Implemented in PR #863 (basic CSV corpus format) and #864 (extensible with metadata)
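For illustration, a full text file with YAML front matter could be split into metadata and body roughly like this (the exact front matter syntax Annif would accept is an assumption):

```python
import yaml  # PyYAML; assumed available for this sketch

raw = """---
title: As We May Think
author: Bush, Vannevar
---
Consider a future device for individual use ...
"""

# Split the YAML front matter off from the document body.
_, front, body = raw.split("---\n", 2)
metadata = yaml.safe_load(front)  # {'title': ..., 'author': ...}
text = body.strip()
print(metadata["title"], "/", text)
```

And a sketch of reading a CSV corpus with a header row (column names beyond text and subject_uris are illustrative):

```python
import csv
import io

corpus = io.StringIO(
    "text,subject_uris,title\n"
    '"Oh, what a wonderful world",<http://example.org/s1>,Wonderful\n'
)
for row in csv.DictReader(corpus):
    print(row["title"], "->", row["subject_uris"])
```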
CLI
- The `annif suggest` command should include a way to specify metadata along with the text that comes from stdin, perhaps with extra command line options (see the example after this list).
  - NOTE: Implemented in PR #866
- Naturally the CLI must support the new CSV corpus format in relevant operations (train, eval, learn...)
- NOTE: Implemented in PR #863
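For illustration, a hypothetical invocation could look like `annif suggest my-project --metadata title="Wonderful" < document.txt`; the actual option name and syntax introduced in PR #866 may differ.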
REST API
- The `suggest`, `suggest_batch` and `learn` methods should be extended so that arbitrary key/value metadata can be included in the request (see the sketch after this list).
  - NOTE: Implemented in PR #867
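A sketch of what such a request could look like from Python (the endpoint path follows the current REST API, but the way metadata is encoded here is an assumption, not necessarily what PR #867 implements):

```python
import requests

# Hypothetical: pass metadata alongside the text in the suggest request.
resp = requests.post(
    "http://localhost:5000/v1/projects/my-project/suggest",
    data={
        "text": "Oh, what a wonderful world",
        "metadata": '{"title": "Wonderful"}',  # encoding is an assumption
    },
)
print(resp.json())
```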
Web UI
- The web UI should include a way to optionally add one or more metadata key/value pairs to be included in the suggest request. This could be a dynamic element: for example, clicking on a `+` icon adds a new row with key and value fields.
Notes
Note that none of this seems to require any major changes to individual backends, which I think is a good thing, because there are so many backends. The backends can still work with just one simple string of text at a time, there is just more flexibility in where that text comes from.
Configuration example
Here is how this could be used to configure an ensemble that processes titles separately from the main text using two different Omikuji projects (trained on titles only and fulltexts, respectively), and averages their output:
[gnd-ensemble-en]
vocab=gnd
language=en
backend=ensemble
sources=gnd-omikuji-title-en,gnd-omikuji-text-en
[gnd-omikuji-title-en]
vocab=gnd
language=en
backend=omikuji
analyzer=snowball(english)
transform=select(title)
[gnd-omikuji-text-en]
vocab=gnd
language=en
backend=omikuji
analyzer=snowball(english)
transform=select(text) # not really necessary since this is the default...
Thoughts?
Hi,
thank you for your fundamental and consistently modelled thoughts about a possible solution for the proposed flexible fusion concerning Annif. Generally, we think that it is a great idea to train, validate and submit clean structured data in order to enable a flexible fusion approach.
We need flexible fusion to avoid unfavourable model bias and loss of performance in productive use, as we described in our paper “Automatic Subject Cataloguing at the German National Library” (https://doi.org/10.53377/lq.19422, p. 15).
In the following we have two remarks and thoughts. They only form a first statement and may be incomplete, given the complexity of the topic. The long summer holidays prevent a full discussion for the moment :o)
Corpus formats

We think that the proposed extension of the formats makes sense and provides the needed flexibility. Currently, we prefer the short text TSV format to prevent inode problems on our hard disks. It follows that we need a new CSV corpus format for that. Maybe another format style will also be suitable; the CSV table can get very complex. It is also important to determine the order of the data fields for training, validating and testing (headline, fulltext, …).
Vocabulary

When we think about a flexible fusion solution, it doesn't just depend on the training material or the text data for suggestions. Flexible fusion should also be flexible with regard to the vocabularies used. It would be better to make it possible to mix them, as discussed in issue #596.

As an example, we have a fusion of an Omikuji and an MLLM project. For MLLM, for instance, we want to remove descriptors that produce suboptimal results. In consequence, we have a reduced vocabulary there that we want to fuse. By the way, the vocabularies used may also be completely different, as long as the descriptors are unique.
For the following development process, we would be very happy to support you as alpha testers. If you have any remarks or questions, please never hesitate to spam us ;-)
Best regards,
Christoph
Thank you @c-poley for your helpful comments!
> Corpus formats
>
> We think that the proposed extension of the formats makes sense and provides the needed flexibility. Currently, we prefer the short text TSV format to prevent inode problems on our hard disks. It follows that we need a new CSV corpus format for that. Maybe another format style will also be suitable; the CSV table can get very complex. It is also important to determine the order of the data fields for training, validating and testing (headline, fulltext, …).
Good to hear! An initial CSV corpus format has now been implemented in PR #863 (not merged yet), and extended in PR #864 to add more flexibility, including support for custom header fields and the select transform to select what fields are given to the backend.
In this solution, the order of the fields within the CSV file doesn't matter to Annif (unlike in the TSV format where columns have a fixed meaning), as the relevant columns are identified by the header row. The select transform determines which fields will be given to the backend as well as their order.
> Vocabulary
>
> When we think about a flexible fusion solution, it doesn't just depend on the training material or the text data for suggestions. Flexible fusion should also be flexible with regard to the vocabularies used. It would be better to make it possible to mix them, as discussed in issue https://github.com/NatLibFi/Annif/issues/596.
>
> As an example, we have a fusion of an Omikuji and an MLLM project. For MLLM, for instance, we want to remove descriptors that produce suboptimal results. In consequence, we have a reduced vocabulary there that we want to fuse. By the way, the vocabularies used may also be completely different, as long as the descriptors are unique.
Yes, indeed. As it happens, we've also been working on PR #846 which (when merged) brings support for exclude/include rules to define per-project subsets of vocabularies. I wrote more about this in https://github.com/NatLibFi/Annif/issues/596#issuecomment-3135932206 - I hope this is helpful for you too! Basic exclude support to remove/block individual problematic concepts from the vocabulary was already implemented earlier ~~and this was released in Annif 1.3.0.~~ EDIT: will be in the 1.4 release.
> For the following development process, we would be very happy to support you as alpha testers. If you have any remarks or questions, please never hesitate to spam us ;-)
Well, if you are willing to test the above mentioned PRs, this would be a great time to do it! They have not been merged yet, so it's possible to make changes quite flexibly. Once the functionality is in a release, it gets more difficult to change.
Hi Osma,
> Well, if you are willing to test the above mentioned PRs, this would be a great time to do it! They have not been merged yet, so it's possible to make changes quite flexibly. Once the functionality is in a release, it gets more difficult to change.
as a short reply: Yesterday, we spoke in our team about testing the new flexible fusion possibilities of Annif. Great idea, let's just get started before you get the idea to roll out new features :smiley:. From the technical point of view, @RietdorfC will be your direct contact. Meanwhile, we have begun to think about suitable use cases ...
Best regards, Christoph
Hi @osma, @san-uh and I finished our tests and the test report on the flexible fusion backend. You will find our report attached.
Thanks a lot to you and your team for making the flexible fusion possible! Best regards, Clemens
@RietdorfC @san-uh Thanks a lot for the thorough testing and reporting! It seems that you were able to achieve better quality with the new functionality, and didn't discover any issues in the implementation. This is great news!
As it happens, support for metadata in the JSON fulltext corpus format (which can be used with annif index) was just merged to main - see PR #872 . So that missing piece is now in place too.
The only remaining issue is support for metadata in the Web UI - maybe a separate issue should be opened for that, so that we can close this issue, which is getting rather long.
Hi @osma, we created JSONL short text corpora and briefly tested PRs #876 and #877. Our installation is based on Annif 1.4.0-dev as of 26 August 2025.
About Add JSONL short text corpus format #876
An omikuji full text model and an omikuji title model (using transform=select(title)) were trained using the JSONL short text corpus format, and both models were evaluated as an ensemble. No incidents occurred; the models work as expected and work with the text assigned to them by default or via transform=select(title). The results are valid.
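For reference, a single line of such a JSONL corpus could look like the following; text and metadata are the fields named in the PR, while the exact shape of the subjects field is illustrative:

```python
import json

# One JSONL corpus line as used in our tests (the shape of "subjects"
# is illustrative only).
line = json.dumps({
    "text": "Oh, what a wonderful world",
    "metadata": {"title": "Wonderful"},
    "subjects": [{"uri": "http://example.org/s1"}],
})
print(line)
```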
About Add annif index-file CLI command #877
Using the models described above, we tested the new index-file command and various options. The new command also works as described and without errors. The many variants of the output are interesting and will probably be very welcome to Annif users.
However, we have a remark regarding the -O, --output option: Regardless of the file extension (e.g., .tsv or .csv), JSONL is always written (as explained in PR #877). It would be a very useful improvement for users if it were possible to specify the output format of the process. So if --output result.tsv is specified, a TSV file is written. Since the TSV format is already the output format of the index command, this would allow the output of the index-file command to be processed in the same manner as the output of the index command, which would certainly be much easier for many Annif users.
A note on the JSON and JSONL short text corpus formats: If the format is to be used for testing and experimentation (and not via the API interface), it may be necessary from the user's point of view to include an identifier (document_id) in the JSON/JSONL files in addition to text, metadata, and subjects, to ensure that this information is still available after processing. This is particularly the case when the annif index-file command is applied to a JSONL corpus. The problem occurred to us when we added information about the document id (like "document_id": "123") to each line of the JSONL files. We received the following warning:
warning: JSON validation failed for file json_shorttext_corpora/test/corpus.jsonl: Additional properties are not allowed (“document_id” was unexpected)
In our opinion, including the identifier in the JSONL format is necessary and could be useful for all users. We therefore propose to allow the identifier as an optional field in the JSON schema and to return it with the other metadata, if present.
A brief additional side note for your information: We wondered whether it is possible to combine transform:select and transform:limit, as this is one of the ways we want to use the transform parameter for a single project in an ensemble, alongside the new transform:select parameter. The (expected) answer is: It is possible to use the two functions transform:select and transform:limit together. You can find the test series attached in this file: combination-transform-select-limit.pdf.
Many thanks and best regards, Sandro (@san-uh) and Clemens
Thanks again @RietdorfC (and of course @san-uh and @c-poley) for your findings and proposals! Brief comments:
> An omikuji full text model and an omikuji title model (using transform=select(title)) were trained using the JSONL short text corpus format, and both models were evaluated as an ensemble. No incidents occurred; the models work as expected and work with the text assigned to them by default or via transform=select(title). The results are valid.
Great!
> Using the models described above, we tested the new index-file command and various options. The new command also works as described and without errors. The many variants of the output are interesting and will probably be very welcome to Annif users.
Good. We might still fold this functionality to the annif index command before the 1.4 release - let's see if we can make it work in a sensible way. It would be better to have just one index command instead of two different variants.
> However, we have a remark regarding the -O, --output option: Regardless of the file extension (e.g., .tsv or .csv), JSONL is always written (as explained in https://github.com/NatLibFi/Annif/pull/877). It would be a very useful improvement for users if it were possible to specify the output format of the process. So if --output result.tsv is specified, a TSV file is written. Since the TSV format is already the output format of the index command, this would allow the output of the index-file command to be processed in the same manner as the output of the index command, which would certainly be much easier for many Annif users.
I see. This (supporting TSV and/or CSV output) is also roughly what was originally proposed in https://github.com/NatLibFi/Annif/issues/639#issue-1431661285 . The problem with this is that TSV and CSV are very flat formats: N rows with M columns. That is OK as long as there are no multiple values per field (the short-text TSV format is already stretching this limit, with multiple subject URIs separated by whitespace). But there can be many subject suggestions per file, and they come with labels, URIs, scores... this becomes difficult to squeeze into a flat structure. One obvious solution (as proposed in #639) is to have multiple rows per document, with a document_id column identifying the document. But this is no longer exactly the same as the old TSV format that annif index produces, because there is now at least one more column!
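For example, such a multi-row output could look like this (columns document_id, URI, label, score; purely illustrative):

doc1	<http://example.org/s1>	first subject	0.85
doc1	<http://example.org/s2>	second subject	0.42
doc2	<http://example.org/s3>	third subject	0.77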
If you have a use case for this, feel free to open a new issue with a proposal of how the CSV and/or TSV output would look.
> A note on the JSON and JSONL short text corpus formats: If the format is to be used for testing and experimentation (and not via the API interface), it may be necessary from the user's point of view to include an identifier (document_id) in the JSON/JSONL files
Yes, this is a good idea - and one that I considered during the implementation of index-file but decided then to leave for later. But since you also proposed it, I created a new issue to track this.
> A brief additional side note for your information: We wondered whether it is possible to combine transform:select and transform:limit, as this is one of the ways we want to use the transform parameter for a single project in an ensemble, alongside the new transform:select parameter. The (expected) answer is: It is possible to use the two functions transform:select and transform:limit together. You can find the test series attached in this file: combination-transform-select-limit.pdf.
Yes, it is possible to combine multiple transforms in a row, separated by commas, as you already found out. This is also mentioned at the top of the Transforms wiki page and used in the examples on that page.
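For example, a project configuration could combine the two as `transform=select(title),limit(500)`, which would first select the title field and then truncate the result to 500 characters.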
FYI @RietdorfC @c-poley @san-uh, the separate annif index and annif index-file commands have been consolidated into a single annif index command (see #889) and support for document_id has been added (see #885 and #886).
We will proceed with the 1.4 release next week. The code is ready (I think), some documentation still needs to be updated.
Hi Osma,
great to read that you plan to release the next Annif version in a few days - fantastic! We have already started to plan the changes in our EMa.
Regarding the JSON format, we have thought a little about how to work with the "text" identifier and the metadata block in this format. Maybe we will put our fulltexts, TOCs and, later, chapters all together into the metadata block, because from the technical point of view we always work with strings to train the models. The semantic aspect is in our brains 😊. Looking at transform=select(...,...) in projects.cfg, metadata and text are already listed without any hierarchy.
Have a nice weekend, Christoph
I am closing this issue because I think that all, or at least most, of the important aspects have been implemented on the main branch, which will be released as 1.4 soon.
If you feel that something is still missing, please open a new more specific issue.