Implement hierarchical precision, recall and F1 scores
The existing evaluation metrics (e.g. F1 and NDCG) are not ideal for evaluating hierarchical classification problems, e.g. assigning DDC or YKL classes, because they don't consider the hierarchy of the classification. An answer that is just one level above or below the "correct" answer is considered just as bad as an answer that is in a completely different branch of the hierarchy.
Many measures have been proposed to better account for the hierarchy, but I think the hierarchical precision, recall and F1 scores (hP, hR and hF) from this paper would be a good choice:
Kiritchenko, S., Matwin, S., & Famili, A. F. (2005, June). Functional annotation of genes using hierarchical text categorization. In Proc. of the ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics.
They are also used in this Master's thesis dealing with automated Thema classification: Reinaudo, Alice: Hierarchical text classification of fiction books with Thema subject categories (Linköping University 2019)
The implementation should be relatively straightforward: the existing functions for calculating precision/recall/F1 can be reused, but both the set of suggested subjects and the set of gold standard subjects first have to be expanded to include their parent (broader) concepts/classes, all the way up to the top level of the hierarchy. However, Annif currently knows very little about the vocabulary hierarchy (except for the STWFSA algorithm, but there it's only processed within the stwfsapy library), so some support for accessing the hierarchy needs to be added.
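To illustrate the expansion step, here is a minimal per-document sketch. It is not based on any existing Annif API: the `broader` dict (concept to list of parent concepts), the function names and the DDC-like notations are all made up for the example. A real implementation would probably also aggregate the counts over all documents, which this sketch does not do.

```python
def expand_with_ancestors(subjects, broader):
    """Return the subjects plus all of their ancestors.

    `broader` is a hypothetical mapping: concept -> list of parent concepts.
    """
    expanded = set()
    stack = list(subjects)
    while stack:
        concept = stack.pop()
        if concept in expanded:
            continue  # already visited (also guards against cycles)
        expanded.add(concept)
        stack.extend(broader.get(concept, ()))
    return expanded


def hierarchical_scores(gold, suggested, broader):
    """Compute (hP, hR, hF) for one document from ancestor-expanded sets."""
    gold_exp = expand_with_ancestors(gold, broader)
    sugg_exp = expand_with_ancestors(suggested, broader)
    overlap = len(gold_exp & sugg_exp)
    hp = overlap / len(sugg_exp) if sugg_exp else 0.0
    hr = overlap / len(gold_exp) if gold_exp else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf


# Toy hierarchy with invented DDC-like notations (assumption, for illustration)
broader = {
    "512.5": ["512"],
    "512": ["510"],
    "516": ["510"],
    "510": ["500"],
    "780": ["700"],
}

# Suggestion one level above the gold class: partially rewarded
near = hierarchical_scores({"512.5"}, {"512"}, broader)
# Suggestion in a completely different branch: no overlap at all
far = hierarchical_scores({"512.5"}, {"780"}, broader)
```

With flat precision/recall both suggestions would score zero, whereas here the near miss gets hP = 1.0 and hR = 0.75 and the suggestion from the unrelated branch still gets zero on all three measures.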
This has changed a bit now since the MLLM backend was implemented, as it also makes use of the vocabulary hierarchy. New methods for accessing the hierarchy are in annif/lexical/util.py.