Support Labeled dataset as Ground Truth in Evaluate Function
Currently, the evaluate function of docarray expects a ground truth in the form of a "session dataset". This means that each query in the ground truth needs a list of all its potential matches. In the extreme case where each document matches every other document, each query has a list of all documents in the index. Accordingly, memory consumption grows quadratically with the number of documents.
While this extreme case is somewhat unrealistic, there are many applications where multiple queries share the same label and therefore have exactly the same matches (e.g. duplicate detection), which leads to unnecessarily high memory consumption.
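To make the growth concrete, here is a minimal sketch in plain Python. It does not use docarray; the helper names `session_ground_truth` and `stored_references` are hypothetical, and it assumes a query is not listed as its own match.

```python
from collections import defaultdict


def session_ground_truth(labels):
    """Build a session-style ground truth from a {doc_id: label} mapping.

    Hypothetical helper: every query id maps to the ids of all other
    documents that share its label.
    """
    by_label = defaultdict(list)
    for doc_id, label in labels.items():
        by_label[label].append(doc_id)
    return {
        doc_id: [m for m in by_label[label] if m != doc_id]
        for doc_id, label in labels.items()
    }


def stored_references(ground_truth):
    """Count how many match references the session-style ground truth holds."""
    return sum(len(matches) for matches in ground_truth.values())


# Extreme case: 100 documents that all carry the same label.
labels = {f"doc{i}": "same" for i in range(100)}
ground_truth = session_ground_truth(labels)

print(stored_references(ground_truth))  # 100 * 99 = 9900 references
print(len(labels))                      # 100 labels carry the same information
```

With n documents in a single class, the session-style structure stores n * (n - 1) references, while the label mapping stores only n entries.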
In order to make the evaluation more memory efficient, I propose to allow the evaluator to accept a class dataset (as an alternative to a session dataset). This dataset should be a document array in which each document carries its label as a tag. This way, every document (and a reference to it) only needs to be present once in the ground truth.
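As a rough illustration of the idea, the sketch below scores retrieved matches by comparing labels instead of looking them up in per-query match lists. It is plain Python rather than the docarray API; the function `precision_at_k` and the `{doc_id: label}` mapping are assumptions standing in for a document array with label tags.

```python
def precision_at_k(query_label, retrieved_ids, labels, k):
    """Fraction of the top-k retrieved documents whose label equals the query's.

    `labels` is a hypothetical {doc_id: label} mapping that plays the role
    of the class dataset; each document appears in it exactly once.
    """
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    hits = sum(1 for doc_id in top if labels.get(doc_id) == query_label)
    return hits / len(top)


# Class-style ground truth: one label per document.
class_labels = {"a1": "cat", "a2": "cat", "b1": "dog", "b2": "dog"}

# Matches returned by some search for query "a1", best first.
retrieved = ["a2", "b1", "b2"]

p = precision_at_k(class_labels["a1"], retrieved, class_labels, k=2)
print(p)  # 0.5: of the top two results, only "a2" shares the label "cat"
```

A match counts as relevant whenever its label equals the query's label, so the ground truth never has to enumerate matches per query.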