Support Labeled dataset as Ground Truth in Evaluate Function

Open guenthermi opened this issue 3 years ago • 1 comments

Currently, the evaluate function of docarray expects the ground truth in the form of a "session dataset". This means that each query in the ground truth needs a list of all of its potential matches. In the extreme case where each document matches every other document, each query carries a list of all documents in the index, so memory grows quadratically with the number of documents.

While this extreme case is a bit unrealistic, there are many applications where multiple queries have the same label and therefore exactly the same matches (e.g. duplicate detection), which leads to unnecessarily high memory consumption.
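To make the growth concrete, here is a minimal sketch in plain Python (it does not use docarray). The helper names `session_ground_truth` and `stored_references` are hypothetical, and it assumes a query is not listed as its own match.

```python
from collections import defaultdict


def session_ground_truth(labels):
    """Build a session-style ground truth: query id -> ids of all matching docs.

    `labels` maps a document id to its label. Hypothetical helper for
    illustration only; assumes a document does not match itself.
    """
    by_label = defaultdict(list)
    for doc_id, label in labels.items():
        by_label[label].append(doc_id)
    return {
        doc_id: [other for other in by_label[label] if other != doc_id]
        for doc_id, label in labels.items()
    }


def stored_references(ground_truth):
    """Count how many document references the session dataset holds."""
    return sum(len(matches) for matches in ground_truth.values())


# Extreme case: every document shares one label, so each query lists all others.
labels = {f"doc{i}": "same" for i in range(100)}
session = session_ground_truth(labels)

print(len(labels))                 # 100 entries in the label mapping
print(stored_references(session))  # 9900 references, i.e. n * (n - 1)
```

With n documents under a single label, the session form stores n * (n - 1) references, while the label mapping stores n entries.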

In order to make the evaluation more memory efficient, I propose to allow the evaluator to accept a class dataset (as an alternative to a session dataset). This dataset should be a document array which contains a label for each document in the form of a tag. In this way, every document (and a reference to it) only needs to be present once in the ground truth.
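A rough sketch of how evaluation against such a class dataset could work, in plain Python rather than the docarray API. Documents are modelled as dicts, the tag key `"label"` and the function `precision_at_k` are assumptions made for illustration, and precision at k is just one possible metric.

```python
# Class dataset: each document appears once and carries its label as a tag.
# The tag key "label" is an assumption, not something docarray prescribes.
class_dataset = [
    {"id": "q1", "tags": {"label": "cat"}},
    {"id": "d1", "tags": {"label": "cat"}},
    {"id": "d2", "tags": {"label": "dog"}},
    {"id": "d3", "tags": {"label": "cat"}},
]

# One lookup entry per document, instead of one match list per query.
label_of = {doc["id"]: doc["tags"]["label"] for doc in class_dataset}


def precision_at_k(query_id, retrieved_ids, label_of, k):
    """Fraction of the top-k retrieved documents that share the query's label."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    query_label = label_of[query_id]
    hits = sum(1 for doc_id in top_k if label_of[doc_id] == query_label)
    return hits / len(top_k)


# Hypothetical ranking returned by a search for query "q1".
retrieved = ["d1", "d2", "d3"]
print(precision_at_k("q1", retrieved, label_of, k=2))  # 0.5
```

Relevance is decided by comparing labels at evaluation time, so no per-query match list has to be stored.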

Sep 14 '22 09:09 guenthermi
