
trec_eval

Open • malinphy opened this issue 1 year ago • 5 comments

I would like to express my gratitude for sharing your repository and the associated dataset. Upon reviewing the ranking/prepare_trec_eval_files.py script, I encountered the following parameters:

max_trec_eval_score = 128
min_trec_eval_score = 0
l_score += list(np.arange(min_trec_eval_score, max_trec_eval_score, max_trec_eval_score / n).round(3)[::-1][:n])

I was wondering if you could provide some clarification on their purpose. Specifically, I'm curious to know whether these parameters are intended for some form of scaling or normalization within the context of Terrier usage. Thank you once again for your contribution.

malinphy avatar Feb 13 '24 22:02 malinphy

Hi @malinphy,

Yes, we scaled the outputs in order to use Terrier. However, the score values do not matter for nDCG; only the rank/order is important, so this is fine for our case.
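
For illustration, here is a minimal sketch of what that line produces for a single query, assuming numpy is imported as np and a hypothetical query with n = 4 judged products:

    import numpy as np

    n = 4  # hypothetical number of judged products for one query
    max_trec_eval_score = 128
    min_trec_eval_score = 0

    scores = list(
        np.arange(min_trec_eval_score, max_trec_eval_score, max_trec_eval_score / n)
        .round(3)[::-1][:n]
    )
    print(scores)  # [96.0, 64.0, 32.0, 0.0] -- strictly decreasing with rank

The absolute values are arbitrary; the only property that matters for nDCG is that the scores decrease with the predicted rank, so the ordering is preserved in the run file given to trec_eval.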

Thanks for your question :)

franbvalero avatar Feb 15 '24 09:02 franbvalero

@franbvalero Is there any @k limit for the nDCG metrics, for example 5, 10, or 20? It is not mentioned in the paper (Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search).

malinphy avatar Apr 28 '24 13:04 malinphy

Our k covers all the labeled documents for a given query. We set that in the arguments of trec_eval here. However, you can also specify a value of k if you need it.

franbvalero avatar Apr 30 '24 16:04 franbvalero

@franbvalero Thanks for the answer. I guess you meant this line:

$1/terrier trec_eval "${TREC_EVAL_DATA_PATH}/test.qrels" "${TREC_EVAL_DATA_PATH}/hypothesis.results" -c -J -m 'ndcg.1=0,2=0.01,3=0.1,4=1'

I'm not familiar with the Terrier IR platform, but based on the explanation, it seems that this line calculates the Normalized Discounted Cumulative Gain (nDCG) metric using the specified gain values per relevance level (gains of 0, 0.01, 0.1, and 1 for relevance labels 1, 2, 3, and 4, respectively), rather than setting cutoff positions.
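
For intuition only, here is a small sketch of how per-relevance-level gains like these would enter a standard DCG computation; this is an illustrative approximation, not Terrier's or trec_eval's actual implementation:

    import math

    # Hypothetical gain map read off the command line above: relevance label -> gain
    gains = {1: 0.0, 2: 0.01, 3: 0.1, 4: 1.0}

    def dcg(labels):
        # labels is a ranked list of relevance labels (best-ranked first)
        return sum(gains[rel] / math.log2(rank + 2) for rank, rel in enumerate(labels))

    def ndcg(labels):
        # normalize by the DCG of the ideal (descending-gain) ordering
        ideal = dcg(sorted(labels, key=lambda r: gains[r], reverse=True))
        return dcg(labels) / ideal if ideal > 0 else 0.0

    print(ndcg([4, 3, 1, 2]))  # e.g. one query whose ranked products have labels 4, 3, 1, 2

If the four levels correspond to the ESCI labels from the paper, these gains would match the ones reported there (Exact 1.0, Substitute 0.1, Complement 0.01, Irrelevant 0.0).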

malinphy avatar May 01 '24 20:05 malinphy

Yes. In this case, we did not set a k limit on the number of documents. With the argument -J we consider only the documents that have relevance labels for a given query.

-J: Calculate all values only over the judged (either relevant or nonrelevant) documents.  All unjudged documents are removed from the retrieved set before any calculations (possibly leaving an empty set). DO NOT USE, unless you really know what you're doing - very easy to get reasonable looking numbers in a file that you will later forget were calculated  with the -J flag.

If I am not mistaken, the output of trec_eval by default gives you the nDCG score for several top-k values. However, you can set a specific value if you want with the following argument:

 -M <num>: Max number of docs per topic to use in evaluation (discard rest). Default is MAX_LONG.
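
As a rough, self-contained illustration of what a top-k cap does (not trec_eval's exact behaviour with -M), limiting evaluation to the top k amounts to truncating the ranking before computing the metric:

    import math

    gains = {1: 0.0, 2: 0.01, 3: 0.1, 4: 1.0}  # same hypothetical gain map as above

    def dcg(labels):
        return sum(gains[rel] / math.log2(rank + 2) for rank, rel in enumerate(labels))

    def ndcg_at_k(labels, k=None):
        # k=None keeps every judged document; an integer k discards the rest
        ranked = labels if k is None else labels[:k]
        ideal = sorted(labels, key=lambda r: gains[r], reverse=True)
        ideal = ideal if k is None else ideal[:k]
        return dcg(ranked) / dcg(ideal) if dcg(ideal) > 0 else 0.0

    labels = [2, 4, 1, 3, 1, 2]    # ranked relevance labels for one query
    print(ndcg_at_k(labels))       # all judged documents (as in our setup)
    print(ndcg_at_k(labels, k=4))  # only the top 4 retrieved documents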

franbvalero avatar May 03 '24 16:05 franbvalero