Source of training queries of BGE-EN-ICL (BEIR datasets)

Open ftvalentini opened this issue 8 months ago • 1 comment

I have some questions regarding the origin of the training queries used for BGE-EN-ICL for the following datasets, which have no training queries in BEIR:

quora: 10k test, 5k dev queries in BEIR -- bge-full-data has 60202 queries.
scidocsrr: 1k test queries in BEIR -- bge-full-data has 12654 queries.
arguana: 1406 test queries in BEIR -- bge-full-data has 3101 queries.

Where do these training queries come from?
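
For context, here is a rough sketch of the arithmetic behind the question. It uses only the counts quoted in this issue; the BEIR figures are the rounded values as written, and the dictionary and function names are purely illustrative, not part of BEIR or bge-full-data.

```python
# Query counts as quoted in this issue (BEIR figures are rounded as written).
beir_eval_queries = {
    "quora": 10_000 + 5_000,  # test + dev
    "scidocsrr": 1_000,       # test
    "arguana": 1_406,         # test
}
bge_full_data_queries = {
    "quora": 60_202,
    "scidocsrr": 12_654,
    "arguana": 3_101,
}


def unexplained_queries(beir, bge):
    """Per dataset, how many bge-full-data queries exceed the BEIR eval queries."""
    return {name: bge[name] - beir[name] for name in bge}


gap = unexplained_queries(beir_eval_queries, bge_full_data_queries)
for name, extra in gap.items():
    # A lower bound: it assumes every BEIR eval query could also be in bge-full-data.
    print(f"{name}: at least {extra} queries with no counterpart in BEIR")
```

Even under the most generous assumption, each dataset in bge-full-data contains more queries than BEIR provides for it, which is why I am asking about their source.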

Also, for the nli dataset: what is the source dataset?

Thank you so much for making such a valuable dataset available!

Mar 27 '25 16:03 ftvalentini

The datasets above are sourced from the reference papers cited in BEIR. The nli dataset is sourced from sentence-transformers/nli-for-simcse.

Apr 10 '25 06:04 545999961