Add SWIM-IR

Open rasdani opened this issue 1 year ago • 5 comments

Google released a new crosslingual retrieval dataset: https://huggingface.co/datasets/nthakur/swim-ir-cross-lingual

We could turn a subset of this into a retrieval and reranking benchmark.

If no one picks this up, I can take a look at this over the weekend.

rasdani avatar May 02 '24 08:05 rasdani

Amazing. Feel free to open a PR :)

isaac-chung avatar May 05 '24 07:05 isaac-chung

That'd be great indeed. cc @thakur-nandan

Muennighoff avatar May 09 '24 04:05 Muennighoff

Thanks @Muennighoff. The SWIM-IR dataset would be great and contains training splits only as it should be used for training. If that would be desirable we can go ahead and add it into MTEB.

Let me know if you need help @rasdani.

Thanks, Nandan

thakur-nandan avatar May 10 '24 16:05 thakur-nandan

> Thanks @Muennighoff. The SWIM-IR dataset would be great and contains training splits only as it should be used for training. If that would be desirable we can go ahead and add it into MTEB.
>
> Let me know if you need help @rasdani.
>
> Thanks, Nandan

Oh, does it still make sense to use it for evaluation, or not at all? I'm not sure adding a training dataset makes sense. cc @KennethEnevoldsen

Muennighoff avatar May 10 '24 16:05 Muennighoff

I wouldn't add a dataset intended for training unless we expect it to evaluate an aspect which we are currently not evaluating.

KennethEnevoldsen avatar May 11 '24 13:05 KennethEnevoldsen