
Overview: Leaderboard release

x-tabdeveloping opened this issue 11 months ago • 26 comments

Since we would like to release the leaderboard as soon as possible (especially since the paper got accepted to ICLR), I would love to open a discussion about what we consider to be the minimum requirements for publishing the new leaderboard. I highly doubt that we will be able to fix all issues right away, but we should, in any case, focus on a couple of them that are crucial for the new leaderboard to be in a releasable state.

Here are some of my criteria:

VITAL PROBLEMS:

  • [x] We need to fix task aggregation. This has been fixed by introducing aggregated tasks and manually aggregating results on CQADupstackRetrieval (see the aggregation sketch after this list).
  • [ ] Some models (e.g. Jasper, voyage-large-2-instruct, Cohere, etc.) are still missing scores on MSMARCO. This has been fixed to a certain extent, but some models were not run on the dev split of MSMARCO; these need to be run (#1898).
  • [x] We should create documentation on how to submit new models and results, and direct people to the docs from the leaderboard (possibly with issue templates): https://github.com/embeddings-benchmark/mteb/issues/1868
  • [x] We should add task metadata to all tasks in MTEB(eng, classic) #1886 (@imenelydiaker is working on this in #1895)
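
As a side note, here is a minimal sketch of what the manual CQADupstackRetrieval aggregation amounts to. The helper below is illustrative, not mteb's actual aggregation code, and it assumes the per-sub-task main scores have already been loaded into a plain dict:

```python
# Illustrative: the classic CQADupstackRetrieval score is the unweighted mean
# of its twelve Stack Exchange sub-tasks' main scores (e.g. nDCG@10).
CQADUPSTACK_SUBTASKS = [
    "CQADupstackAndroidRetrieval", "CQADupstackEnglishRetrieval",
    "CQADupstackGamingRetrieval", "CQADupstackGisRetrieval",
    "CQADupstackMathematicaRetrieval", "CQADupstackPhysicsRetrieval",
    "CQADupstackProgrammersRetrieval", "CQADupstackStatsRetrieval",
    "CQADupstackTexRetrieval", "CQADupstackUnixRetrieval",
    "CQADupstackWebmastersRetrieval", "CQADupstackWordpressRetrieval",
]

def aggregate_cqadupstack(scores_by_task: dict[str, float]) -> float:
    """Average the main score over all CQADupstack sub-tasks."""
    missing = [t for t in CQADUPSTACK_SUBTASKS if t not in scores_by_task]
    if missing:
        raise ValueError(f"Cannot aggregate; missing sub-task scores: {missing}")
    return sum(scores_by_task[t] for t in CQADUPSTACK_SUBTASKS) / len(CQADUPSTACK_SUBTASKS)
```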

I have tried implementing as many model metas as humanly possible over the last couple of days, but it has been incredibly time-consuming. If you still see models missing that you think should definitely be there, feel free to comment here.

Nice to haves:

  • [ ] Cross-encoder filtering is missing (#1841). This doesn't work in the old leaderboard either, so we are technically not behind on this, but it would be great to get it working in the new one.
  • [x] New banner (#1855). Probably not too difficult to make one.
  • [x] https://github.com/embeddings-benchmark/mteb/issues/1935

THIS IS JUST MY JUDGEMENT, PLEASE FEEL FREE TO ADD THINGS, I TOTALLY MIGHT BE MISSING SOMETHING

@Samoed @KennethEnevoldsen @Muennighoff @orionw @isaac-chung @imenelydiaker @tomaarsen

x-tabdeveloping avatar Jan 24 '25 14:01 x-tabdeveloping

@x-tabdeveloping thanks for suggesting these! I agree with the vital list here, and can help with 3 (docs) and/or 1 (agg).

Re: 4, what would that look like? Maybe a) disable the update cron and b) add a banner/message to the app to link to the new leaderboard?

There's also a list of "must-haves" + "nice-to-haves" issues kept in this comment, and the only must-have left seems to be related to missing model results. It would be great if we could update the linked issue and establish the must-haves within it.

isaac-chung avatar Jan 24 '25 15:01 isaac-chung

Great overview! Maybe an alternative to freezing is just to focus on all the other issues first and then, once everything else is done, do another round of syncing at the end?

Muennighoff avatar Jan 24 '25 15:01 Muennighoff

For Jasper and Voyage, only the test set of MSMARCO was evaluated, but on the leaderboard we recently changed it to the dev split (#1620).

Samoed avatar Jan 24 '25 16:01 Samoed

Hmm strange, but why do they show up in the old leaderboard then? Shouldn't we strive for 100% feature parity?

x-tabdeveloping avatar Jan 24 '25 17:01 x-tabdeveloping

I couldn't find it initially, but it seems we do have their scores on the dev split. However, when loading with res = mteb.load_results(models=["infgrad/jasper_en_vision_language_v1"], tasks=["MSMARCO"]), I see the log message MSMARCO: Missing splits {'dev'}. Maybe it's only loading one revision of the results.
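
For reference, here is the loading call from above in a runnable form (whether load_results returns a dict or a results object depends on the mteb version, so the print is just for inspection):

```python
import mteb

# Load cached results for a single model on MSMARCO. If only one model
# revision's result files contain the dev split, the loader may log
# "MSMARCO: Missing splits {'dev'}" for the revisions that lack it.
res = mteb.load_results(
    models=["infgrad/jasper_en_vision_language_v1"],
    tasks=["MSMARCO"],
)
print(res)
```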

Samoed avatar Jan 24 '25 18:01 Samoed

Root cause

It seems the results file linked in the comment above is from the external revision. The results files of the other two revisions of the Jasper model did not contain the dev split; one of those files must have been loaded, which would explain the log message above.

Proposed fix

Rerun MSMARCO on the latest model revision, or specify the external revision, which contains a dev-split result.
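
A rough sketch of what such a rerun could look like (the placeholder model, the eval_splits keyword, and the output folder are illustrative, and exact arguments may differ between mteb versions):

```python
import mteb
from sentence_transformers import SentenceTransformer

# Hypothetical rerun of MSMARCO's dev split for an affected model.
# "intfloat/e5-base-v2" is only a placeholder; the affected models
# (e.g. Jasper, voyage-large-2-instruct) would need their own loaders.
model = SentenceTransformer("intfloat/e5-base-v2")
tasks = mteb.get_tasks(tasks=["MSMARCO"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, eval_splits=["dev"], output_folder="results")
```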

isaac-chung avatar Jan 25 '25 05:01 isaac-chung

We can overwrite Jasper's and Voyage's revision in the metadata to external; that revision would then take the highest precedence when loading results. I think this would be the most painless, though not optimal, solution. What do you think @isaac-chung @Samoed?
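
For context, a minimal sketch of where that revision lives (inspection only; the actual overwrite would be made in the model's ModelMeta definition, and mteb.get_model_meta is assumed to be available in the installed version):

```python
import mteb

# Inspect the revision currently recorded for Jasper. The proposed fix is to
# change this field to "external" in the model's ModelMeta definition so that
# the externally submitted result files take precedence when loading results.
meta = mteb.get_model_meta("infgrad/jasper_en_vision_language_v1")
print(meta.revision)
```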

x-tabdeveloping avatar Jan 27 '25 13:01 x-tabdeveloping

Another option would be to delete the result files without the dev split from the results repo.

x-tabdeveloping avatar Jan 27 '25 13:01 x-tabdeveloping

@x-tabdeveloping the overwrite option is fine and I think we can go for it, but note that it'll only buy us some time: anyone who runs these models will produce result files under the 'external' revision, which is not desirable.

Let's open an issue so that we would eventually rerun these models on MSMARCO with non-external revisions as well. How does that sound?

isaac-chung avatar Jan 27 '25 13:01 isaac-chung

How about we just remove the newer results on the problematic tasks from the results folder? Then we can rerun in the future if need be, and if people run the models now, they will get the correct revision. (Also note that I don't think we have an actionable implementation of Jasper in mteb yet, since it is a multimodal model.)

x-tabdeveloping avatar Jan 27 '25 13:01 x-tabdeveloping

That sounds good.

isaac-chung avatar Jan 27 '25 14:01 isaac-chung

I've managed to find another pretty burning issue that we need to fix before launching the leaderboard: #1886. Many tasks in the MTEB(eng, classic) benchmark are missing task metadata, including domains, which is vital to the leaderboard's filtering. (A short script for reproducing the list below is sketched after it.)

ArxivClusteringS2S.domains = None
AskUbuntuDupQuestions.domains = None
BIOSSES.domains = None
CQADupstackAndroidRetrieval.domains = None
CQADupstackEnglishRetrieval.domains = None
CQADupstackGamingRetrieval.domains = None
CQADupstackGisRetrieval.domains = None
CQADupstackMathematicaRetrieval.domains = None
CQADupstackPhysicsRetrieval.domains = None
CQADupstackStatsRetrieval.domains = None
CQADupstackTexRetrieval.domains = None
CQADupstackUnixRetrieval.domains = None
CQADupstackWebmastersRetrieval.domains = None
CQADupstackWordpressRetrieval.domains = None
ClimateFEVER.domains = None
FEVER.domains = None
FiQA2018.domains = None
NQ.domains = None
QuoraRetrieval.domains = None
RedditClustering.domains = None
RedditClusteringP2P.domains = None
STSBenchmark.domains = None
StackExchangeClustering.domains = None
StackExchangeClusteringP2P.domains = None
StackOverflowDupQuestions.domains = None
TwitterSemEval2015.domains = None
TwitterURLCorpus.domains = None
MSMARCO.domains = None

x-tabdeveloping avatar Jan 28 '25 08:01 x-tabdeveloping

@x-tabdeveloping I'll fill them out. I think I did some of them for the paper and forgot to add them to TaskMetadata, my bad 😅 On which branch should I push the changes? v2.0.0?

imenelydiaker avatar Jan 28 '25 13:01 imenelydiaker

I think main @imenelydiaker ! So that we can release the leaderboard

x-tabdeveloping avatar Jan 28 '25 14:01 x-tabdeveloping

@isaac-chung @Samoed I might be able to fix the issue with the results in code, I will update you about it.

x-tabdeveloping avatar Jan 28 '25 14:01 x-tabdeveloping

Okay, so I have fixed the cases where the results are present but in the external results folder. On the other hand, for some models, like voyage-large-2-instruct, we are missing the dev split completely on MSMARCO. How can it be present in the old leaderboard if we don't have the scores on the dev split?

x-tabdeveloping avatar Jan 28 '25 14:01 x-tabdeveloping

I believe this is a bug: the dev split for voyage-large-2-instruct is not present in the MSMARCO results repository. The old leaderboard checks whether the requested split is present in the results dictionary, with test as the default; in this case it could not find the dev split and fell back to test, because that split is present in the dict.
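
To make the failure mode concrete, here is a simplified, hypothetical illustration of how a split lookup with a silent fallback can surface test-split scores as if they were dev scores (this is not the actual leaderboard code):

```python
# Hypothetical stand-in for the per-split scores in a results file.
results_by_split = {"test": {"ndcg_at_10": 0.41}}  # no "dev" entry at all

def get_split_scores(results, split="dev", default_split="test"):
    # Buggy behaviour: if the requested split is missing, silently fall back
    # to the default split instead of reporting the result as missing.
    if split in results:
        return results[split]
    return results.get(default_split)

print(get_split_scores(results_by_split, split="dev"))  # prints the *test* scores
```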

Samoed avatar Jan 28 '25 15:01 Samoed

So, in conclusion, it is a bug in the old leaderboard, and the only way to fix it is for us to run MSMARCO's dev split on these models. Is that a correct assessment?

x-tabdeveloping avatar Jan 28 '25 15:01 x-tabdeveloping

Unfortunately yes

Samoed avatar Jan 28 '25 15:01 Samoed

@x-tabdeveloping I can work on the banner, unless there is something more pressing I can help with (it seems like the other items are in the works).

wissam-sib avatar Jan 28 '25 17:01 wissam-sib

Sure thing @wissam-sib! By all means go ahead

x-tabdeveloping avatar Jan 28 '25 20:01 x-tabdeveloping

Cool, I've started here: https://github.com/embeddings-benchmark/mteb/pull/1908

wissam-sib avatar Jan 30 '25 09:01 wissam-sib

Looks like we're down to the last vital issue before the release!

isaac-chung avatar Jan 31 '25 11:01 isaac-chung

@Muennighoff Can you help us out with it? Some models don't have MSMARCO results at all on the dev split, and we might need to run them.

x-tabdeveloping avatar Jan 31 '25 11:01 x-tabdeveloping

Yes will try to run them this weekend! Amazing work on everything 🚀🚀🚀

Muennighoff avatar Jan 31 '25 22:01 Muennighoff

The leaderboard is getting really close to being ready. @x-tabdeveloping and I manually reviewed and compared both leaderboards and found a few remaining issues. These are generally specification differences between benchmarks.py and the current v1 of the leaderboard. We also have a few missing results: some of these Niklas (@Muennighoff) is rerunning, while others are for newer model releases (<1 month old); for these, we have reached out to the authors to let them know about the changes. For the inconsistencies, we have asked the benchmark contacts (e.g., @imenelydiaker for French) to clarify which version is desired.

We are planning to do the release on Tuesday next week.

There are a few missing scores and inconsistencies:

  1. Russian: Some newer model releases
  2. MTEB(eng, classic): #1898
  3. French: #1919
  4. Polish: #1917

KennethEnevoldsen avatar Feb 01 '25 12:02 KennethEnevoldsen