Cannot reproduce results for `google/embeddinggemma-300m`
Hi @RyanMullins @schechterh, I am running the model through the MTEB eval, trying to reproduce the reported results:
```python
import mteb
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "google/embeddinggemma-300m"

model = mteb.get_model(model_name)
# model = SentenceTransformer(model_name)

tasks = mteb.get_tasks(tasks=["ArguAna"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results_{model_name}_mteb", verbosity=1, overwrite_results=True)
```
I am getting ndcg_at_10: 0.30593 vs the reported 0.71535. Can you point out the error/bug? It would be very helpful.
Originally posted by @ShreyGanatra in https://github.com/embeddings-benchmark/results/issues/269#issuecomment-3301461665
We use task-specific prompts for all of our evals, and we didn’t include them in the SentenceTransformers config. I’ll work with Henrique to publish them somewhere for you.
The prompts are there in `config_sentence_transformers.json`:
```json
{
  "model_type": "SentenceTransformer",
  "__version__": {
    "sentence_transformers": "5.1.0",
    "transformers": "4.57.0.dev0",
    "pytorch": "2.8.0+cu128"
  },
  "prompts": {
    "query": "task: search result | query: ",
    "document": "title: none | text: ",
    "BitextMining": "task: search result | query: ",
    "Clustering": "task: clustering | query: ",
    "Classification": "task: classification | query: ",
    "InstructionRetrieval": "task: code retrieval | query: ",
    "MultilabelClassification": "task: classification | query: ",
    "PairClassification": "task: sentence similarity | query: ",
    "Reranking": "task: search result | query: ",
    "Retrieval": "task: search result | query: ",
    "Retrieval-query": "task: search result | query: ",
    "Retrieval-document": "title: none | text: ",
    "STS": "task: sentence similarity | query: ",
    "Summarization": "task: summarization | query: "
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
```
And they are even used during the MTEB eval.
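For reference, a quick way to confirm the prompts are loaded and applied (a minimal sketch; SentenceTransformers prepends the selected prompt string to the input text before encoding):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

# Prompts from config_sentence_transformers.json are exposed on the model
print(model.prompts["Retrieval-query"])     # "task: search result | query: "
print(model.prompts["Retrieval-document"])  # "title: none | text: "

# encode() prepends the named prompt to the text before embedding it
query_emb = model.encode("how do the prompts work?", prompt_name="Retrieval-query")
doc_emb = model.encode("Prompts are strings prepended to the input.", prompt_name="Retrieval-document")
print(model.similarity(query_emb, doc_emb))
```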
Yeah, but those are the Category-level prompts. Some Tasks within a Category use a different prompt than the broader Category, and as you can see this can have a massive influence on the model’s performance. There are a dozen or so Tasks that work that way. I’ll filter down to the Task-specific prompts that differ from their category and add those to the config on HF Hub. There might still be some differences due to 1) the use of titles instead of “none” for document tasks, and 2) JAX vs Torch implementations, but they should be comparable in the aggregate.
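To illustrate the mechanism only (the prompt strings below are placeholders, not the actual eval prompts; a `"<TaskName>-query"` / `"<TaskName>-document"` entry should take precedence over the Category-level entry when that task runs):

```python
from sentence_transformers import SentenceTransformer
import mteb

model = SentenceTransformer("google/embeddinggemma-300m")

# Hypothetical Task-level overrides: these keys win over the broader
# "Retrieval" Category key when ArguAna is run. The strings here are
# placeholders, not the prompts used for the reported scores.
model.prompts["ArguAna-query"] = "task: search result | query: "
model.prompts["ArguAna-document"] = "title: none | text: "

tasks = mteb.get_tasks(tasks=["ArguAna"])
results = mteb.MTEB(tasks=tasks).run(model)
```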
Yeah, if you could paste the prompts here for quick reference, that would be helpful.
Using

```
pip install -U sentence-transformers git+https://github.com/huggingface/[email protected]
```

the score increases to 0.65847.
Ah, yes. Probably the biggest problem was that we were using an old version of transformers, but correct prompts are still required for reproducible evaluation. Maybe we should raise an error for this model if the installed transformers version is lower than what the model requires.
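Something along these lines could work (a minimal sketch; the exact minimum version is an assumption and would need to come from the model's metadata):

```python
from packaging.version import Version

import transformers

# Assumed minimum; the real requirement should be read from the model's metadata.
MIN_TRANSFORMERS = Version("4.56.0")

installed = Version(transformers.__version__)
if installed < MIN_TRANSFORMERS:
    raise RuntimeError(
        f"google/embeddinggemma-300m requires transformers>={MIN_TRANSFORMERS}, "
        f"but {installed} is installed; scores will not match the reported results."
    )
```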
Created a PR to check the transformers version for the embedding-gemma model: https://github.com/embeddings-benchmark/mteb/pull/3189
@RyanMullins, ideally, the implementation on MTEB should reflect 1-1 what you ran (and thus what a user can expect), including prompts, so do feel free to submit a PR to add these (in case you don't want them in the model repo).
If you prefer that implementation to be in JAX, that is also perfectly fine.
@RyanMullins can you share your prompts?
Is this fixed?
I tried:

```python
import mteb
from sentence_transformers import SentenceTransformer

model_name = "google/embeddinggemma-300m"
model = mteb.get_model(model_name)

evaluation = mteb.MTEB(tasks=["StackExchangeClustering.v2"])
results = evaluation.run(model)
```
I am getting:

```
"v_measure": 0.673284,
"v_measure_std": 0.00591,
"main_score": 0.673284,
```

But the leaderboard reports 90.94.
The performance gap is quite large. If I’m making any mistake here, please help me identify where I went wrong.
I don't think you've done anything wrong. Probably a prompt difference.
@Samoed unfortunately I can't, but I can confirm that this is a prompting difference. Sorry I can't provide or say more, but I'm limited by the powers that be.