Cannot reproduce results for `google/embeddinggemma-300m`
Hi @RyanMullins @schechterh, I am running the model through the MTEB eval, trying to reproduce the reported results:
```python
import mteb
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "google/embeddinggemma-300m"

model = mteb.get_model(model_name)
# model = SentenceTransformer(model_name)

tasks = mteb.get_tasks(tasks=["ArguAna"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results_{model_name}_mteb", verbosity=1, overwrite_results=True)
```
I am getting ndcg_at_10: 0.30593 vs the reported 0.71535. Can you point out the error/bug? It would be very helpful.
Originally posted by @ShreyGanatra in https://github.com/embeddings-benchmark/results/issues/269#issuecomment-3301461665
We use task-specific prompts for all of our evals, and we didn’t include them in the SentenceTransformers config. I’ll work with Henrique to publish them somewhere for you.
The prompts are there in `config_sentence_transformers.json`:
```json
{
  "model_type": "SentenceTransformer",
  "__version__": {
    "sentence_transformers": "5.1.0",
    "transformers": "4.57.0.dev0",
    "pytorch": "2.8.0+cu128"
  },
  "prompts": {
    "query": "task: search result | query: ",
    "document": "title: none | text: ",
    "BitextMining": "task: search result | query: ",
    "Clustering": "task: clustering | query: ",
    "Classification": "task: classification | query: ",
    "InstructionRetrieval": "task: code retrieval | query: ",
    "MultilabelClassification": "task: classification | query: ",
    "PairClassification": "task: sentence similarity | query: ",
    "Reranking": "task: search result | query: ",
    "Retrieval": "task: search result | query: ",
    "Retrieval-query": "task: search result | query: ",
    "Retrieval-document": "title: none | text: ",
    "STS": "task: sentence similarity | query: ",
    "Summarization": "task: summarization | query: "
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
```
And they are even used during the MTEB eval.
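For reference, a quick way to confirm the prompts are loaded and applied (a minimal sketch; SentenceTransformers prepends the selected prompt string to the input text before encoding):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

# Prompts from config_sentence_transformers.json are exposed on the model
print(model.prompts["Retrieval-query"])     # "task: search result | query: "
print(model.prompts["Retrieval-document"])  # "title: none | text: "

# encode() prepends the named prompt to the text before embedding it
query_emb = model.encode("how do the prompts work?", prompt_name="Retrieval-query")
doc_emb = model.encode("Prompts are strings prepended to the input.", prompt_name="Retrieval-document")
print(model.similarity(query_emb, doc_emb))
```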
Yeah, but those are the Category-level prompts. Some Tasks within a Category use a different prompt than the broader Category, and as you can see this can have a massive influence on the model’s performance. There are a dozen or so Tasks that work that way. I’ll filter down to the Task-specific prompts that differ from their category and add those to the config on HF Hub. There might still be some differences due to 1) the use of titles instead of “none” for document tasks, and 2) JAX vs Torch implementations, but they should be comparable in the aggregate.
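To illustrate the mechanism only (the prompt strings below are placeholders, not the actual eval prompts; a `"<TaskName>-query"` / `"<TaskName>-document"` entry should take precedence over the Category-level entry when that task runs):

```python
from sentence_transformers import SentenceTransformer
import mteb

model = SentenceTransformer("google/embeddinggemma-300m")

# Hypothetical Task-level overrides: these keys win over the broader
# "Retrieval" Category key when ArguAna is run. The strings here are
# placeholders, not the prompts used for the reported scores.
model.prompts["ArguAna-query"] = "task: search result | query: "
model.prompts["ArguAna-document"] = "title: none | text: "

tasks = mteb.get_tasks(tasks=["ArguAna"])
results = mteb.MTEB(tasks=tasks).run(model)
```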
Yeah, if you could paste the prompts here for quick reference, that would be helpful.
Using

```
pip install -U sentence-transformers git+https://github.com/huggingface/[email protected]
```

the score increases to 0.65847.
Ah, yes. Probably the biggest problem was that we were using an old version of transformers, but correct prompts are still required for reproducible evaluation. Maybe we should raise an error for this model if the installed transformers version is lower than what the model requires.
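Something along these lines could work (a minimal sketch; the exact minimum version is an assumption and would need to come from the model's metadata):

```python
from packaging.version import Version

import transformers

# Assumed minimum; the real requirement should be read from the model's metadata.
MIN_TRANSFORMERS = Version("4.56.0")

installed = Version(transformers.__version__)
if installed < MIN_TRANSFORMERS:
    raise RuntimeError(
        f"google/embeddinggemma-300m requires transformers>={MIN_TRANSFORMERS}, "
        f"but {installed} is installed; scores will not match the reported results."
    )
```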
Created a PR to check the transformers version for the embedding-gemma model: https://github.com/embeddings-benchmark/mteb/pull/3189
@RyanMullins, ideally, the implementation on MTEB should reflect 1-1 what you ran (and thus what a user can expect), including prompts, so do feel free to submit a PR to add these (in case you don't want them in the model repo).
If you prefer that implementation to be in JAX, that is also perfectly fine.
@RyanMullins can you share your prompts?
Is this fixed?
I tried:

```python
import mteb
from sentence_transformers import SentenceTransformer

model_name = "google/embeddinggemma-300m"
model = mteb.get_model(model_name)

evaluation = mteb.MTEB(tasks=["StackExchangeClustering.v2"])
results = evaluation.run(model)
```
I am getting:

```
"v_measure": 0.673284,
"v_measure_std": 0.00591,
"main_score": 0.673284,
```

But the leaderboard reports 90.94.
The performance gap is quite large. If I’m making any mistake here, please help me identify where I went wrong.
I don't think you've done anything wrong. Probably a prompt difference.
@Samoed unfortunately I can't, but I can confirm that this is a prompting difference. Sorry I can't provide or say more, but I'm limited by the powers that be.