mteb
Update results
Hi, team!
The other day I submitted the results of facebook-dpr-ctx_encoder-multiset-base in this PR, but later I realized that I had made a mistake. The DPR model has a separate query encoder and a separate document encoder, which means I shouldn't simply evaluate DPR's performance with this:
from sentence_transformers import SentenceTransformer
model_name = "facebook-dpr-ctx_encoder-multiset-base"
model = SentenceTransformer(model_name)
Instead, I should implement the model like this:
import numpy as np
import torch
from transformers import (
    DPRContextEncoder,
    DPRContextEncoderTokenizer,
    DPRQuestionEncoder,
    DPRQuestionEncoderTokenizer,
)

class DPRModel:
    def __init__(self):
        self.device = torch.device("cuda")
        # DPR ships two separate sets of weights: a context (passage) encoder
        # and a question encoder, even though both use the same BERT tokenizer.
        self.context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
        self.context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base").to(self.device)
        self.query_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")
        self.query_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base").to(self.device)
        self.sep = " "

    def encode(self, sentences, batch_size=128, **kwargs):
        # Fallback for non-retrieval tasks: use the context encoder.
        batches = [sentences[idx:idx + batch_size] for idx in range(0, len(sentences), batch_size)]
        embeddings = []
        for batch in batches:
            inputs = self.context_tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt").to(self.device)
            with torch.no_grad():
                embeddings.append(self.context_encoder(**inputs).pooler_output.cpu().numpy())
        return np.concatenate(embeddings, axis=0)

    def encode_queries(self, queries, batch_size=32, **kwargs):
        # Queries go through the question encoder.
        batches = [queries[idx:idx + batch_size] for idx in range(0, len(queries), batch_size)]
        embeddings = []
        for batch in batches:
            inputs = self.query_tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt").to(self.device)
            with torch.no_grad():
                embeddings.append(self.query_encoder(**inputs).pooler_output.cpu().numpy())
        return np.concatenate(embeddings, axis=0)

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        # Documents go through the context encoder; title and text are
        # concatenated the same way MTEB's RetrievalEvaluator does it.
        if isinstance(corpus, dict):
            corpus = [
                (corpus["title"][i] + self.sep + corpus["text"][i]).strip()
                if "title" in corpus
                else corpus["text"][i].strip()
                for i in range(len(corpus["text"]))
            ]
        else:
            corpus = [
                (doc["title"] + self.sep + doc["text"]).strip() if "title" in doc else doc["text"].strip()
                for doc in corpus
            ]
        batches = [corpus[idx:idx + batch_size] for idx in range(0, len(corpus), batch_size)]
        embeddings = []
        for batch in batches:
            inputs = self.context_tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt").to(self.device)
            with torch.no_grad():
                embeddings.append(self.context_encoder(**inputs).pooler_output.cpu().numpy())
        return np.concatenate(embeddings, axis=0)
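For reference, here's a minimal sketch of how I plug this wrapper into MTEB (the task name and output folder are just for illustration):

from mteb import MTEB

# MTEB's retrieval tasks call encode_queries / encode_corpus when the model
# defines them, and fall back to encode otherwise.
model = DPRModel()
evaluation = MTEB(tasks=["SciFact"])
evaluation.run(model, output_folder="results/dpr-multiset-base")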
So I want to confirm: is this the right way to evaluate a dual-encoder model? If so, could you please consider merging these two PRs (DPR, Dragon) with the updated results?
Thanks @Hannibal046, this looks very reasonable; will merge it in.
@imenelydiaker, @Muennighoff we talked about adding a scripts folder for running models, what did we end up deciding on that one?
Hello @Hannibal046,
Actually, I don't see the difference between the two tokenizers and two encoders you added in your implementation.
The tokenizers for queries and context are the same (BertTokenizer, see the docs here: https://huggingface.co/docs/transformers/main/en/model_doc/dpr#transformers.DPRQuestionEncoderTokenizer).
The encoders both apply pooling, just as SentenceTransformers does (see the docs: https://huggingface.co/docs/transformers/main/en/model_doc/dpr#transformers.DPRQuestionEncoder). The library already handles dual encoders: a dual encoder is the same encoder used twice to embed two input sentences.
The only additional thing you're doing in the encoders is concatenating title and text to build a corpus. I think the MTEB tasks already do that, and it shouldn't be modified, to make sure all models are evaluated on the same datasets.
After checking the results submitted in the previous PR and the new one, I see that you have the same results.
Something is wrong in the PR: your results should be added to the results folder, not to the root.
Also, there will be duplicated results in the end (we should restart the leaderboard for them to appear), because the new results didn't overwrite the previous ones, as they don't have the same folder name.
I suggest removing this evaluation, since the first evaluation using SentenceTransformers is correct.
Hi @imenelydiaker, thanks so much for the careful check!
Actually, the two DPR models are different, because DPR is not a universal text embedding model: it is specifically designed for the retrieval task and has two separate encoders. Here "separate" means two sets of parameters, so even though they share the same tokenizer and pooling method, they produce different outputs for the same input:
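For example, here is a quick sketch (reusing the Hugging Face DPR classes from the wrapper above) showing that the same input yields different embeddings from the two encoders:

import torch
from transformers import (
    DPRContextEncoder,
    DPRContextEncoderTokenizer,
    DPRQuestionEncoder,
)

tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base")

inputs = tokenizer("what is dense passage retrieval?", return_tensors="pt")
with torch.no_grad():
    ctx_emb = ctx_encoder(**inputs).pooler_output
    q_emb = q_encoder(**inputs).pooler_output

# Same tokenizer and pooling, but different weights, so this prints False.
print(torch.allclose(ctx_emb, q_emb))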
Sorry for mistakenly uploading to the wrong directory. But I am surprised that we got the same results. Is there something wrong with my implementation above? Could you please double-check it? As for the title-concatenation part, I copied it from here: https://github.com/embeddings-benchmark/mteb/blob/b08913f8616c580f8bbb15bfa808549e2b74912a/mteb/evaluation/evaluators/RetrievalEvaluator.py#L162-L176
Okay, my bad then. Maybe you should reopen a PR and put the results in the old folder so we can see the differences in results?
Sure
Actually, SentenceTransformers provides two models: one for queries (https://huggingface.co/sentence-transformers/facebook-dpr-question_encoder-single-nq-base) and one for context (https://huggingface.co/sentence-transformers/facebook-dpr-ctx_encoder-multiset-base).
Can we use these two models simultaneously for MTEB? If so, it would be very convenient for benchmarking dual encoders!
BTW, do I have to wait for the leaderboard to refresh? I tried it myself a few times but still couldn't find the Dragon results that were merged hours ago.
I don't think it's handled by MTEB; I think we use the same model for encoding queries and context. Maybe @Muennighoff can confirm this?
I'll restart the leaderboard once this is merged (https://huggingface.co/datasets/mteb/results/discussions/36), and the new results will appear 🙂
Thanks for your assistance! I have made several new PRs to meet the community requirements: https://huggingface.co/datasets/mteb/results/discussions/38 https://huggingface.co/datasets/mteb/results/discussions/39 https://huggingface.co/datasets/mteb/results/discussions/40
You can use different models via encode_queries & encode_corpus as explained here.
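For example, a minimal sketch (the class name and the title/text handling here are illustrative, not an official recipe) wrapping the two sentence-transformers DPR checkpoints mentioned above:

from sentence_transformers import SentenceTransformer

class DualSTModel:
    def __init__(self):
        self.query_model = SentenceTransformer("sentence-transformers/facebook-dpr-question_encoder-single-nq-base")
        self.doc_model = SentenceTransformer("sentence-transformers/facebook-dpr-ctx_encoder-multiset-base")

    def encode(self, sentences, **kwargs):
        # Fallback for non-retrieval tasks.
        return self.doc_model.encode(sentences, **kwargs)

    def encode_queries(self, queries, **kwargs):
        # Retrieval queries get their own encoder.
        return self.query_model.encode(queries, **kwargs)

    def encode_corpus(self, corpus, **kwargs):
        # Retrieval corpora arrive as dicts with "title"/"text" fields.
        sentences = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus]
        return self.doc_model.encode(sentences, **kwargs)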
Actually, to make merged results appear in the leaderboard, there are a few additional steps:
1. Merge the PR into results.
2. ONLY IF NEW MODEL: Add its name here.
3. Recreate the paths: clone the results repo locally, cd into it, open a Python interpreter, copy this function into it (https://huggingface.co/datasets/mteb/results/blob/main/results.py#L164), and run it. (Currently there is a problem with some files having the same name on macOS due to case insensitivity, so you should rather clone it on an OS that distinguishes case, like Linux.)
4. ONLY IF NEW MODEL: Add the specs of the model to all EXTERNAL_... global variables in the leaderboard code, i.e. here etc.
5. Restart the leaderboard space.
I'm happy to do it if it's too complex, but it's probably useful for everyone to know so you can also do it whenever you want :) Also, if people have ideas for simplifying it, they're of course welcome. Maybe steps 1, 2, and 3 can be automated somehow.
I believe this issue is resolved, so I'll close it for now.