mteb
Update results
Hi, team!
The other day I submitted the results of facebook-dpr-ctx_encoder-multiset-base in this PR, but later I realized that I had made a mistake. The DPR model has a separate query encoder and a separate document encoder, which means I shouldn't simply evaluate DPR's performance with this:
from sentence_transformers import SentenceTransformer
model_name = "facebook-dpr-ctx_encoder-multiset-base"
model = SentenceTransformer(model_name)
Instead, I should implement the model like this:
import numpy as np
import torch
from transformers import (
    DPRContextEncoder,
    DPRContextEncoderTokenizer,
    DPRQuestionEncoder,
    DPRQuestionEncoderTokenizer,
)

class DPRModel:
    def __init__(self):
        self.device = torch.device("cuda")
        # DPR ships two separate sets of weights: a context (passage) encoder
        # and a question encoder, even though both use the same BERT tokenizer.
        self.context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
        self.context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base").to(self.device)
        self.query_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")
        self.query_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base").to(self.device)
        self.sep = " "

    def encode(self, sentences, batch_size=128, **kwargs):
        # Fallback for non-retrieval tasks: use the context encoder.
        batches = [sentences[idx:idx + batch_size] for idx in range(0, len(sentences), batch_size)]
        embeddings = []
        for batch in batches:
            inputs = self.context_tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt").to(self.device)
            with torch.no_grad():
                embeddings.append(self.context_encoder(**inputs).pooler_output.cpu().numpy())
        return np.concatenate(embeddings, axis=0)

    def encode_queries(self, queries, batch_size=32, **kwargs):
        # Queries go through the question encoder.
        batches = [queries[idx:idx + batch_size] for idx in range(0, len(queries), batch_size)]
        embeddings = []
        for batch in batches:
            inputs = self.query_tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt").to(self.device)
            with torch.no_grad():
                embeddings.append(self.query_encoder(**inputs).pooler_output.cpu().numpy())
        return np.concatenate(embeddings, axis=0)

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        # Documents go through the context encoder; title and text are
        # concatenated the same way MTEB's RetrievalEvaluator does it.
        if isinstance(corpus, dict):
            corpus = [
                (corpus["title"][i] + self.sep + corpus["text"][i]).strip()
                if "title" in corpus
                else corpus["text"][i].strip()
                for i in range(len(corpus["text"]))
            ]
        else:
            corpus = [
                (doc["title"] + self.sep + doc["text"]).strip() if "title" in doc else doc["text"].strip()
                for doc in corpus
            ]
        batches = [corpus[idx:idx + batch_size] for idx in range(0, len(corpus), batch_size)]
        embeddings = []
        for batch in batches:
            inputs = self.context_tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt").to(self.device)
            with torch.no_grad():
                embeddings.append(self.context_encoder(**inputs).pooler_output.cpu().numpy())
        return np.concatenate(embeddings, axis=0)
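For reference, here's a minimal sketch of how I plug this wrapper into MTEB (the task name and output folder are just for illustration):

from mteb import MTEB

# MTEB's retrieval tasks call encode_queries / encode_corpus when the model
# defines them, and fall back to encode otherwise.
model = DPRModel()
evaluation = MTEB(tasks=["SciFact"])
evaluation.run(model, output_folder="results/dpr-multiset-base")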
So I want to confirm: is this the right way to evaluate a dual-encoder model? If so, could you please consider merging these two PRs (DPR, Dragon) with the updated results?
Thanks @Hannibal046, this looks very reasonable; will merge it in.
@imenelydiaker, @Muennighoff we talked about adding a scripts folder for running models, what did we end up deciding on that one?
Hello @Hannibal046,
Actually, I don't see the difference between the two tokenizers and two encoders you added in your implementation.
The tokenizers for queries and context are the same (BertTokenizer, see the docs here: https://huggingface.co/docs/transformers/main/en/model_doc/dpr#transformers.DPRQuestionEncoderTokenizer).
The encoders both apply pooling, just as SentenceTransformers does (see the docs: https://huggingface.co/docs/transformers/main/en/model_doc/dpr#transformers.DPRQuestionEncoder). The library already handles dual encoders: a dual encoder is the same encoder used twice to embed two input sentences.
The only additional thing you're doing in the encoders is concatenating title and text to build a corpus. I think the MTEB tasks already do that, and it shouldn't be modified, to make sure all models are evaluated on the same datasets.
After checking the results submitted in the previous PR and the new one, I see that you have the same results.
Something is wrong in the PR: your results should be added to the results folder, not to the root.
Also, there will be duplicated results in the end (we should restart the leaderboard for them to appear), because the new results didn't overwrite the previous ones, as they don't have the same folder name.
I suggest removing this evaluation, since the first evaluation using SentenceTransformers is correct.
Hi @imenelydiaker, thanks so much for the careful check!
Actually, the two DPR models are different, because DPR is not a universal text embedding model: it is specifically designed for the retrieval task and has two separate encoders. Here "separate" means two sets of parameters, so even though they share the same tokenizer and pooling method, they produce different outputs for the same input:
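For example, here is a quick sketch (reusing the Hugging Face DPR classes from the wrapper above) showing that the same input yields different embeddings from the two encoders:

import torch
from transformers import (
    DPRContextEncoder,
    DPRContextEncoderTokenizer,
    DPRQuestionEncoder,
)

tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base")

inputs = tokenizer("what is dense passage retrieval?", return_tensors="pt")
with torch.no_grad():
    ctx_emb = ctx_encoder(**inputs).pooler_output
    q_emb = q_encoder(**inputs).pooler_output

# Same tokenizer and pooling, but different weights, so this prints False.
print(torch.allclose(ctx_emb, q_emb))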
Sorry for mistakenly uploading to the wrong directory. But I am surprised that we got the same results. Is there something wrong with my implementation above? Could you please double-check it? As for the title-concatenation part, I copied it from here: https://github.com/embeddings-benchmark/mteb/blob/b08913f8616c580f8bbb15bfa808549e2b74912a/mteb/evaluation/evaluators/RetrievalEvaluator.py#L162-L176
Okay, my bad then. Maybe you should reopen a PR and put the results in the old folder so we can see the differences in results?
Sure
Actually, SentenceTransformers provides two models: one for queries (https://huggingface.co/sentence-transformers/facebook-dpr-question_encoder-single-nq-base) and one for context (https://huggingface.co/sentence-transformers/facebook-dpr-ctx_encoder-multiset-base).
Can we use these two models simultaneously for MTEB? If so, it would be very convenient for benchmarking dual encoders!
BTW, do I have to wait for the leaderboard to refresh? I tried it myself a few times but still couldn't find the Dragon results that were merged hours ago.
I don't think it's handled by MTEB; I think we use the same model for encoding queries and context. Maybe @Muennighoff can confirm this?
I'll restart the leaderboard once this is merged (https://huggingface.co/datasets/mteb/results/discussions/36), and the new results will appear 🙂
Thanks for your assistance! I have made several new PRs to meet the community requirements: https://huggingface.co/datasets/mteb/results/discussions/38 https://huggingface.co/datasets/mteb/results/discussions/39 https://huggingface.co/datasets/mteb/results/discussions/40
You can use different models via encode_queries & encode_corpus as explained here.
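For example, a minimal sketch (the class name and the title/text handling here are illustrative, not an official recipe) wrapping the two sentence-transformers DPR checkpoints mentioned above:

from sentence_transformers import SentenceTransformer

class DualSTModel:
    def __init__(self):
        self.query_model = SentenceTransformer("sentence-transformers/facebook-dpr-question_encoder-single-nq-base")
        self.doc_model = SentenceTransformer("sentence-transformers/facebook-dpr-ctx_encoder-multiset-base")

    def encode(self, sentences, **kwargs):
        # Fallback for non-retrieval tasks.
        return self.doc_model.encode(sentences, **kwargs)

    def encode_queries(self, queries, **kwargs):
        # Retrieval queries get their own encoder.
        return self.query_model.encode(queries, **kwargs)

    def encode_corpus(self, corpus, **kwargs):
        # Retrieval corpora arrive as dicts with "title"/"text" fields.
        sentences = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus]
        return self.doc_model.encode(sentences, **kwargs)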
Actually, to make merged results appear in the leaderboard, there are a few additional steps:
1. Merge the PR into results.
2. ONLY IF NEW MODEL: Add its name here.
3. Recreate the paths: clone the results repo locally, cd into it, open a Python interpreter, copy this function into it (https://huggingface.co/datasets/mteb/results/blob/main/results.py#L164), and run it. (Currently there is a problem with some files having the same name on macOS due to case insensitivity, so you should rather clone it on an OS that distinguishes case, like Linux.)
4. ONLY IF NEW MODEL: Add the specs of the model to all EXTERNAL_... global variables in the leaderboard code, i.e. here etc.
5. Restart the leaderboard space.
I'm happy to do it if it's too complex, but it's probably useful for everyone to know so you can also do it whenever you want :) Also, if people have ideas for simplifying it, they're of course welcome. Maybe steps 1, 2, and 3 can be automated somehow.
I believe this issue is resolved, so I'll close it for now.