E5 model: dataset creation
Hello,
I would like to create a custom dataset following your approach for the E5 model, so I have a few questions regarding each step:
- Retriever 1: hard negative mining is done by taking passages from BM25 on the custom dataset. How many negative samples are selected?
- Retriever 2: hard negative mining is done only on the dataset we want to fine-tune. The negatives are samples that Retriever 1 scores highly (i.e., difficult examples only). How many negative samples should be saved?
- Reranker: a fixed number of negative samples is selected based on the cosine similarity between query and passage from Retriever 2 (highest scores again). What is that number?
- Retriever distill: teacher scores are computed with the reranker model (a cross-encoder). However, how were the hard negative samples created? Are they the same ones used for Retriever 2?
Thank you for your help!
- For BM25, we sample 200 negatives from the top-1000 retrieved passages.
- For Retriever 2, we use the top-200 samples from Retriever 1 as hard negatives.
- For the reranker, it is again the top-200 samples from Retriever 2.
- Yes, the hard negative samples for distillation come from Retriever 2.
The related code is at https://github.com/microsoft/unilm/blob/82a002a29ae0cc9b1b1e38805b44d169101f41c3/simlm/scripts/search_marco.sh#L21-L24
Also, I'd like to point out that this hyperparameter does not affect performance much. Sampling from the top 100, 200, or 500 passages should lead to comparable results.
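For concreteness, here is a minimal Python sketch of the sampling step described above. The function name and arguments are illustrative (not from the simlm codebase); it assumes each query comes with a ranked list of candidate passage ids and a set of known positive ids:

```python
import random

def sample_hard_negatives(ranked_ids, positive_ids, depth=1000, num_negatives=200, seed=42):
    """Sample hard negatives from the top-`depth` retrieved passages,
    skipping passages that are known positives for the query."""
    candidates = [pid for pid in ranked_ids[:depth] if pid not in positive_ids]
    rng = random.Random(seed)
    return rng.sample(candidates, min(num_negatives, len(candidates)))

# BM25 stage: sample 200 negatives from the top-1000 results.
# negatives = sample_hard_negatives(bm25_ranking, positives, depth=1000, num_negatives=200)
# Retriever 1/2 stage: taking the top-200 directly corresponds to
# depth=200, num_negatives=200 (all 200 candidates are kept).
```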
@intfloat Hi, thanks for the answer. I would like to ask another question. Do I need to add the "query: " and "passage: " prefixes to the query and passage texts in the dataset, since the model was trained with them?
If you want to use the released model without fine-tuning, then you should add the "query: " and "passage: " prefixes; otherwise they are optional.
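For reference, a minimal encoding sketch with the prefixes in place, following the average-pooling usage shown on the E5 model cards (the example texts are made up; `intfloat/e5-base` is one of the released checkpoints):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states, attention_mask):
    # Zero out padding positions, then mean-pool over the sequence dimension.
    masked = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base")
model = AutoModel.from_pretrained("intfloat/e5-base")

texts = [
    "query: how many hard negatives are mined for E5?",
    "passage: For BM25, 200 negatives are sampled from the top-1000 retrieved passages.",
]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)
embeddings = F.normalize(average_pool(outputs.last_hidden_state, batch["attention_mask"]), dim=-1)
print(embeddings[0] @ embeddings[1])  # cosine similarity between query and passage
```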
Thanks. I'd like to fine-tune the released E5 model, so I think I should add the "query: " and "passage: " prefixes.
One last question: is there any plan to also release the cross-encoder? @intfloat
The cross-encoder for MS MARCO is available at https://github.com/microsoft/unilm/tree/master/simlm#available-models. We do not plan to release cross-encoders for other datasets, but the training procedure is similar, as described in https://github.com/microsoft/unilm/tree/master/simlm#train-a-cross-encoder-re-ranker.
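For anyone who wants to see the scoring interface, here is a generic cross-encoder sketch. The checkpoint below is a publicly available MS MARCO cross-encoder used as a stand-in, not the simlm re-ranker linked above; scores of this kind are what serve as the teacher signal during distillation:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stand-in cross-encoder; see the simlm README above for the released re-ranker.
model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

query = "how are hard negatives mined for E5?"
passages = [
    "Hard negatives are sampled from the top-ranked BM25 results.",
    "The weather today is sunny with a light breeze.",
]
# A cross-encoder scores each (query, passage) pair jointly in one forward pass.
batch = tokenizer([query] * len(passages), passages,
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**batch).logits.squeeze(-1)  # one relevance score per pair
print(scores)  # higher score = more relevant
```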