E5 model: dataset creation
Hello,
I would like to create a custom dataset following your approach for the E5 model, so I have a few questions regarding each step:
- Retriever 1: hard negative mining is done by taking passages from BM25 on the custom dataset. How many negative samples are selected?
- Retriever 2: hard negative mining is done only on the dataset we want to fine-tune. The negatives are samples that Retriever 1 scores highly (i.e., difficult examples only). How many negative samples should be saved?
- Reranker: a fixed number of negative samples is selected based on the cosine similarity between query and passage from Retriever 2 (highest scores again). What is that number?
- Retriever distill: teacher scores are computed with the reranker model (a cross-encoder). However, how were the hard negative samples created? Are they the same ones used for Retriever 2?
Thank you for your help!
- For BM25, we sample 200 negatives from the top-1000 retrieved passages.
- For Retriever 2, we use the top-200 samples from Retriever 1 as hard negatives.
- For the reranker, it is again the top-200 samples from Retriever 2.
- Yes, the hard negative samples for distillation come from Retriever 2.
The related code is at https://github.com/microsoft/unilm/blob/82a002a29ae0cc9b1b1e38805b44d169101f41c3/simlm/scripts/search_marco.sh#L21-L24
Also, I'd like to point out that this hyperparameter does not affect performance much. Sampling from the top 100, 200, or 500 passages should lead to comparable results.
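For concreteness, here is a minimal Python sketch of the sampling step described above. The function name and arguments are illustrative (not from the simlm codebase); it assumes each query comes with a ranked list of candidate passage ids and a set of known positive ids:

```python
import random

def sample_hard_negatives(ranked_ids, positive_ids, depth=1000, num_negatives=200, seed=42):
    """Sample hard negatives from the top-`depth` retrieved passages,
    skipping passages that are known positives for the query."""
    candidates = [pid for pid in ranked_ids[:depth] if pid not in positive_ids]
    rng = random.Random(seed)
    return rng.sample(candidates, min(num_negatives, len(candidates)))

# BM25 stage: sample 200 negatives from the top-1000 results.
# negatives = sample_hard_negatives(bm25_ranking, positives, depth=1000, num_negatives=200)
# Retriever 1/2 stage: taking the top-200 directly corresponds to
# depth=200, num_negatives=200 (all 200 candidates are kept).
```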
@intfloat Hi, thanks for the answer. I would like to ask another question. Do I need to add the "query: " and "passage: " prefixes to the query and passage texts in the dataset, since the model was trained with them?
If you want to use the released model without fine-tuning, then you should add the "query: " and "passage: " prefixes; otherwise they are optional.
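For reference, a minimal encoding sketch with the prefixes in place, following the average-pooling usage shown on the E5 model cards (the example texts are made up; `intfloat/e5-base` is one of the released checkpoints):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states, attention_mask):
    # Zero out padding positions, then mean-pool over the sequence dimension.
    masked = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base")
model = AutoModel.from_pretrained("intfloat/e5-base")

texts = [
    "query: how many hard negatives are mined for E5?",
    "passage: For BM25, 200 negatives are sampled from the top-1000 retrieved passages.",
]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)
embeddings = F.normalize(average_pool(outputs.last_hidden_state, batch["attention_mask"]), dim=-1)
print(embeddings[0] @ embeddings[1])  # cosine similarity between query and passage
```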
Thanks. I'd like to fine-tune the released E5 model, so I think I should add the "query: " and "passage: " prefixes.
One last question: is there any plan to also release the cross-encoder? @intfloat
The cross-encoder for MS MARCO is available at https://github.com/microsoft/unilm/tree/master/simlm#available-models. We do not plan to release cross-encoders for other datasets, but the training procedure is similar, as described in https://github.com/microsoft/unilm/tree/master/simlm#train-a-cross-encoder-re-ranker.
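For anyone who wants to see the scoring interface, here is a generic cross-encoder sketch. The checkpoint below is a publicly available MS MARCO cross-encoder used as a stand-in, not the simlm re-ranker linked above; scores of this kind are what serve as the teacher signal during distillation:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stand-in cross-encoder; see the simlm README above for the released re-ranker.
model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

query = "how are hard negatives mined for E5?"
passages = [
    "Hard negatives are sampled from the top-ranked BM25 results.",
    "The weather today is sunny with a light breeze.",
]
# A cross-encoder scores each (query, passage) pair jointly in one forward pass.
batch = tokenizer([query] * len(passages), passages,
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**batch).logits.squeeze(-1)  # one relevance score per pair
print(scores)  # higher score = more relevant
```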