Condenser
Reproducing your results on MS MARCO
Hi,
Thank you for your great work!
I would like to replicate your results on the MS MARCO passage collection, and I have a question regarding the Luyu/co-condenser-marco model. Is this the final model that you used to retrieve documents, or do I need to fine-tune it on MS MARCO relevant query/passage pairs?
Could you provide a bit more detail on how I should use your dense retrieval toolkit with this model?
Thank you in advance!
Hello,
Please take a look at the coCondenser fine-tuning tutorial. It should answer most of your questions.
We can leave this issue open for now in case you run into other problems.
Thank you for the great tutorial!
One small issue I found:
--passage_reps corpus/corpus/'*.pt'
should be
--passage_reps encoding/corpus/'*.pt'
in the Index Search section here: https://github.com/texttron/tevatron/tree/main/examples/coCondenser-marco#index-search
Thanks for catching that!
Hi,
I was able to replicate the MRR@10 reported in the paper (0.38), but I was wondering: what is the difference between the number reported on the leaderboard (0.44) and the 0.38? How do I replicate that? Is it measured on a different set?
Hi, @luyug
Thanks for your awesome work. I have a similar question about NQ. Would it be possible to give more details on reproducing the paper's NQ result (84.3 MRR@5), like the detailed MS MARCO tutorial?
Or, if that would take some time, could you tell me whether your SOTA model on NQ is trained with only mined hard negatives, or with both BM25 hard negatives and mined hard negatives as in the DPR GitHub?
Thanks.
Hi @luyug,
Thanks for your great work! I am also confused about the difference between the reported result and the leaderboard number (0.38 vs. 0.44). Is there any update on this?
Also interested. From what I remember, the main difference is that a reranker is also applied on top of the retrieval results. Would it be possible to get the checkpoint of the reranker?
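In the meantime, my understanding is that "applying a reranker" just means re-scoring the retrieved candidates with a cross-encoder and re-sorting them. A minimal sketch with an off-the-shelf public checkpoint (a placeholder for illustration, not the authors' leaderboard reranker) would be something like:

# Placeholder cross-encoder reranker; NOT the authors' leaderboard model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

reranker_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(reranker_name)
reranker = AutoModelForSequenceClassification.from_pretrained(reranker_name)
reranker.eval()

def rerank(query, passages):
    # Jointly score each (query, passage) pair and sort passages by score.
    batch = tokenizer([query] * len(passages), passages, padding=True,
                      truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        scores = reranker(**batch).logits.squeeze(-1)
    order = scores.argsort(descending=True).tolist()
    return [(passages[i], scores[i].item()) for i in order]

In practice you would apply this to the top candidates from the dense run for each dev query and then recompute MRR@10.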
Hi, thank you for your great work! I encountered some issues when trying to reproduce the results on MS MARCO passage. I followed the aforementioned tutorial but still cannot resolve them (the problem seems to be in the hard negative mining step).
First, I run Fine-tuning Stage 1 with
CUDA_VISIBLE_DEVICES=3 python -m tevatron.driver.train \
--output_dir model_msmarco_s1 \
--model_name_or_path ../data/co-condenser-marco \
--save_steps 20000 \
--train_dir ../data/msmarco-passage/train_dir \
--data_cache_dir ../data/msmarco-passage-train-cache \
--fp16 \
--dataloader_num_workers 2 \
--per_device_train_batch_size 8 \
--train_n_passages 8 \
--learning_rate 5e-6 \
--q_max_len 16 \
--p_max_len 128 \
--num_train_epochs 3 \
--logging_steps 500
and get MRR@10=0.3596, R@1000=0.9771 (your reported results are MRR@10=0.357, R@1000=0.978).
Then, I run hard negative mining, randomly sampling 30 negatives from the top-200 retrieval results of model_msmarco_s1, by modifying scripts/hn_mining.py (following the parameters in build_train_hn.py).
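Concretely, the sampling step I added looks roughly like this (variable names are mine, not hn_mining.py's):

import random

def sample_hard_negatives(ranked_pids, positive_pids, depth=200, n_sample=30):
    # Keep the top-`depth` passage ids retrieved by model_msmarco_s1,
    # drop judged positives, then randomly pick `n_sample` hard negatives.
    candidates = [pid for pid in ranked_pids[:depth] if pid not in positive_pids]
    return random.sample(candidates, min(n_sample, len(candidates)))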
Second, I run Fine-tuning Stage 2 with
CUDA_VISIBLE_DEVICES=3 python -m tevatron.driver.train \
--output_dir model_msmarco_s2 \
--model_name_or_path ../data/co-condenser-marco \
--save_steps 20000 \
--train_dir ../data/msmarco-passage/tain_dir_hn_dr_cocondenser200 \
--data_cache_dir ../data/msmarco-passage-tain_hn_dr_cocondenser200-cache \
--fp16 \
--dataloader_num_workers 2 \
--per_device_train_batch_size 8 \
--train_n_passages 8 \
--learning_rate 5e-6 \
--q_max_len 16 \
--p_max_len 128 \
--num_train_epochs 2 \
--logging_steps 500
and get MRR@10=0.3657, R@1000=0.9761 (your reported results are MRR@10=0.382, R@1000=0.984).
There are a few settings I would like to confirm:
- Is the training data for Fine-tuning Stage 2 only the mined hard negatives, not concatenated with the BM25 negatives?
- Are the initial parameters for Stage 2 loaded from co-condenser-marco, not from the model_msmarco_s1 checkpoint?
- What are the intended settings of per_device_train_batch_size, train_n_passages, learning_rate, and num_train_epochs for Fine-tuning Stage 2?
Thank you in advance!