
base checkpoint selection

alsbhn opened this issue 2 years ago · 3 comments

I see in the code that two models (distilbert-base-uncased, msmarco-distilbert-margin-mse) are recommended as initial checkpoints. I tried other Sentence-Transformers models such as all-mpnet-base-v2, but it didn't work. Is there a difference in architecture or implementation between these models and the other models out there? Which models can be used as initial checkpoints here?
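For reference, the swap I tried looks roughly like this (a sketch following the GPL README; exact argument names may differ between versions, and the data paths are placeholders):

```python
import gpl

gpl.train(
    path_to_generated_data="generated/my-dataset",   # placeholder path
    output_dir="output/my-dataset",                  # placeholder path
    # Swapping in another Sentence-Transformers model as the initial checkpoint:
    base_ckpt="sentence-transformers/all-mpnet-base-v2",
    gpl_score_function="dot",
    batch_size_gpl=32,
    gpl_steps=140000,
    generator="BeIR/query-gen-msmarco-t5-base-v1",
    retrievers=["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"],
    retriever_score_functions=["cos_sim", "cos_sim"],
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
    qgen_prefix="qgen",
)
```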

alsbhn · Jun 03 '22

Hi @alsbhn, could you please tell me what you mean by "didn't work"? Do you mean the code was not runnable with this setting, or is it something about the performance?

kwang2049 · Jun 08 '22

The code runs without errors; the issue is the performance. When I use "distilbert-base-uncased" or "msmarco-distilbert-margin-mse" as the base checkpoint, the performance increases after a few tens of thousands of steps, as expected. But with other models like all-mpnet-base-v2 and all-MiniLM-L6-v2, the model does not perform well on my dataset, and the performance even decreases as I train it for more steps.

alsbhn · Jun 09 '22

Thanks for pointing out this issue. I need some time to check what the exact reason could be. As far as I can see, there are four potential reasons:

(1) The base checkpoint might already be stronger than the teacher cross-encoder;

(2) The training steps might be too few: for some target datasets, I found there could be degradation at the beginning, but the final performance improved after longer training (e.g. 100K steps);

(3) The negative miner might be too weak. To rule this out, we can try setting base_ckpt and retrievers to the same checkpoint, e.g. sentence-transformers/all-mpnet-base-v2 (see the sketch below). From my experience, this is very important when using TAS-B as the base checkpoint;

(4) It might be due to the similarity function, i.e. dot product vs. cosine similarity. @nreimers recently found that MarginMSE results in poor in-domain performance when trained with cosine similarity (compared with a simple CrossEntropy loss). I am not sure whether the same holds for the domain-adaptation scenario. Note that both all-mpnet-base-v2 and all-MiniLM-L6-v2 were trained with cosine similarity.
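To make (3) and (4) concrete, here is a minimal sketch of the configuration I mean, assuming the gpl.train entry point as documented in the GPL README (argument names may differ slightly across versions, and the paths are placeholders):

```python
import gpl

gpl.train(
    path_to_generated_data="generated/your-dataset",  # placeholder path
    output_dir="output/your-dataset",                 # placeholder path
    # (3) Use the same checkpoint for initialization and for mining hard negatives,
    # so the negatives match the strength of the base model:
    base_ckpt="sentence-transformers/all-mpnet-base-v2",
    retrievers=["sentence-transformers/all-mpnet-base-v2"],
    retriever_score_functions=["cos_sim"],  # all-mpnet-base-v2 was trained with cosine similarity
    # (4) The score function used during MarginMSE training; GPL uses dot product by default:
    gpl_score_function="dot",
    # (2) Train longer than a few 10K steps before judging the final performance:
    gpl_steps=100000,
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
    qgen_prefix="qgen",
)
```

As for (4), MarginMSE regresses the student's margin (positive score minus negative score) onto the teacher cross-encoder's margin. A toy sketch of the objective (not GPL's actual implementation) to illustrate the difference:

```python
import torch
import torch.nn.functional as F

def margin_mse_loss(q, d_pos, d_neg, teacher_margin, score="dot"):
    """Match the student margin s(q, d+) - s(q, d-) to the teacher's margin (MarginMSE)."""
    if score == "dot":
        s_pos, s_neg = (q * d_pos).sum(-1), (q * d_neg).sum(-1)  # unbounded, like the teacher's logits
    else:  # "cos_sim"
        s_pos = F.cosine_similarity(q, d_pos, dim=-1)            # bounded in [-1, 1]
        s_neg = F.cosine_similarity(q, d_neg, dim=-1)
    return F.mse_loss(s_pos - s_neg, teacher_margin)
```

With dot product the student margins are unbounded like the teacher's logits, while with cosine similarity they are squeezed into [-2, 2], which may be part of the mismatch.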

kwang2049 · Jun 16 '22