
base checkpoint selection

alsbhn opened this issue 2 years ago · 3 comments

I see in the code that two models (distilbert-base-uncased, msmarco-distilbert-margin-mse) are recommended as initial checkpoints. I tried other Sentence-Transformers models such as all-mpnet-base-v2, but it didn't work. Is there a difference in architecture or implementation between these models and the other models out there? Which models can be used as initial checkpoints here?
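For reference, the swap I tried looks roughly like this (a sketch following the GPL README; exact argument names may differ between versions, and the data paths are placeholders):

```python
import gpl

gpl.train(
    path_to_generated_data="generated/my-dataset",   # placeholder path
    output_dir="output/my-dataset",                  # placeholder path
    # Swapping in another Sentence-Transformers model as the initial checkpoint:
    base_ckpt="sentence-transformers/all-mpnet-base-v2",
    gpl_score_function="dot",
    batch_size_gpl=32,
    gpl_steps=140000,
    generator="BeIR/query-gen-msmarco-t5-base-v1",
    retrievers=["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"],
    retriever_score_functions=["cos_sim", "cos_sim"],
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
    qgen_prefix="qgen",
)
```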

alsbhn · Jun 03 '22

Hi @alsbhn, could you please tell me what you mean by "didn't work"? Do you mean the code was not runnable with this setting, or is it something about the performance?

kwang2049 · Jun 08 '22

The code runs without errors; the issue is the performance. When I use "distilbert-base-uncased" or "msmarco-distilbert-margin-mse" as the base checkpoint, the performance increases after a few tens of thousands of steps, as expected. But with other models like all-mpnet-base-v2 and all-MiniLM-L6-v2, the model does not perform well on my dataset, and the performance even decreases as I train it for more steps.

alsbhn · Jun 09 '22

Thanks for pointing out this issue. I need some time to check what the exact reason could be. As far as I can see, there are four potential reasons:

(1) The base checkpoint might already be stronger than the teacher cross-encoder;

(2) The training steps might be too few: for some target datasets, I found there could be degradation at the beginning, but the final performance improved after longer training (e.g. 100K steps);

(3) The negative miner might be too weak. To rule this out, we can try setting base_ckpt and retrievers to the same checkpoint, e.g. sentence-transformers/all-mpnet-base-v2 (see the sketch below). From my experience, this is very important when using TAS-B as the base checkpoint;

(4) It might be due to the similarity function, i.e. dot product vs. cosine similarity. @nreimers recently found that MarginMSE results in poor in-domain performance when trained with cosine similarity (compared with a simple CrossEntropy loss). I am not sure whether the same holds for the domain-adaptation scenario. Note that both all-mpnet-base-v2 and all-MiniLM-L6-v2 were trained with cosine similarity.
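To make (3) and (4) concrete, here is a minimal sketch of the configuration I mean, assuming the gpl.train entry point as documented in the GPL README (argument names may differ slightly across versions, and the paths are placeholders):

```python
import gpl

gpl.train(
    path_to_generated_data="generated/your-dataset",  # placeholder path
    output_dir="output/your-dataset",                 # placeholder path
    # (3) Use the same checkpoint for initialization and for mining hard negatives,
    # so the negatives match the strength of the base model:
    base_ckpt="sentence-transformers/all-mpnet-base-v2",
    retrievers=["sentence-transformers/all-mpnet-base-v2"],
    retriever_score_functions=["cos_sim"],  # all-mpnet-base-v2 was trained with cosine similarity
    # (4) The score function used during MarginMSE training; GPL uses dot product by default:
    gpl_score_function="dot",
    # (2) Train longer than a few 10K steps before judging the final performance:
    gpl_steps=100000,
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
    qgen_prefix="qgen",
)
```

As for (4), MarginMSE regresses the student's margin (positive score minus negative score) onto the teacher cross-encoder's margin. A toy sketch of the objective (not GPL's actual implementation) to illustrate the difference:

```python
import torch
import torch.nn.functional as F

def margin_mse_loss(q, d_pos, d_neg, teacher_margin, score="dot"):
    """Match the student margin s(q, d+) - s(q, d-) to the teacher's margin (MarginMSE)."""
    if score == "dot":
        s_pos, s_neg = (q * d_pos).sum(-1), (q * d_neg).sum(-1)  # unbounded, like the teacher's logits
    else:  # "cos_sim"
        s_pos = F.cosine_similarity(q, d_pos, dim=-1)            # bounded in [-1, 1]
        s_neg = F.cosine_similarity(q, d_neg, dim=-1)
    return F.mse_loss(s_pos - s_neg, teacher_margin)
```

With dot product the student margins are unbounded like the teacher's logits, while with cosine similarity they are squeezed into [-2, 2], which may be part of the mismatch.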

kwang2049 · Jun 16 '22