
Inquiry about Configuration Details for "ecir23-scratch-tydi-japanese-splade" Model

[Open] kuro96al opened this issue on Feb 19, 2024 · 4 comments

Hello, I am currently developing a Japanese model and have been referencing the "ecir23-scratch-tydi-japanese-splade" model on Hugging Face for guidance. I would greatly appreciate it if you could share the specific settings used to create this model, including the base models and datasets. This information will be incredibly helpful for my project. Thank you in advance for your assistance.

URL: https://huggingface.co/naver/ecir23-scratch-tydi-japanese-splade

— kuro96al, Feb 19, 2024

Hi @kuro96al,

We pretrained the model from scratch using the Japanese Mr. TyDi corpus (https://github.com/castorini/mr.tydi), then trained it with a contrastive loss on the Japanese mMARCO dataset (https://github.com/unicamp-dl/mMARCO), and finally fine-tuned it with the Japanese Mr. TyDi train query set.
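
For reference, here is a minimal sketch of the SPLADE pooling and an in-batch contrastive loss of the kind used in that second stage. This is illustrative rather than the exact training code, and the function names are placeholders:

```python
import torch
import torch.nn.functional as F

def splade_pool(mlm_logits, attention_mask):
    # SPLADE sparse representation: log-saturated ReLU over the MLM
    # vocabulary logits, max-pooled over the sequence dimension.
    # mlm_logits: (batch, seq_len, vocab); attention_mask: (batch, seq_len)
    weights = torch.log1p(torch.relu(mlm_logits))
    weights = weights * attention_mask.unsqueeze(-1)  # mask out padding
    return weights.max(dim=1).values  # (batch, vocab)

def contrastive_loss(q_reps, d_reps):
    # In-batch negatives: score every query against every document;
    # the matching (diagonal) document is the positive.
    scores = q_reps @ d_reps.T  # (batch, batch)
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```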

The model uses a DistilBERT architecture (6 layers, 768 hidden dimensions), but as said above, it is initialized randomly and then trained as described in the previous paragraph.
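
Concretely, building such a randomly initialized DistilBERT with transformers looks roughly like this; the vocab_size below is an assumption and depends on the Japanese tokenizer you use:

```python
from transformers import DistilBertConfig, DistilBertForMaskedLM

config = DistilBertConfig(
    vocab_size=32000,  # assumption: set to your Japanese tokenizer's vocab
    n_layers=6,        # 6 transformer layers
    dim=768,           # 768 hidden dimensions
    hidden_dim=3072,   # FFN inner size (4 * dim, the DistilBERT default)
    n_heads=12,
)
# No pretrained checkpoint is loaded: the weights are randomly initialized.
model = DistilBertForMaskedLM(config)
```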

For more information, here's a paper describing the strategies we used to develop that model and what we were aiming for: https://arxiv.org/pdf/2301.10444.pdf

— carlos-lassance, Feb 19, 2024

Thank you for your response. Is the pre-trained model uploaded on platforms like Hugging Face?

— kuro96al, Feb 24, 2024

We attempted to train SPLADE based on the model found at https://huggingface.co/line-corporation/line-distilbert-base-japanese/tree/main, but it seems that there were issues with the vocabulary that prevented successful training.

— kuro96al, Feb 24, 2024

> Thank you for your response. Is the pre-trained model uploaded on platforms like Hugging Face?

Unfortunately it is not; I'm not sure if we still have it...

> We attempted to train SPLADE based on the model found at https://huggingface.co/line-corporation/line-distilbert-base-japanese/tree/main, but it seems that there were issues with the vocabulary that prevented successful training.

Yeah, we found similar problems with a ton of models; that's one of the reasons we went with training a model from scratch.
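
If it helps, a quick way to spot this kind of mismatch is to compare the tokenizer's vocabulary with the MLM head's output size, since SPLADE term weights live in vocab space. A minimal diagnostic sketch (the line-distilbert tokenizer is custom, hence trust_remote_code):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "line-corporation/line-distilbert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(name)

# SPLADE scores every token id in the vocabulary, so the tokenizer and
# the MLM head must agree on vocabulary size and token ids.
print("tokenizer vocab:", len(tokenizer))
print("model vocab_size:", model.config.vocab_size)
```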

— carlos-lassance, Feb 28, 2024