
The difference between multilingual-e5-base and e5-base

Open HAOChuzhan opened this issue 1 year ago • 4 comments

I am using the multilingual-e5-base model, which performs well on Chinese datasets. Thank you very much for your work!

I'd like to ask you a few questions.

  • Do both models have the same two-stage training?
  • What are the specific differences between the training data of the two stages for both models?
  • If I want to fine-tune a larger model (chinese-roberta-large), how can I achieve the performance of your multilingual-e5-base model?

I would be very grateful if the author could answer my questions! 😊

HAOChuzhan avatar Jun 05 '23 10:06 HAOChuzhan

For your questions:

  • Do both models have the same two-stage training? Yes, the techniques are the same, but the data is different. The first stage is contrastive pre-training, and the second stage is supervised fine-tuning.

  • What are the specific differences between the training data of the two stages for both models? Multilingual-e5 models use multilingual data for both stages, while e5-base only uses English data.

  • If I want to fine-tune a larger model (chinese-roberta-large), how can I achieve the performance of your multilingual-e5-base model? You would need to collect a large number of Chinese text pairs and then follow our paper to do the two-stage training; this is generally a time-consuming process. A rough sketch of the training objective is shown below.
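
For readers who want to attempt the third point, here is a minimal sketch of the contrastive (InfoNCE) objective with in-batch negatives that both training stages rely on. This is not the authors' actual training code: the Chinese RoBERTa checkpoint name, pooling, temperature, and toy data are assumptions for illustration only.

```python
# Minimal sketch of contrastive training with in-batch negatives (InfoNCE).
# NOT the E5 authors' training code; backbone name, temperature, pooling,
# and toy data are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "hfl/chinese-roberta-wwm-ext-large"  # assumed Chinese RoBERTa backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)


def encode(texts):
    """Mean-pool the last hidden states into one normalized vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                      return_tensors="pt")
    outputs = encoder(**batch)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return F.normalize(summed / counts, dim=-1)


def info_nce_loss(queries, passages, temperature=0.05):
    """The i-th query should match the i-th passage; every other passage
    in the batch serves as an in-batch negative."""
    q = encode(queries)                    # (batch, hidden)
    p = encode(passages)                   # (batch, hidden)
    scores = q @ p.T / temperature         # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))       # positives sit on the diagonal
    return F.cross_entropy(scores, labels)


# Toy batch of Chinese query-passage pairs
queries = ["如何提高睡眠质量?", "北京有哪些著名景点?"]
passages = ["保持规律作息有助于改善睡眠质量。", "故宫和长城是北京最著名的景点。"]
loss = info_nce_loss(queries, passages)
loss.backward()
```

The first stage would run this objective over large-scale, weakly supervised text pairs; the second stage typically reuses the same loss on labeled pairs, usually with mined hard negatives added to the in-batch negatives.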

intfloat avatar Jun 06 '23 04:06 intfloat

Thanks for your reply! I found that only the multilingual-e5-base model is provided on Hugging Face. Has the multilingual-e5-large version been released? If so, could you please provide the multilingual-e5-large checkpoint?

HAOChuzhan avatar Jun 07 '23 02:06 HAOChuzhan

We'll release the multilingual-e5-large checkpoint, but it will take some time, perhaps a few weeks.

intfloat avatar Jun 07 '23 04:06 intfloat

I am anxious to test the new release.

wilfoderek avatar Jun 21 '23 16:06 wilfoderek