sent2vec

[Improvement]: Add french and german pre-trained models

Open a-pagano opened this issue 6 years ago • 7 comments

Hi,

First of all thanks for the great work!

I have trained unigram models on the Wikipedia corpus in German and French and would like to share them. The German model is 7.3GB and the French model is 4.4GB.

Both models were trained on the latest (preprocessed) wiki dumps with the parameters given in the paper for training "Wiki Sent2Vec unigrams" models (dim:600, minCount:8, minCountLabel:20, lr:0.2, epoch:9, t:0.00001, dropoutK:0, neg:10). Let me know if you're interested and, if so, where I can upload them.
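For anyone wanting to reproduce a run like this, the following is a minimal sketch of how the parameters listed above could be turned into a `./fasttext sent2vec` command line. The input and output paths are hypothetical placeholders, and `-wordNgrams 1` is an assumption for a unigram model since it is not among the listed parameters. It only builds and prints the command and is not necessarily the exact invocation used for these models.

```python
import shlex

# Hyperparameters as quoted above, kept as strings so they are passed
# to the command line exactly as written (e.g. "0.00001", not "1e-05").
PARAMS = {
    "dim": "600",
    "minCount": "8",
    "minCountLabel": "20",
    "lr": "0.2",
    "epoch": "9",
    "t": "0.00001",
    "dropoutK": "0",
    "neg": "10",
}


def build_command(input_path, output_prefix, params=PARAMS):
    """Return the argument list for a unigram sent2vec training run."""
    argv = ["./fasttext", "sent2vec", "-input", input_path, "-output", output_prefix]
    # Assumption: a unigram model corresponds to -wordNgrams 1.
    argv += ["-wordNgrams", "1"]
    for name, value in params.items():
        argv += ["-" + name, value]
    return argv


if __name__ == "__main__":
    # Hypothetical file names; replace with the preprocessed wiki dump
    # (one sentence per line) and the desired output prefix.
    print(shlex.join(build_command("wiki_de_sentences.txt", "de_model")))
```

The resulting list can be passed to `subprocess.run` once the paths point at real files.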

a-pagano avatar Apr 17 '18 07:04 a-pagano

Hi! Thanks a lot for training those models :)! It could be interesting to offer them, but there should be a way to evaluate their performance. Do you know of any French and German supervised and/or unsupervised tasks we could use to benchmark those embeddings?

mpagli avatar Apr 17 '18 12:04 mpagli

Hi! I must say I am new to the world of word/sentence embeddings and do not know much about common evaluation methods or datasets for these languages. A quick search returned some datasets for "28 monolingual word similarity tasks for 6 languages" (the data/get_evaluation.sh script downloads datasets for German and French, among other languages) and some syntactic and semantic evaluation datasets for German (although these do not seem to be official benchmarks). German and French datasets can also be found for Task 2 of the official SemEval-2017 evaluation framework.

a-pagano avatar Apr 17 '18 14:04 a-pagano

Hi, sorry for the late reply. Can you share the models (preferably on Google Drive or Dropbox)? We'll try to run the evaluations using some downstream supervised tasks. We can't use word similarity tasks to benchmark our sentence embeddings, which are obtained by averaging, although we can use them to evaluate the robustness of the word embeddings.
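To illustrate the distinction, here is a small sketch with made-up toy vectors (not taken from any trained model). A word similarity task compares two individual word vectors, while a sentence-level task compares the averages of the word vectors in each sentence. The helper names are hypothetical, and the sketch only covers the plain unigram averaging case.

```python
import math


def average_embedding(tokens, word_vectors):
    """Average the vectors of the known tokens; zero vector if none are known."""
    dim = len(next(iter(word_vectors.values())))
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy 3-dimensional vectors, invented purely for illustration.
    toy_vectors = {
        "chat": [1.0, 0.0, 0.0],
        "chien": [0.8, 0.2, 0.0],
        "dort": [0.0, 1.0, 0.0],
        "mange": [0.0, 0.8, 0.2],
    }
    # Word-level comparison: what a word similarity dataset scores.
    print(cosine(toy_vectors["chat"], toy_vectors["chien"]))
    # Sentence-level comparison: what a downstream sentence task needs.
    s1 = average_embedding(["le", "chat", "dort"], toy_vectors)
    s2 = average_embedding(["le", "chien", "mange"], toy_vectors)
    print(cosine(s1, s2))
```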

guptaprkhr avatar Jun 13 '18 08:06 guptaprkhr

Sure! Here they are: https://drive.google.com/file/d/199WZvUYTDaOl-xAwhLowVNFFdv_2eiXF/view?usp=sharing (the tar archive contains two files: fr_model.bin and de_model.bin).

a-pagano avatar Jun 13 '18 09:06 a-pagano

Hi @a-pagano, thank you. We will evaluate the models and come back to you soon.

guptaprkhr avatar Jun 27 '18 11:06 guptaprkhr

Hi, sorry, can we have the results of your evaluation and the task used?

laleye avatar Jul 12 '19 16:07 laleye

We have tested the French model and the results were not that good. Could you please share your results?

adelra avatar Nov 21 '19 02:11 adelra