sent2vec
[Improvement]: Add French and German pre-trained models
Hi,
First of all, thanks for the great work!
I have trained unigram models on the Wikipedia corpus in German and French and would like to share them. The German model is 7.3 GB and the French model is 4.4 GB.
Both models have been trained on the latest (preprocessed) wiki dumps with the parameters given in the paper for the "Wiki Sent2Vec unigrams" models (dim: 600, minCount: 8, minCountLabel: 20, lr: 0.2, epoch: 9, t: 0.00001, dropoutK: 0, neg: 10). Let me know if you're interested and, if so, where I can upload them.
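In case it helps to reproduce the setup, here is a rough sketch of what such a training run could look like, wrapped in Python for convenience. The binary path `./fasttext`, the corpus file `wiki_fr_sentences.txt`, and the output name `fr_model` are placeholders rather than the exact ones used; the flags simply mirror the parameters listed above.

```python
import subprocess

# Placeholder paths: adjust to your local sent2vec build and preprocessed corpus
# (one tokenized sentence per line).
fasttext_bin = "./fasttext"          # binary built from the sent2vec repository
corpus = "wiki_fr_sentences.txt"     # preprocessed French Wikipedia dump
output = "fr_model"                  # produces fr_model.bin

# Flags mirroring the "Wiki Sent2Vec unigrams" parameters listed above.
cmd = [
    fasttext_bin, "sent2vec",
    "-input", corpus,
    "-output", output,
    "-dim", "600",
    "-minCount", "8",
    "-minCountLabel", "20",
    "-lr", "0.2",
    "-epoch", "9",
    "-t", "0.00001",
    "-dropoutK", "0",
    "-neg", "10",
    "-wordNgrams", "1",   # unigram model
]
subprocess.run(cmd, check=True)
```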
Hi! Thanks a lot for training those models :)! It could be interesting to offer them, but there should be a way to evaluate their performance. Do you know of any French and German supervised and/or unsupervised tasks we could use to benchmark those embeddings?
Hi! I must say I am new to the world of word/sentence embeddings and do not know much about common evaluation methods/datasets for these languages. A quick search turned up some datasets for "28 monolingual word similarity tasks for 6 languages" (the data/get_evaluation.sh script downloads datasets for German and French, among other languages) and some syntactic and semantic evaluation datasets for German (although these do not appear to be official benchmarks). German and French datasets are also available for Task 2 of the official SemEval-2017 evaluation framework.
Hi, sorry for the late reply. Can you share the models (preferably on Google Drive or Dropbox)? We'll try to do the evaluations using some downstream supervised tasks. We can't use word similarity tasks to benchmark our sentence embeddings, which are obtained by averaging, although we can use them to evaluate the robustness of the underlying word embeddings.
Sure! Here they are: https://drive.google.com/file/d/199WZvUYTDaOl-xAwhLowVNFFdv_2eiXF/view?usp=sharing. The tar archive contains two files: fr_model.bin and de_model.bin.
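For anyone who wants to try the shared archives, a minimal sketch of loading one of the models with the sent2vec Python wrapper; the file name fr_model.bin comes from the archive above, while the example sentence and the assumed preprocessing (tokenized, lowercased input) are only illustrative.

```python
import sent2vec

# Load the pre-trained French model extracted from the shared tar archive.
model = sent2vec.Sent2vecModel()
model.load_model("fr_model.bin")

# Embed a single (pre-tokenized, lowercased) sentence; returns a NumPy array.
embedding = model.embed_sentence("le chat dort sur le canapé")
print(embedding.shape)
```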
Hi @a-pagano , Thank you. We will evaluate the models and come back to you soon.
Hi, sorry to bother you, but can we have the results of your evaluation and the tasks used?
We have tested the French model and the results were not that good. Could you please share your results?