vecmap

low accuracy

Open 15091444119 opened this issue 7 years ago • 6 comments

I get only 10% accuracy on EN-DE using WMT16 as training data. The identical and unsupervised methods give roughly the same result.

15091444119 avatar Jul 30 '18 03:07 15091444119

How can I improve it?

15091444119 avatar Jul 30 '18 03:07 15091444119

What do you mean by "using WMT16 as training data"?

Last time I tried the unsupervised command in this repo on FastText embeddings for EN and DE, it worked well: at least a bit over 50% accuracy on the MUSE EN-DE bilingual dictionary.

zhangxiangnick avatar Jul 30 '18 15:07 zhangxiangnick

I mean I used the WMT16 corpus to train word2vec.

I have found my bug and now get 40% accuracy on the MUSE EN-DE test dictionary. What is your training corpus? Is FastText better than word2vec?

15091444119 avatar Jul 31 '18 00:07 15091444119

I didn't train my own embeddings. I used FastText pre-trained embeddings.

zhangxiangnick avatar Aug 05 '18 02:08 zhangxiangnick

Word2vec embeddings are purely co-occurrence-based, whereas fastText embeddings additionally take character-level information into account. It is therefore hard to compare them directly in general.
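
To make the "character information" point concrete, here is a minimal sketch of fastText-style subword extraction. It assumes the usual scheme of wrapping the word in `<` and `>` and taking character n-grams of length 3 to 6; the function name `char_ngrams` is hypothetical, and the real library also hashes the n-grams and keeps the whole word as its own unit, which is left out here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of `word` wrapped in boundary markers.

    Simplified illustration only: no hashing, no whole-word token.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Trigrams of "where", including the boundary markers.
trigrams = char_ngrams("where", 3, 3)

# Related word forms share n-grams, which a word-level model cannot exploit.
shared = set(char_ngrams("translate")) & set(char_ngrams("translation"))
```

Because related forms such as "translate" and "translation" share many n-grams, a subword model can give rare or unseen forms a sensible vector, whereas word2vec treats each form as an unrelated symbol.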

hassyGo avatar Aug 16 '18 21:08 hassyGo

@artetxem I've tried a similar approach with ELMo word embeddings. I have two almost identical English vocab files, and I extracted embeddings for both with ELMo. I just wanted to try out this library and see how it finds matches between these two almost identical files, as follows:

python3 map_embeddings.py --identical SRC.EMB SEMI-SRC.EMB SRC_MAPPED.EMB TRG_MAPPED.EMB

I then tried to find the matches for a few simple English words (was, she, is, the) in the shared embedding space with this command: python3 eval_translation.py SRC_MAPPED.EMB TRG_MAPPED.EMB -d TEST.DICT

but the accuracy was 0.0% for me!!
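
For anyone debugging a score like this, here is a minimal sketch of dictionary-based evaluation by cosine nearest neighbour. It is not the repo's `eval_translation.py`; the names `precision_at_1`, `src_emb`, `trg_emb`, and `gold` are hypothetical and the vectors are toy values. It reports coverage separately, on the assumption that a very low score can come from test entries missing from the embedding vocabularies rather than from wrong neighbours.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def precision_at_1(src_emb, trg_emb, gold):
    """Return (accuracy, coverage) for a gold dictionary.

    `gold` maps each source word to a set of acceptable target words.
    Source words without an embedding are skipped and lower the coverage.
    """
    covered = [w for w in gold if w in src_emb]
    if not covered:
        return 0.0, 0.0
    hits = 0
    for w in covered:
        # Nearest target word by cosine similarity.
        best = max(trg_emb, key=lambda t: cosine(src_emb[w], trg_emb[t]))
        hits += best in gold[w]
    return hits / len(covered), len(covered) / len(gold)


# Toy data: "the" is in the test dictionary but has no source embedding.
src_emb = {"she": [1.0, 0.0], "was": [0.0, 1.0]}
trg_emb = {"sie": [0.9, 0.1], "war": [0.1, 0.9]}
gold = {"she": {"sie"}, "was": {"war"}, "the": {"der", "die", "das"}}

accuracy, coverage = precision_at_1(src_emb, trg_emb, gold)
```

If coverage comes out near zero, the first thing to check is whether the words in the test dictionary are spelled exactly as they appear in the two embedding files.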

Another question: why does the mapped target embedding file contain the same words as the SRC.EMB file? I'm not sure how we could use TRG_MAPPED.EMB for, say, a Dutch text if it contains the same (English) words as SRC.EMB. I think I'm missing something here.
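
As a small illustration of what a mapping step does to an embedding file, the sketch below applies a linear transformation to every vector and leaves the words themselves untouched. The function `apply_mapping` and the toy matrix are hypothetical and stand in for whatever transformation the library learns; the point is only that a mapped file keeps the vocabulary of its input file, so mapping an English file gives English words back.

```python
def apply_mapping(emb, matrix):
    """Multiply each row vector in `emb` by `matrix`; the words are unchanged."""
    dim_out = len(matrix[0])
    mapped = {}
    for word, vec in emb.items():
        # out[j] = sum_i vec[i] * matrix[i][j]
        mapped[word] = [
            sum(x * matrix[i][j] for i, x in enumerate(vec))
            for j in range(dim_out)
        ]
    return mapped


# Toy "target" file that is really English, as in the experiment above.
semi_src = {"she": [1.0, 0.0], "was": [0.0, 1.0]}

# A 90-degree rotation as a stand-in for a learned mapping.
rotation = [[0.0, 1.0],
            [-1.0, 0.0]]

trg_mapped = apply_mapping(semi_src, rotation)
```

Under this view, a file usable for Dutch text would have to come from passing Dutch embeddings as the target input.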

yaserkl avatar Aug 31 '18 05:08 yaserkl