Is the objective normal?

ZhenYangIACAS opened this issue 6 years ago • 11 comments

I ran the code on my dataset, and the objective I got is 32.5354% after 67 iterations. Is that normal? How should I fine-tune the parameters?

ZhenYangIACAS avatar Nov 07 '17 10:11 ZhenYangIACAS

That depends entirely on your dataset. It seems a bit low compared to what I usually get, but it could be reasonable in your case. The only way to know for sure is to somehow evaluate your embeddings, although manually checking the nearest neighbors of a few words is enough to see whether the system is learning something.
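
For instance, a quick check along the following lines (a minimal sketch, not part of vecmap; the file paths and word list are placeholders) prints the nearest target-language neighbors of a few source words using the mapped embeddings:

```python
# Minimal sketch: load two mapped embedding files (word2vec text format)
# and print the nearest target-language neighbors of a few source words.
import numpy as np

def read_embeddings(path, max_words=50000):
    words, vectors = [], []
    with open(path, encoding='utf-8', errors='surrogateescape') as f:
        count, dim = map(int, f.readline().split())  # header: vocab size, dim
        for _ in range(min(count, max_words)):
            word, vec = f.readline().rstrip().split(' ', 1)
            words.append(word)
            vectors.append(np.array(vec.split(), dtype=float))
    return words, np.vstack(vectors)

src_words, x = read_embeddings('SRC.MAPPED.EMB.txt')  # placeholder paths
trg_words, z = read_embeddings('TRG.MAPPED.EMB.txt')
x /= np.linalg.norm(x, axis=1, keepdims=True)  # length-normalize so that
z /= np.linalg.norm(z, axis=1, keepdims=True)  # dot product = cosine similarity

for word in ['one', 'house', 'red']:  # replace with frequent source words
    if word not in src_words:
        continue
    i = src_words.index(word)
    top = np.argsort(-(z @ x[i]))[:5]  # 5 highest-similarity target words
    print(word, '->', [trg_words[j] for j in top])
```

If the neighbors look like plausible translations or related words, the mapping is learning something; if they look random, something is wrong upstream.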

The mapping method itself does not have any hyperparameters, so there is nothing to explore there. However, you may want to tune the hyperparameters of the embeddings themselves, try different normalization options, or play with the training corpus and dictionary, all of which could make a considerable difference.
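
For reference, the two normalization steps most commonly discussed here, length normalization followed by mean centering, amount to something like this (an illustrative numpy sketch, not the actual vecmap code):

```python
import numpy as np

def normalize(x):
    """Length-normalize each word vector, then mean-center each dimension.

    x: embedding matrix of shape (vocab_size, dim).
    """
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # 'unit' step: rows to length 1
    x = x - x.mean(axis=0, keepdims=True)             # 'center' step: zero mean per dim
    return x
```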

artetxem avatar Nov 07 '17 11:11 artetxem

I manually built a dictionary containing several word pairs for the translation test. The coverage is 100% and the accuracy is 0. Why is the accuracy 0?

ZhenYangIACAS avatar Nov 07 '17 11:11 ZhenYangIACAS

I obviously cannot know unless you give more details. What was your training setup (language pair, corpus, embeddings, dictionary...)? What commands did you run to learn the mapping and evaluate it?

artetxem avatar Nov 07 '17 12:11 artetxem

The language pair is English to Chinese, the corpus contains 200w (2,000,000) sentences, and the dictionary only contains five word pairs. I ran the command "python3 eval_translation.py train.en.txt.remBlank.tok.bpe.lf.50.mono.vectors.normalized.mapped train.zh.seg.txt.remBlank.bpe.lf.50.mono.vectors.normalized.mapped -d test_dic"

ZhenYangIACAS avatar Nov 07 '17 12:11 ZhenYangIACAS

The test_dict is (one word pair per line):

word 词语
I 我
you 他
hello 你好
hi 你好
thanks 谢谢
word 词
I 我们

And the mapped embeddings were obtained following the example in the README.

ZhenYangIACAS avatar Nov 07 '17 12:11 ZhenYangIACAS

So the embeddings were trained on only 200 sentences? That's way too little to get anything reasonable. The training dictionary of only 5 word pairs seems too small as well. In our paper we report positive results starting at 25 word pairs.

artetxem avatar Nov 07 '17 13:11 artetxem

@artetxem No, the embeddings are trained on 200w (2,000,000) sentences. I have expanded the dictionary to 25 word pairs, but the accuracy is still 0. Maybe my test dictionary is still too small?

ZhenYangIACAS avatar Nov 08 '17 00:11 ZhenYangIACAS

Your test dictionary is indeed very small, and it might be that you also need a larger training dictionary for English-Chinese. I would also recommend trying the numeral-based initialization; I would expect it to be more robust, assuming there are Arabic numerals in the Chinese training corpus. Also, how did you train your embeddings? What is your vocabulary size?
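
For intuition, the numeral-based initialization builds the seed dictionary from tokens that appear as identical numerals in both vocabularies, roughly along these lines (an illustrative sketch; the regex and function name are mine, not vecmap's):

```python
import re

# Match plain Arabic numerals such as 1990, 3.14, or 10,000 (illustrative).
NUMERAL = re.compile(r'^[0-9]+([.,][0-9]+)*$')

def numeral_seed_dictionary(src_vocab, trg_vocab):
    """Pair every numeral that appears in both vocabularies with itself."""
    trg = set(trg_vocab)
    return [(w, w) for w in src_vocab if NUMERAL.match(w) and w in trg]
```

This only helps if the Chinese side keeps Arabic numerals intact through segmentation and BPE, so that the same numeral tokens exist in both vocabularies.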

artetxem avatar Nov 09 '17 10:11 artetxem

@artetxem Yes, I am using the numeral-based initialization, and the vocabulary size for our model is 30000. I will test it with a bigger test dictionary. Thank you.

ZhenYangIACAS avatar Nov 10 '17 06:11 ZhenYangIACAS

@ZhenYangIACAS Hi, have you solved the problem?

liujiqiang999 avatar Sep 23 '18 04:09 liujiqiang999

@ZhenYangIACAS @JiqiangLiu Example command line (when training en2zh unsupervised, you need to pass the command-line argument --unsupervised_vocab 8000 to get reasonably good results):

python map_embeddings.py --unsupervised --unsupervised_vocab 8000 ./jy_data/model_en.vec ./jy_data/model_zh_j.vec ./jy_data/model_en_mapped2.vec ./jy_data/model_zh_j_mapped2.vec --cuda
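
For context, --unsupervised_vocab restricts the vocabulary to the top k entries during the unsupervised initialization step; limiting it (here to 8000) seems to make the initial alignment more reliable for distant language pairs such as English-Chinese.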

IT-coach-666 avatar May 23 '20 06:05 IT-coach-666