Is the objective normal?

ZhenYangIACAS opened this issue 6 years ago • 11 comments

I ran the code on my dataset, and the objective I got is 32.5354% after 67 iterations. Is that normal? How should I fine-tune the parameters?

ZhenYangIACAS avatar Nov 07 '17 10:11 ZhenYangIACAS

That depends entirely on your dataset. It seems a bit low compared to what I usually get, but it could be reasonable in your case. The only way to know for sure is to somehow evaluate your embeddings, although manually checking the nearest neighbors of a few words is enough to see whether the system is learning something.
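
For instance, a quick check along the following lines (a minimal sketch, not part of vecmap; the file paths and word list are placeholders) prints the nearest target-language neighbors of a few source words using the mapped embeddings:

```python
# Minimal sketch: load two mapped embedding files (word2vec text format)
# and print the nearest target-language neighbors of a few source words.
import numpy as np

def read_embeddings(path, max_words=50000):
    words, vectors = [], []
    with open(path, encoding='utf-8', errors='surrogateescape') as f:
        count, dim = map(int, f.readline().split())  # header: vocab size, dim
        for _ in range(min(count, max_words)):
            word, vec = f.readline().rstrip().split(' ', 1)
            words.append(word)
            vectors.append(np.array(vec.split(), dtype=float))
    return words, np.vstack(vectors)

src_words, x = read_embeddings('SRC.MAPPED.EMB.txt')  # placeholder paths
trg_words, z = read_embeddings('TRG.MAPPED.EMB.txt')
x /= np.linalg.norm(x, axis=1, keepdims=True)  # length-normalize so that
z /= np.linalg.norm(z, axis=1, keepdims=True)  # dot product = cosine similarity

for word in ['one', 'house', 'red']:  # replace with frequent source words
    if word not in src_words:
        continue
    i = src_words.index(word)
    top = np.argsort(-(z @ x[i]))[:5]  # 5 highest-similarity target words
    print(word, '->', [trg_words[j] for j in top])
```

If the neighbors look like plausible translations or related words, the mapping is learning something; if they look random, something is wrong upstream.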

The mapping method itself does not have any hyperparameters, so there is nothing to explore there. However, you may want to tune the hyperparameters of the embeddings themselves, try different normalization options, or play with the training corpus and dictionary, all of which could make a considerable difference.
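
For reference, the two normalization steps most commonly discussed here, length normalization followed by mean centering, amount to something like this (an illustrative numpy sketch, not the actual vecmap code):

```python
import numpy as np

def normalize(x):
    """Length-normalize each word vector, then mean-center each dimension.

    x: embedding matrix of shape (vocab_size, dim).
    """
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # 'unit' step: rows to length 1
    x = x - x.mean(axis=0, keepdims=True)             # 'center' step: zero mean per dim
    return x
```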

artetxem avatar Nov 07 '17 11:11 artetxem

I manually built a dictionary containing several word pairs for the translation test. The coverage is 100% and the accuracy is 0. Why is the accuracy 0?

ZhenYangIACAS avatar Nov 07 '17 11:11 ZhenYangIACAS

I obviously cannot know unless you give more details. What was your training setup (language pair, corpus, embeddings, dictionary...)? What commands did you run to learn the mapping and evaluate it?

artetxem avatar Nov 07 '17 12:11 artetxem

The language pair is English to Chinese, the corpus contains 200w (2,000,000) sentences, and the dictionary only contains five word pairs. I ran the command "python3 eval_translation.py train.en.txt.remBlank.tok.bpe.lf.50.mono.vectors.normalized.mapped train.zh.seg.txt.remBlank.bpe.lf.50.mono.vectors.normalized.mapped -d test_dic"

ZhenYangIACAS avatar Nov 07 '17 12:11 ZhenYangIACAS

The test_dict is (one word pair per line):

word 词语
I 我
you 他
hello 你好
hi 你好
thanks 谢谢
word 词
I 我们

And the mapped embeddings were obtained following the example in the README.

ZhenYangIACAS avatar Nov 07 '17 12:11 ZhenYangIACAS

So the embeddings were trained on only 200 sentences? That's way too little to get anything reasonable. The training dictionary of only 5 word pairs seems too small as well. In our paper we report positive results starting at 25 word pairs.

artetxem avatar Nov 07 '17 13:11 artetxem

@artetxem No, the embeddings are trained on 200w (2,000,000) sentences. I have expanded the dictionary to 25 word pairs, but the accuracy is still 0. Maybe my test dictionary is still too small?

ZhenYangIACAS avatar Nov 08 '17 00:11 ZhenYangIACAS

Your test dictionary is indeed very small, and it might be that you also need a larger training dictionary for English-Chinese. I would also recommend trying the numeral-based initialization; I would expect it to be more robust, assuming there are Arabic numerals in the Chinese training corpus. Also, how did you train your embeddings? What is your vocabulary size?
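
For intuition, the numeral-based initialization builds the seed dictionary from tokens that appear as identical numerals in both vocabularies, roughly along these lines (an illustrative sketch; the regex and function name are mine, not vecmap's):

```python
import re

# Match plain Arabic numerals such as 1990, 3.14, or 10,000 (illustrative).
NUMERAL = re.compile(r'^[0-9]+([.,][0-9]+)*$')

def numeral_seed_dictionary(src_vocab, trg_vocab):
    """Pair every numeral that appears in both vocabularies with itself."""
    trg = set(trg_vocab)
    return [(w, w) for w in src_vocab if NUMERAL.match(w) and w in trg]
```

This only helps if the Chinese side keeps Arabic numerals intact through segmentation and BPE, so that the same numeral tokens exist in both vocabularies.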

artetxem avatar Nov 09 '17 10:11 artetxem

@artetxem Yes, I am using the numeral-based initialization, and the vocabulary size for our model is 30000. I will test it with a bigger test dictionary. Thank you.

ZhenYangIACAS avatar Nov 10 '17 06:11 ZhenYangIACAS

@ZhenYangIACAS Hi, have you solved the problem?

liujiqiang999 avatar Sep 23 '18 04:09 liujiqiang999

@ZhenYangIACAS @JiqiangLiu Example command line (when training en2zh unsupervised, you need to pass the command-line argument --unsupervised_vocab 8000 to get reasonably good results):

python map_embeddings.py --unsupervised --unsupervised_vocab 8000 ./jy_data/model_en.vec ./jy_data/model_zh_j.vec ./jy_data/model_en_mapped2.vec ./jy_data/model_zh_j_mapped2.vec --cuda
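
For context, --unsupervised_vocab restricts the vocabulary to the top k entries during the unsupervised initialization step; limiting it (here to 8000) seems to make the initial alignment more reliable for distant language pairs such as English-Chinese.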

IT-coach-666 avatar May 23 '20 06:05 IT-coach-666