undreamt
Some questions...
Hello. I ran the experiment on a nearly 30w Tibetan-Chinese corpus and the results are very bad. (Most of the translated text reads smoothly, but it is totally irrelevant to the source text.)
I did the experiment following your paper, using BPE and VecMap (objective nearly 34%). Can I ask how large your training corpus is? I wonder whether my corpus is not big enough, or whether something is wrong with the mapping.
Thanks again in advance!
The size of our training corpora is as follows:
- Spanish: 386 million tokens
- French: 749 million tokens
- German: 1,606 million tokens
- English: 2,238 million tokens
What do you mean by "30w"? What is the size of your corpus in tokens?
That is 300,000 sentences, about 4 million words after text segmentation.
That might be too little (in fact, 300k sentences would be very little to train an NMT system even if they were parallel). In any case, the problem might not be with UNdreaMT but with VecMap, so you should first check whether your bilingual embeddings make sense or not.
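One quick way to check whether the mapped embeddings make sense is a nearest-neighbor lookup: take a few frequent source words and see whether their closest target-language words are plausible translations. Below is a minimal sketch of that check; the file names and the word2vec-style text format are assumptions, so adjust them to whatever your VecMap run actually produced.

```python
# Sanity-check mapped bilingual embeddings via cross-lingual nearest neighbors.
import numpy as np

def load_embeddings(path, limit=50000):
    """Read a word2vec-style text file: header line, then 'word v1 v2 ...'.

    Vectors are normalized to unit length so dot products equal cosine
    similarity. `limit` caps the vocabulary for a quick check.
    """
    words, vectors = [], []
    with open(path, encoding="utf-8", errors="ignore") as f:
        next(f)  # skip the "<count> <dim>" header line
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vectors.append(np.asarray(parts[1:], dtype=np.float32))
    matrix = np.vstack(vectors)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    return words, matrix

def nearest_neighbors(word, src_words, src_mat, trg_words, trg_mat, k=5):
    """Return the k target-language words closest (by cosine) to a source word."""
    idx = src_words.index(word)
    sims = trg_mat @ src_mat[idx]      # cosine similarity on unit vectors
    best = np.argsort(-sims)[:k]
    return [(trg_words[j], float(sims[j])) for j in best]
```

Usage would look like `load_embeddings("src_mapped.emb")` and `load_embeddings("trg_mapped.emb")` (hypothetical file names), then calling `nearest_neighbors` on a handful of frequent source words. If the top neighbors are not rough translations, the problem is upstream of UNdreaMT, in the embeddings or the mapping.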
Thank you very much! I'll check it.
How do you validate your model during training?