
Some questions...

Open lulu0-0 opened this issue 6 years ago • 5 comments

Hello. I ran the experiment on a Tibetan-Chinese corpus of nearly 30w sentences, and the results are very bad. (Most of the translated text reads smoothly, but it is totally irrelevant to the source text.)

I ran the experiment according to your paper, using BPE and VecMap (objective nearly 34%). May I ask how large your training corpus is? I wonder whether the problem is that my corpus is not big enough, or that something is wrong with the mapping.

Thanks again in advance!

lulu0-0 avatar Apr 09 '18 06:04 lulu0-0

The size of our training corpora is as follows:

  • Spanish: 386 million tokens
  • French: 749 million tokens
  • German: 1,606 million tokens
  • English: 2,238 million tokens

What do you mean by "30w"? What is the size of your corpus in tokens?

artetxem avatar Apr 09 '18 07:04 artetxem

That is 300,000 sentences, or 4 million words after text segmentation.

lulu0-0 avatar Apr 09 '18 08:04 lulu0-0

That might be too little (in fact, 300k sentences would be very little to train an NMT system even if they were parallel). In any case, the problem might not be with UNdreaMT but with VecMap, so you should first check whether your bilingual embeddings make sense or not.
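One way to run that check is to look up cross-lingual nearest neighbors for a few words whose translations you already know. The following is only a minimal sketch, not part of UNdreaMT or VecMap: it assumes the mapped embeddings are stored in word2vec text format (a `count dim` header, then one word and its vector per line), and the function names `read_embeddings` and `nearest_neighbors` are hypothetical helpers introduced for illustration.

```python
import numpy as np


def read_embeddings(lines):
    """Parse embeddings in word2vec text format from an iterable of lines.

    Assumed layout: a 'count dim' header, then 'word v1 ... vd' per line.
    Returns the word list and a row-normalized matrix.
    """
    it = iter(lines)
    _, dim = map(int, next(it).split())
    words, vectors = [], []
    for line in it:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        words.append(parts[0])
        vectors.append(np.array(parts[1:], dtype=float))
    matrix = np.vstack(vectors)
    # Length-normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return words, matrix / norms


def nearest_neighbors(word, src_words, src_matrix, trg_words, trg_matrix, k=5):
    """Return the k target words closest (by cosine) to a source word."""
    query = src_matrix[src_words.index(word)]
    similarities = trg_matrix @ query
    top = np.argsort(-similarities)[:k]
    return [(trg_words[i], float(similarities[i])) for i in top]
```

To use it, pass open file handles for the two mapped embedding files (for example `read_embeddings(open(path, encoding='utf-8'))`, where `path` is whatever file your mapping step wrote) and query a handful of frequent source words. If the top neighbors are not plausible translations, the mapping itself is the likely culprit rather than the translation model.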

artetxem avatar Apr 09 '18 10:04 artetxem

Thank you very much! I'll check it.

lulu0-0 avatar Apr 09 '18 10:04 lulu0-0

How do you validate your model during training?

amitk526188 avatar Jul 03 '19 06:07 amitk526188