undreamt Some questions...

Open lulu0-0 opened this issue 6 years ago • 5 comments

Hello. I ran the experiment on a nearly 30w Tibetan-Chinese corpus and the results are very bad. (Most of the translated text reads smoothly, but it is totally irrelevant to the source text.)

I did the experiment according to your paper, using BPE and VecMap (objective nearly 34%). Can I ask how large your training corpus is? I wonder whether it is because my corpus is not big enough, or whether there's something wrong with the mapping.

Thanks again in advance!

Apr 09 '18 06:04 lulu0-0

The size of our training corpora is as follows:

Spanish: 386 million tokens
French: 749 million tokens
German: 1,606 million tokens
English: 2,238 million tokens

What do you mean by "30w"? What is the size of your corpus in tokens?

Apr 09 '18 07:04 artetxem

That is 300,000 sentences, or 4 million words after text segmentation.

Apr 09 '18 08:04 lulu0-0

That might be too little (in fact, 300k sentences would be very little to train an NMT system even if they were parallel). In any case, the problem might not be with UNdreaMT but with VecMap, so you should first check whether your bilingual embeddings make sense or not.
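
One quick way to run that check is to take a few source words whose translations you know and look at their nearest neighbors in the target embeddings. The snippet below is only a minimal sketch, assuming both mapped embedding files are in word2vec text format (a `count dim` header followed by one `word v1 ... vd` line per word). The function names and the toy vectors are made up for illustration and are not part of VecMap or UNdreaMT.

```python
import io
import numpy as np


def read_embeddings(handle):
    """Read word2vec text format: a 'count dim' header, then one word per line."""
    count, dim = map(int, handle.readline().split())
    words, rows = [], []
    for _ in range(count):
        parts = handle.readline().rstrip().split(' ')
        words.append(parts[0])
        rows.append([float(x) for x in parts[1:dim + 1]])
    return words, np.array(rows, dtype=float)


def normalize(matrix):
    """Length-normalize each row so that dot products are cosine similarities."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    return matrix / norms


def nearest_neighbors(word, src_words, src_matrix, trg_words, trg_matrix, k=5):
    """Return the k target words closest (by cosine) to a given source word."""
    idx = src_words.index(word)
    sims = normalize(trg_matrix) @ normalize(src_matrix)[idx]
    best = np.argsort(-sims)[:k]
    return [(trg_words[j], float(sims[j])) for j in best]


if __name__ == "__main__":
    # Toy stand-ins for the two mapped embedding files (hypothetical data).
    src = io.StringIO("2 2\ncat 1.0 0.0\ndog 0.0 1.0\n")
    trg = io.StringIO("2 2\ngato 0.9 0.1\nperro 0.1 0.9\n")
    src_words, src_matrix = read_embeddings(src)
    trg_words, trg_matrix = read_embeddings(trg)
    for w in src_words:
        print(w, nearest_neighbors(w, src_words, src_matrix, trg_words, trg_matrix, k=2))
```

With real files, replace the `io.StringIO` objects with `open(path, encoding='utf-8')`. If the top neighbors of common words are mostly unrelated, the mapping is the likely culprit rather than the translation model.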

Apr 09 '18 10:04 artetxem

Thank you very much! I'll check it.

Apr 09 '18 10:04 lulu0-0

How do you validate your model during training?

Jul 03 '19 06:07 amitk526188
