tvsub

Chinese Sentences in train.en

Open PolarLion opened this issue 6 years ago • 1 comments

Hi, I found some Chinese sentences (about 4,000) in the train.en file. For example:

image

I'm not sure whether this problem also affects the other parallel data.

Thanks

PolarLion, Apr 22 '18 01:04

Hi,

Thanks for pointing that out.

Because the corpus is automatically extracted from bilingual subtitles, the training data contains some noise. You can filter out these sentences directly, removing each affected pair from both sides. With 2M sentence pairs in the training data, these 4K sentences (about 0.2%) should not affect the model much.
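A minimal sketch of such a filter, not an official cleaning script: it drops every pair whose English line contains a Chinese character, so both sides stay aligned. The function name `filter_pairs` and the choice of the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) as the test for "Chinese" are assumptions made for illustration.

```python
import re

# Assumption: a character in the basic CJK Unified Ideographs block
# is enough to flag a line as containing Chinese text.
CJK = re.compile(r"[\u4e00-\u9fff]")


def filter_pairs(en_lines, zh_lines):
    """Return the aligned pairs whose English side has no Chinese characters.

    A noisy pair is removed from both sides, so line i of the first
    returned list still corresponds to line i of the second.
    """
    kept_en, kept_zh = [], []
    for en, zh in zip(en_lines, zh_lines):
        if CJK.search(en):
            continue  # Chinese text on the English side: drop the whole pair
        kept_en.append(en)
        kept_zh.append(zh)
    return kept_en, kept_zh


# Tiny in-memory example (hypothetical sentences, not taken from the corpus).
en = ["How are you?", "你好吗?", "See you tomorrow."]
zh = ["你好吗?", "你好吗?", "明天见。"]
clean_en, clean_zh = filter_pairs(en, zh)
```

To apply it to the corpus, read train.en and the matching Chinese file line by line into two lists, pass them to `filter_pairs`, and write the two returned lists back out.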

We will also keep cleaning the data and release the cleaned data in the next version.

Cheers, Longyue

longyuewangdcu, Apr 23 '18 11:04