tvsub Chinese Sentences in train.en

Open PolarLion opened this issue 6 years ago • 1 comments

Hi, I found some Chinese sentences (about 4,000) in the train.en file. For example:

I'm not sure if these bugs will affect other parallel data.

Thanks

Apr 22 '18 01:04 PolarLion

Hi,

Thanks for pointing that out.

As the corpus is automatically extracted from bilingual subtitles, there is some noise in the training data. You could filter out such sentences directly, removing them on both sides. Given that the training data contains 2M sentence pairs, these 4K sentences should not affect the model much.

We will also keep cleaning the data and release it in the next version.

Cheers, Longyue

Apr 23 '18 11:04 longyuewangdcu
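
For anyone who wants to apply the filtering suggested above, here is a minimal sketch in Python. It is not part of the tvsub release. It assumes that "Chinese" can be detected with the CJK Unified Ideographs range (U+4E00 to U+9FFF) and that the two sides are line-aligned plain-text files. The file name train.zh and the output names in the usage example are hypothetical; only train.en is mentioned in this thread.

```python
import re

# Assumption: a sentence counts as "Chinese" if it contains at least one
# character from the CJK Unified Ideographs block (U+4E00 to U+9FFF).
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def contains_cjk(text):
    """Return True if the text contains at least one CJK ideograph."""
    return CJK_PATTERN.search(text) is not None


def filter_pairs(en_lines, zh_lines):
    """Drop every pair whose English side contains a CJK ideograph.

    The pair is removed from both sides so the files stay line-aligned.
    """
    if len(en_lines) != len(zh_lines):
        raise ValueError("parallel data must have the same number of lines")
    kept = [(en, zh) for en, zh in zip(en_lines, zh_lines)
            if not contains_cjk(en)]
    return [en for en, _ in kept], [zh for _, zh in kept]


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def write_lines(path, lines):
    """Write the lines to a UTF-8 text file, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


def filter_files(en_path, zh_path, en_out, zh_out):
    """Filter a pair of line-aligned files and return the number of pairs dropped."""
    en_lines = read_lines(en_path)
    zh_lines = read_lines(zh_path)
    kept_en, kept_zh = filter_pairs(en_lines, zh_lines)
    write_lines(en_out, kept_en)
    write_lines(zh_out, kept_zh)
    return len(en_lines) - len(kept_en)
```

A possible call, with hypothetical paths, would be `filter_files("train.en", "train.zh", "train.clean.en", "train.clean.zh")`. The check is deliberately strict: an English subtitle that legitimately quotes a single Chinese character is dropped as well, so a ratio-based threshold may be preferable if that matters for your use.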
