tvsub
Chinese Sentences in train.en
Hi, I found some Chinese sentences (about 4,000) in the train.en file. For example:
I'm not sure if these bugs will affect other parallel data.
Thanks
Hi,
Thanks for pointing that out.
As the corpus is automatically extracted from bilingual subtitles, there is some noise in the training data. You can filter out such sentences directly, removing each affected pair from both sides. Given that the training data contains 2M sentence pairs, these 4K sentences should not affect the model much.
We will also keep cleaning the data and release the cleaned data in the next version.
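For anyone who wants to do this filtering themselves, here is a minimal sketch, not an official cleaning script. It assumes the two sides are line-aligned and treats any character in the CJK Unified Ideographs range (U+4E00 to U+9FFF) on the English side as a sign of a bad pair. The function name `filter_pairs` is made up for illustration.

```python
import re

# Assumption: a line on the English side is "Chinese" if it contains at least
# one character from the CJK Unified Ideographs block (U+4E00..U+9FFF).
CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def filter_pairs(en_lines, zh_lines):
    """Drop every aligned pair whose English side contains a CJK character.

    Both lists must have the same length, with line i of one side being the
    translation of line i of the other. Returns the kept lines for each side.
    """
    if len(en_lines) != len(zh_lines):
        raise ValueError("the two sides must have the same number of lines")

    kept_en, kept_zh = [], []
    for en, zh in zip(en_lines, zh_lines):
        if CJK_RE.search(en):
            # Remove the pair from both sides so the alignment is preserved.
            continue
        kept_en.append(en)
        kept_zh.append(zh)
    return kept_en, kept_zh
```

To use it, read train.en and the corresponding Chinese file into two lists of lines, pass them to `filter_pairs`, and write the two returned lists back out. A stricter check (for example, a minimum ratio of CJK characters) may be preferable if some valid English lines legitimately quote Chinese text.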
Cheers, Longyue