nmt-en-vi
Issue with English-Vietnamese dataset alignment
Hi @stefan-it, I just downloaded the IWSLT 15 English-Vietnamese dataset and saw some blank lines in both files. So I tried to remove all blank lines with Notepad++. Then I saw that the number of sentences in train.en and train.vi is not equal: 133168 sentences for train.en and 133205 for train.vi.
Hi @lengockyquang,
I checked the training files: wc -l train.en yields a line count of 133317 (the same for the train.vi file). I think something is wrong with the Notepad++ display (maybe an issue with line breaks).
But could you just give some examples of empty lines? I'll check it then :)
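For anyone without wc at hand, here is a minimal Python sketch that counts newline bytes, which is the number wc -l reports. The helper name count_newlines is made up for illustration, and the file paths in the usage comment are assumed to point at the downloaded files.

```python
def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes in a file, mirroring what `wc -l` reports."""
    total = 0
    # Binary mode: no decoding and no newline translation, so only the
    # raw 0x0A bytes are counted.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


# Usage (paths are assumptions):
#   print(count_newlines("train.en"), count_newlines("train.vi"))
```

If both files report the same count here, they are line-aligned on disk, and any difference seen after editing comes from the removal step rather than from the download.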
I've checked some empty lines and realized that there are some odd cases where the source sentence is an empty line but the target sentence is not.
I think this is the reason: when we remove blank lines from each file separately, the two files become misaligned.
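A small sketch of how such one-sided blanks could be located, and why per-file removal breaks the alignment. The function name one_sided_blanks and the toy sentences are made up for illustration; they are not taken from the dataset.

```python
def one_sided_blanks(src_lines, tgt_lines):
    """Return indices where exactly one of the two parallel lines is blank."""
    return [
        i
        for i, (s, t) in enumerate(zip(src_lines, tgt_lines))
        if (not s.strip()) != (not t.strip())
    ]


# Toy data (made up): pair 1 has an empty source line but a real target line.
src = ["hello", "", "thank you"]
tgt = ["xin chào", "vâng", "cảm ơn"]

print(one_sided_blanks(src, tgt))  # [1]

# Removing blanks file by file, as a text editor would, leaves unequal lengths.
src_only = [s for s in src if s.strip()]
tgt_only = [t for t in tgt if t.strip()]
print(len(src_only), len(tgt_only))  # 2 3
```

Every line after the first one-sided blank shifts by one position in only one of the files, so all following sentence pairs no longer match.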
Hello, thanks lengockyquang. Once we know the cause, the fix is easy.
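def align(inpt, trgt):
    # inpt and trgt hold the full text of the source and target files.
    x = inpt.split('\n')
    y = trgt.split('\n')
    i = 0
    while i < len(x):
        # Drop the pair if either side is empty (or a single character),
        # so both lists always shrink together.
        if len(x[i]) < 2 or len(y[i]) < 2:
            x.pop(i)
            y.pop(i)
        else:
            i += 1
    assert len(x) == len(y)
    return x, y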
x,y = align(inpt, trgt)
print(x[-3], y[-3])
>> thank you very much for your time rất cảm ơn đã lắng nghe
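The align function above takes the file contents as strings. Below is a sketch of the same idea applied to files on disk, written with zip instead of repeated pop calls (each pop shifts the rest of the list, which gets slow on large files). It is a variant and not the code from the post: it treats a line as blank when it contains only whitespace, so one-character sentences are kept, unlike the len < 2 test. The names filter_parallel and filter_files, the UTF-8 encoding, and the output file names in the usage comment are all assumptions.

```python
def filter_parallel(src_lines, tgt_lines):
    """Keep only the pairs in which both sides are non-blank."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("inputs must have the same number of lines")
    kept = [(s, t) for s, t in zip(src_lines, tgt_lines) if s.strip() and t.strip()]
    return [s for s, _ in kept], [t for _, t in kept]


def filter_files(src_path, tgt_path, src_out, tgt_out):
    """Read two parallel files, drop incomplete pairs, write the rest."""
    # Encoding is assumed to be UTF-8.
    with open(src_path, encoding="utf-8") as f:
        src_lines = f.read().split("\n")
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = f.read().split("\n")

    src_kept, tgt_kept = filter_parallel(src_lines, tgt_lines)

    # Write one sentence per line, each terminated by a newline.
    with open(src_out, "w", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in src_kept)
    with open(tgt_out, "w", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in tgt_kept)
    return len(src_kept)


# Usage (output names are hypothetical):
#   n = filter_files("train.en", "train.vi", "train.clean.en", "train.clean.vi")
```

Writing to new files keeps the original download untouched, so the filtered result can be compared against it later.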