nmt-en-vi

Issue with English-Vietnamese dataset alignment

Open lengockyquang opened this issue 5 years ago • 3 comments

Hi @stefan-it, I just downloaded the IWSLT 15 English-Vietnamese dataset and saw some blank lines in both files. So I tried to remove all blank lines with Notepad++. Then I saw that the number of sentences in train.en and train.vi is not equal: 133168 sentences for train.en and 133205 for train.vi.

lengockyquang avatar Jul 14 '19 15:07 lengockyquang

Hi @lengockyquang,

I checked the training files, and wc -l train.en yields a line count of 133317 (the same for train.vi). I think something is wrong with the Notepad++ display (maybe some issue with line breaks).

But could you just give some examples of empty lines? I'll check it then :)
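One way to collect such examples is to count newlines the way wc -l does and then list the line numbers where only one side is blank. The following is a minimal sketch, not a tested tool for this dataset: the function names count_newlines and one_sided_blanks are hypothetical, and the two strings are toy stand-ins for the contents of train.en and train.vi.

```python
from typing import List, Tuple


def count_newlines(text: str) -> int:
    """Count newline characters, which is what `wc -l` reports."""
    return text.count("\n")


def one_sided_blanks(src: str, tgt: str) -> List[Tuple[int, str, str]]:
    """Return (1-based line number, source line, target line)
    for every position where exactly one of the two lines is blank."""
    src_lines = src.split("\n")
    tgt_lines = tgt.split("\n")
    found = []
    for n, (s, t) in enumerate(zip(src_lines, tgt_lines), start=1):
        # True only when one side is blank and the other is not.
        if (s.strip() == "") != (t.strip() == ""):
            found.append((n, s, t))
    return found


# Toy data (hypothetical), standing in for train.en and train.vi.
en = "hello\n\nthank you\n"
vi = "xin chào\ncảm ơn\ncảm ơn bạn\n"

print(count_newlines(en), count_newlines(vi))
print(one_sided_blanks(en, vi))
```

If the two newline counts match but an editor reports different totals, the difference likely comes from how the editor treats line breaks, not from the files themselves.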

stefan-it avatar Jul 14 '19 22:07 stefan-it

I've checked some empty lines and realized that there are some odd cases where the source sentence is an empty line but the target sentence is not.

image

I think this is the reason: when we remove blank lines from each file separately, the two files end up misaligned.
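The effect can be reproduced on a toy example. This is only an illustrative sketch with made-up data, and drop_blank is a hypothetical helper that mimics what removing blank lines in an editor does to each file on its own.

```python
def drop_blank(lines):
    """Remove blank lines from one file, ignoring the other file."""
    return [ln for ln in lines if ln.strip()]


# Toy corpus (hypothetical): line 2 is blank on the source side only.
en = ["a", "", "c", "d"]
vi = ["A", "B", "C", "D"]

en_clean = drop_blank(en)
vi_clean = drop_blank(vi)

# The line counts now differ, and every pair after the blank line is shifted.
print(len(en_clean), len(vi_clean))
print(list(zip(en_clean, vi_clean)))
```

After the independent cleanup, "c" is paired with "B" instead of "C", which is the same kind of shift that produces unequal sentence counts in the real files.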

lengockyquang avatar Jul 15 '19 02:07 lengockyquang

Hello, thanks lengockyquang. Once we know the cause, the fix is easy: drop a pair whenever either side is blank.

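def align(inpt, trgt):
    # Split both corpora into lines; x[i] is paired with y[i].
    x = inpt.split('\n')
    y = trgt.split('\n')
    # The inputs must start out line-aligned, otherwise y[i] below
    # could raise IndexError or pair the wrong sentences.
    assert len(x) == len(y)

    i = 0
    while i < len(x):
        # Drop the pair when either side is blank (fewer than 2 characters,
        # which also catches a lone '\r' left over from '\r\n' endings).
        if len(x[i]) < 2 or len(y[i]) < 2:
            x.pop(i)
            y.pop(i)
        else:
            i += 1

    return x, y

# Assumes inpt and trgt hold the full text of train.en and train.vi.
x, y = align(inpt, trgt)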
print(x[-3], y[-3])
>> thank you very much for your time  rất cảm ơn đã lắng nghe 
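For whole files, the same idea can be written as a single pass that filters the pairs together and writes the cleaned files back out. This is a sketch under assumptions, not the method used by the repository: filter_pairs and clean_files are hypothetical names, the output paths are up to the caller, and the 2-character threshold is carried over from the function above.

```python
from pathlib import Path


def filter_pairs(src_lines, tgt_lines, min_len=2):
    """Keep only pairs where both sides have at least min_len characters."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("inputs are not line-aligned")
    kept = [
        (s, t)
        for s, t in zip(src_lines, tgt_lines)
        if len(s.strip()) >= min_len and len(t.strip()) >= min_len
    ]
    return [s for s, _ in kept], [t for _, t in kept]


def clean_files(src_path, tgt_path, src_out, tgt_out):
    """Read two parallel files, drop blank pairs, write the results.

    Lines are split on '\n' only. str.splitlines() is avoided because it
    also breaks on characters such as U+2028, which `wc -l` does not count.
    """
    src = Path(src_path).read_text(encoding="utf-8").split("\n")
    tgt = Path(tgt_path).read_text(encoding="utf-8").split("\n")
    x, y = filter_pairs(src, tgt)
    Path(src_out).write_text("".join(s + "\n" for s in x), encoding="utf-8")
    Path(tgt_out).write_text("".join(t + "\n" for t in y), encoding="utf-8")
    return len(x)  # number of sentence pairs kept
```

A call such as clean_files("train.en", "train.vi", "train.clean.en", "train.clean.vi") would leave the original files untouched; the output file names here are only placeholders.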

huybik avatar Sep 25 '21 12:09 huybik