ua-gec Add generated M^2 files

Open pavlo-kuchmiichuk opened this issue 2 years ago • 1 comments

This PR contains two changes:

- adds a way to create a new data representation, annotated-source-sentences: the split source sentences with the annotations from the source document preserved;
- adds scripts that automatically create the M^2 files needed for evaluation.

The M^2 files are derived from the existing data representations, so any faults in the annotations, tokenization, or sentence splitting may carry over into the M^2 representation. Note that running m2-scorer with tokenized-target-sentences as the system output currently does not produce perfect scores, although it should. This should be fixed in future iterations.
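For reference, here is a minimal sketch of the M^2 layout (one `S` line per sentence, one `A` line per edit). It is not the code in this PR: `Edit`, `to_m2`, and `apply_edits` are hypothetical names, the example sentence and the `Grammar` label are illustrative only, and edits are assumed not to overlap. Applying the edits to the source tokens and comparing the result with the tokenized target is one possible way to find sentences that keep m2-scorer from giving a perfect score.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Edit:
    """Hypothetical container for one annotation (not a type from this repo)."""
    start: int        # index of the first source token covered by the edit
    end: int          # index one past the last covered token
    error_type: str   # illustrative label
    correction: str   # replacement tokens joined by spaces; "" for a deletion


def to_m2(tokens: List[str], edits: List[Edit], annotator_id: int = 0) -> str:
    """Render one sentence and its edits as an M^2 block."""
    lines = ["S " + " ".join(tokens)]
    for e in sorted(edits, key=lambda e: (e.start, e.end)):
        lines.append(
            f"A {e.start} {e.end}|||{e.error_type}|||{e.correction}"
            f"|||REQUIRED|||-NONE-|||{annotator_id}"
        )
    return "\n".join(lines)


def apply_edits(tokens: List[str], edits: List[Edit]) -> List[str]:
    """Apply non-overlapping edits right to left so earlier offsets stay valid."""
    out = list(tokens)
    for e in sorted(edits, key=lambda e: (e.start, e.end), reverse=True):
        out[e.start:e.end] = e.correction.split()
    return out


source = ["I", "has", "a", "apple", "."]
edits = [Edit(1, 2, "Grammar", "have"), Edit(2, 3, "Grammar", "an")]

m2_block = to_m2(source, edits)
corrected = apply_edits(source, edits)
print(m2_block)
print(" ".join(corrected))
```

If `corrected` differs from the corresponding line of tokenized-target-sentences, the mismatch is more likely in the annotations or tokenization than in the scorer.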

Apr 07 '22 17:04 pavlo-kuchmiichuk

Here are the data points where the number of sentences still differs between source-sentences and target-sentences:

train:

0058.src.txt
0634.src.txt
0801.src.txt
0807.src.txt
0870.src.txt
0882.src.txt
0913.src.txt
1237.src.txt
1245.src.txt
1529.src.txt
1668.src.txt
1849.src.txt

test:

0303.src.txt
0426.src.txt
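A list like the one above could be reproduced with a check along these lines. This is a sketch under assumptions, not the script used here: it assumes one sentence per line, that the two representations live in separate directories, and that target files are named `NNNN.tgt.txt` to match the `NNNN.src.txt` files. The target suffix and the function names are hypothetical.

```python
from pathlib import Path


def count_sentences(path):
    """Count non-empty lines, assuming one sentence per line."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def find_mismatches(source_dir, target_dir,
                    src_suffix=".src.txt", tgt_suffix=".tgt.txt"):
    """Return (source file name, n_source, n_target) for documents
    whose sentence counts differ. The suffixes are assumptions."""
    mismatches = []
    for src in sorted(Path(source_dir).glob(f"*{src_suffix}")):
        doc_id = src.name[: -len(src_suffix)]
        tgt = Path(target_dir) / f"{doc_id}{tgt_suffix}"
        if not tgt.exists():
            continue  # no counterpart to compare against
        n_src, n_tgt = count_sentences(src), count_sentences(tgt)
        if n_src != n_tgt:
            mismatches.append((src.name, n_src, n_tgt))
    return mismatches
```

Calling `find_mismatches` once per split (train, test) with the matching source-sentences and target-sentences directories would give the per-split lists.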

Apr 08 '22 10:04 pavlo-kuchmiichuk
