ua-gec
Add generated M^2 files
This PR contains a couple of changes:
- adds means to create a new data representation, annotated-source-sentences: these are the split source sentences with the annotations from the source document present;
- adds scripts to automatically create the M^2 files necessary for evaluation.
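For readers unfamiliar with the M^2 format, here is a minimal sketch of how one sentence and its edits could be rendered as an M^2 block. It is not the script added in this PR: the `Edit` class, the `to_m2` function, and the `"Grammar"` label in the usage example are hypothetical names chosen for illustration. The sketch assumes token-level spans and a single annotator.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    """Hypothetical container for one annotation (not the PR's data model)."""
    start: int        # index of the first source token covered by the edit
    end: int          # index one past the last covered token
    error_type: str   # annotation label
    correction: str   # replacement text; empty string for a deletion


def to_m2(source_tokens, edits, annotator_id=0):
    """Render one tokenized sentence and its edits as an M^2 block."""
    # The "S" line holds the tokenized source sentence.
    lines = ["S " + " ".join(source_tokens)]
    # Each "A" line holds one edit: span, type, correction, then fixed fields.
    for e in edits:
        lines.append(
            f"A {e.start} {e.end}|||{e.error_type}|||{e.correction}"
            f"|||REQUIRED|||-NONE-|||{annotator_id}"
        )
    # Blocks are separated by a blank line when concatenated into a file.
    return "\n".join(lines) + "\n"
```

For example, `to_m2(["I", "has", "a", "cat"], [Edit(1, 2, "Grammar", "have")])` yields an `S` line followed by a single `A 1 2|||Grammar|||have|||REQUIRED|||-NONE-|||0` line.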
M^2 files derive from the existing data representations, so if the annotations, tokenization, or sentence splitting are faulty, the M^2 representation may contain mistakes as well. Note that currently, running m2-scorer with tokenized-target-sentences as the system output does not produce perfect scores, although it should. This should be fixed in future iterations.
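One way to narrow down where the imperfect scores come from is to apply the edits of each M^2 block to its source sentence and compare the result with the corresponding tokenized target sentence. The sketch below is an assumption about how such a check could look, not part of this PR; `apply_m2_edits` is a hypothetical helper, and it assumes non-overlapping edits from a single annotator.

```python
def apply_m2_edits(m2_block):
    """Apply the edits of one M^2 block to its source; return corrected tokens."""
    source = None
    edits = []
    for line in m2_block.strip().splitlines():
        if line.startswith("S "):
            source = line[2:].split()
        elif line.startswith("A "):
            # Fields are separated by "|||": span, error type, correction, ...
            span, _error_type, correction = line[2:].split("|||")[:3]
            start, end = map(int, span.split())
            if start >= 0:  # skip "noop" annotations, which use a -1 -1 span
                edits.append((start, end, correction))
    if source is None:
        raise ValueError("M^2 block has no S line")
    tokens = list(source)
    # Apply edits right to left so that earlier offsets stay valid.
    for start, end, correction in sorted(edits, reverse=True):
        tokens[start:end] = correction.split()
    return tokens
```

If the reconstructed tokens differ from the matching line in tokenized-target-sentences, the mismatch points at the annotation spans or the tokenization rather than at the scorer.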
Here are the data points where the number of sentences still differs between source-sentences and target-sentences:
train:
- 0058.src.txt
- 0634.src.txt
- 0801.src.txt
- 0807.src.txt
- 0870.src.txt
- 0882.src.txt
- 0913.src.txt
- 1237.src.txt
- 1245.src.txt
- 1529.src.txt
- 1668.src.txt
- 1849.src.txt
test:
- 0303.src.txt
- 0426.src.txt
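A list like the one above could be regenerated with a small line-count comparison. This is a sketch under stated assumptions, not the script used for this PR: it assumes one sentence per line, and the `.tgt.txt` suffix for target files is a guess that should be adjusted to the actual naming in the repository. `find_mismatches` and `count_lines` are hypothetical helpers.

```python
from pathlib import Path


def count_lines(path):
    """Count lines in a UTF-8 text file (assumed: one sentence per line)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def find_mismatches(source_dir, target_dir,
                    src_suffix=".src.txt", tgt_suffix=".tgt.txt"):
    """Return source file names whose sentence count differs from the target's."""
    mismatched = []
    for src_path in sorted(Path(source_dir).glob("*" + src_suffix)):
        doc_id = src_path.name[: -len(src_suffix)]
        # Assumed naming scheme: same document id, different suffix.
        tgt_path = Path(target_dir) / (doc_id + tgt_suffix)
        if not tgt_path.exists():
            continue  # no counterpart to compare against
        if count_lines(src_path) != count_lines(tgt_path):
            mismatched.append(src_path.name)
    return mismatched
```

Calling `find_mismatches` once per split (train, test) with the source-sentences and target-sentences directories would produce the per-split lists.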