
Add generated M^2 files

Open pavlo-kuchmiichuk opened this issue 2 years ago • 1 comments

This PR contains a couple of changes:

  • adds a way to generate a new data representation, annotated-source-sentences: the source document split into sentences, with the annotations from the source document preserved;
  • adds scripts that automatically create the M^2 files needed for evaluation.
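
For readers unfamiliar with the format, here is a minimal sketch of how one M^2 entry is laid out. It is not the script added in this PR: the `to_m2` helper, the `Edit` tuple, and the English example sentence are hypothetical, and the `noop` line for unchanged sentences follows the convention used in common M^2 data sets.

```python
from typing import List, Tuple

# Hypothetical edit tuple: (start_token, end_token, error_type, correction).
# Token offsets are 0-based and the end offset is exclusive.
Edit = Tuple[int, int, str, str]


def to_m2(source_tokens: List[str], edits: List[Edit], annotator_id: int = 0) -> str:
    """Render one tokenized source sentence and its edits as an M^2 entry."""
    lines = ["S " + " ".join(source_tokens)]
    if not edits:
        # Common convention for a sentence that needs no correction.
        lines.append(f"A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||{annotator_id}")
    for start, end, error_type, correction in sorted(edits):
        lines.append(
            f"A {start} {end}|||{error_type}|||{correction}"
            f"|||REQUIRED|||-NONE-|||{annotator_id}"
        )
    # Entries in an M^2 file are separated by a blank line; the caller adds it.
    return "\n".join(lines) + "\n"


entry = to_m2(["I", "has", "a", "cat"], [(1, 2, "Grammar", "have")])
unchanged = to_m2(["All", "good"], [])
```

Each `S` line holds the tokenized source sentence, and each `A` line holds one edit as a token span, an error type, and the corrected text.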

The M^2 files are derived from the existing data representations, so any errors in the annotations, tokenization, or sentence splitting may carry over into the M^2 representation. Note that running m2-scorer with tokenized-target-sentences as the system output currently does not produce perfect scores, although it should. This should be fixed in future iterations.
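
One way to track down where the imperfect scores come from is to apply the edits of each M^2 entry to its source sentence and compare the result with the tokenized target sentence. The sketch below is not part of this PR and does not use m2-scorer itself. The helper names are hypothetical, and it assumes a single annotator, non-overlapping edits, and deletions written with an empty correction or `-NONE-`.

```python
from typing import List, Tuple


def parse_m2_entry(entry: str) -> Tuple[List[str], List[Tuple[int, int, str]]]:
    """Split one M^2 entry into source tokens and (start, end, correction) edits."""
    lines = entry.strip().split("\n")
    source_tokens = lines[0][2:].split()  # drop the leading "S "
    edits = []
    for line in lines[1:]:
        if not line.startswith("A "):
            continue
        span, _error_type, correction = line[2:].split("|||")[:3]
        start, end = (int(x) for x in span.split())
        if start == -1:  # noop edit: the sentence is unchanged
            continue
        edits.append((start, end, "" if correction == "-NONE-" else correction))
    return source_tokens, edits


def apply_edits(source_tokens: List[str], edits: List[Tuple[int, int, str]]) -> List[str]:
    """Apply non-overlapping edits from left to right."""
    result: List[str] = []
    cursor = 0
    for start, end, correction in sorted(edits):
        result.extend(source_tokens[cursor:start])  # copy untouched tokens
        result.extend(correction.split())           # empty correction is a deletion
        cursor = end
    result.extend(source_tokens[cursor:])
    return result


def m2_reproduces_target(entry: str, target_sentence: str) -> bool:
    """True if applying the entry's edits yields the tokenized target sentence."""
    source_tokens, edits = parse_m2_entry(entry)
    return apply_edits(source_tokens, edits) == target_sentence.split()
```

Entries for which `m2_reproduces_target` returns `False` would be the first candidates to inspect for annotation, tokenization, or sentence-splitting problems.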

pavlo-kuchmiichuk • Apr 07 '22 17:04

These are the documents where the number of sentences still differs between source-sentences and target-sentences:

train:

  • 0058.src.txt
  • 0634.src.txt
  • 0801.src.txt
  • 0807.src.txt
  • 0870.src.txt
  • 0882.src.txt
  • 0913.src.txt
  • 1237.src.txt
  • 1245.src.txt
  • 1529.src.txt
  • 1668.src.txt
  • 1849.src.txt

test:

  • 0303.src.txt
  • 0426.src.txt
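
A list like the one above can be reproduced with a short check. This is a rough sketch under stated assumptions, not the script used here: it assumes one sentence per non-empty line, and the `.tgt.txt` suffix for target files and both directory arguments are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def count_sentences(path: Path) -> int:
    """Count non-empty lines, assuming one sentence per line."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def find_sentence_count_mismatches(
    source_dir: Path,
    target_dir: Path,
    source_suffix: str = ".src.txt",
    target_suffix: str = ".tgt.txt",  # assumed naming, adjust to the real layout
) -> List[Tuple[str, int, int]]:
    """Return (source file name, source count, target count) for each mismatch."""
    mismatches = []
    for src in sorted(Path(source_dir).glob("*" + source_suffix)):
        doc_id = src.name[: -len(source_suffix)]
        tgt = Path(target_dir) / (doc_id + target_suffix)
        if not tgt.exists():
            continue  # unpaired documents are skipped rather than reported
        n_src, n_tgt = count_sentences(src), count_sentences(tgt)
        if n_src != n_tgt:
            mismatches.append((src.name, n_src, n_tgt))
    return mismatches
```

Running it once per split (train and test) would give the per-split lists of file names.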

pavlo-kuchmiichuk • Apr 08 '22 10:04