pySBD icon indicating copy to clipboard operation
pySBD copied to clipboard

How is accuracy on OPUS-100 computed?

Open bminixhofer opened this issue 1 year ago • 1 comments

Hi! Thanks for this library.

Since there is no notion of documents in the OPUS-100 dataset it is not clear to me how accuracy is computed. I tried a naive approach using pairwise joining of sentences:

from datasets import load_dataset
import pysbd

if __name__ == "__main__":
    sentences = [
        sample["de"].strip()
        for sample in load_dataset("opus100", "de-en", split="test")["translation"]
    ]

    correct = 0
    total = 0

    segmenter = pysbd.Segmenter(language="de")

    for sent1, sent2 in zip(sentences, sentences[1:]):
        out = tuple(
            s.strip() for s in segmenter.segment(sent1 + " " + sent2)
        )

        total += 1

        if out == (sent1, sent2):
            correct += 1

    print(f"{correct}/{total} = {correct / total}")

But I get 1011/1999 = 50.6% Accuracy which is not close to the 80.95% Accuracy reported in the paper.

Thanks for any help!

bminixhofer avatar Aug 27 '22 11:08 bminixhofer