pySBD
How is accuracy on OPUS-100 computed?
Hi! Thanks for this library.
Since the OPUS-100 dataset has no notion of documents, it is not clear to me how accuracy is computed. I tried a naive approach that joins each pair of consecutive sentences and checks whether the segmenter splits them back apart:
from datasets import load_dataset
import pysbd

if __name__ == "__main__":
    sentences = [
        sample["de"].strip()
        for sample in load_dataset("opus100", "de-en", split="test")["translation"]
    ]

    correct = 0
    total = 0
    segmenter = pysbd.Segmenter(language="de")

    for sent1, sent2 in zip(sentences, sentences[1:]):
        out = tuple(
            s.strip() for s in segmenter.segment(sent1 + " " + sent2)
        )
        total += 1
        if out == (sent1, sent2):
            correct += 1

    print(f"{correct}/{total} = {correct / total}")
But I get 1011/1999 = 50.6% accuracy, which is far from the 80.95% accuracy reported in the paper.
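To make the mismatches easier to inspect, the same pairwise metric can be written as a function that takes any segmenter callable and also returns the failing outputs. This is only a rough sketch of my naive approach, not the evaluation used in the paper. The names `pairwise_accuracy` and `toy_segment` are made up for illustration, and `toy_segment` is a simple regex stand-in rather than pySBD, so the snippet runs without downloading anything:

```python
import re
from typing import Callable, List, Sequence, Tuple


def pairwise_accuracy(
    sentences: Sequence[str],
    segment: Callable[[str], List[str]],
) -> Tuple[int, int, List[Tuple[str, ...]]]:
    """Join each pair of consecutive sentences and check whether
    `segment` splits the joined text back into exactly that pair."""
    correct = 0
    failures: List[Tuple[str, ...]] = []
    for sent1, sent2 in zip(sentences, sentences[1:]):
        out = tuple(s.strip() for s in segment(sent1 + " " + sent2))
        if out == (sent1, sent2):
            correct += 1
        else:
            failures.append(out)  # keep the wrong output for inspection
    total = max(len(sentences) - 1, 0)
    return correct, total, failures


def toy_segment(text: str) -> List[str]:
    # Hypothetical stand-in segmenter (NOT pySBD): split on whitespace
    # that follows a '.', '!' or '?'.
    return re.split(r"(?<=[.!?])\s+", text)


if __name__ == "__main__":
    sample = ["Hallo Welt.", "Wie geht es dir?", "Gut", "Danke!"]
    correct, total, failures = pairwise_accuracy(sample, toy_segment)
    print(f"{correct}/{total}")
    print(failures)
```

With pySBD, the second argument would be `pysbd.Segmenter(language="de").segment`. In the toy sample the only failure is the pair whose first sentence has no terminal punctuation, which no segmenter can be expected to split. I have not checked how many OPUS-100 test lines look like that, so this may or may not explain part of the gap.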
Thanks for any help!