indic_nlp_library

Poor sentence splitting performance on FLORES-200 Hindi

Open asusdisciple opened this issue 1 year ago • 1 comments

I tested the Indic NLP library's sentence splitter on the Hindi file of the FLORES-200 dataset. However, the performance is very poor, with an F1 score of 0.26. I used the package through the implementation in Facebook's stopes. My split function is shown below and is applied to a paragraph of 10 sentences. It seems that the package does not recognise "." as a sentence boundary for some reason. Do you have any ideas or suggestions?

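import typing as tp
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from indicnlp.tokenize import sentence_tokenize as indic_sent_tok

def get_split_indic(lang: str) -> tp.Callable[[str], tp.Iterable[str]]:
    # Assumed enclosing factory and normalizer setup (omitted in the original snippet).
    indic_normalizer = IndicNormalizerFactory().get_normalizer(lang)
    def split_indic(line: str) -> tp.Iterable[str]:
        """Split Indian text into sentences using Indic NLP tool."""
        line = indic_normalizer.normalize(line)
        for sent in indic_sent_tok.sentence_split(line, lang=lang):
            yield sent

    return split_indic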
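
To narrow down whether the terminator is the problem, it may help to compare against a naive baseline that treats both the danda and "." as sentence ends. The sketch below is only an illustration: the helper names `split_on_terminators` and `sentence_f1` are hypothetical, it does not use the Indic NLP library at all, and the exact-match, sentence-level F1 is an assumption, since the way the 0.26 score was computed is not stated here.

```python
import re
from collections import Counter
from typing import Iterable, List

# Split on whitespace that follows a danda (U+0964), double danda (U+0965),
# ".", "?" or "!". This is a naive rule, not the Indic NLP library's logic.
_TERMINATORS = re.compile(r"(?<=[\u0964\u0965.?!])\s+")


def split_on_terminators(text: str) -> List[str]:
    """Hypothetical baseline splitter: break after any terminator plus whitespace."""
    return [part.strip() for part in _TERMINATORS.split(text) if part.strip()]


def sentence_f1(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Exact-match sentence-level F1 (assumed metric, for illustration only)."""
    pred, ref = Counter(predicted), Counter(reference)
    overlap = sum((pred & ref).values())  # sentences present in both lists
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

If this baseline scores much higher than the library on the same paragraphs, that would suggest the "." terminator is indeed the cause. Note that the rule will also split after abbreviations followed by a space, so it is a diagnostic aid rather than a replacement.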

asusdisciple • Aug 03 '23 13:08