indic_nlp_library
Bad sentence splitting performance on the FLORES-200 Hindi data
I tested the Indic NLP package for sentence splitting on the Hindi file of the FLORES-200 dataset. However, the performance is very poor, with an F1 score of 0.26. I used the package through Facebook's stopes implementation. My split function looks like this and is applied to a paragraph of 10 sentences. It seems that the package does not recognise "." as a sentence boundary for some reason. Does anyone have ideas or suggestions?
    def split_indic(line: str) -> tp.Iterable[str]:
        """Split Indian text into sentences using Indic NLP tool."""
        line = indic_normalizer.normalize(line)
        for sent in indic_sent_tok.sentence_split(line, lang=lang):
            yield sent
    return split_indic
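
To check the guess that "." is not being treated as a sentence boundary, it may help to count which terminators actually occur in the paragraph and compare the library's output against a naive baseline. The following is only a rough sketch using the standard library. It is not how Indic NLP or stopes split sentences, and the names `TERMINATORS`, `count_terminators`, and `naive_split` are made up for illustration. The terminator set (danda, double danda, ".", "?", "!") is an assumption.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed terminator set: danda (U+0964), double danda (U+0965), and Latin ". ? !".
TERMINATORS = "\u0964\u0965.?!"

# Split on whitespace that directly follows one of the assumed terminators.
_SPLIT_RE = re.compile(r"(?<=[\u0964\u0965.?!])\s+")


def count_terminators(text: str) -> Counter:
    """Count how often each assumed terminator appears in the text."""
    return Counter(ch for ch in text if ch in TERMINATORS)


def naive_split(text: str) -> Iterable[str]:
    """Naive baseline: break after any assumed terminator followed by whitespace."""
    for sent in _SPLIT_RE.split(text.strip()):
        if sent:
            yield sent


if __name__ == "__main__":
    sample = "यह पहला वाक्य है। यह दूसरा वाक्य है. क्या यह तीसरा है?"
    print(count_terminators(sample))
    for sent in naive_split(sample):
        print(sent)
```

If the counts show that the paragraph ends most sentences with "." rather than the danda, and the naive baseline finds noticeably more boundaries than `sentence_split`, that would support the guess above. The baseline will wrongly split after abbreviations followed by a space, so it is only useful as a diagnostic.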