indic_nlp_library

Undo wrong Moses tokenization

Open: anoopkunchukuttan opened this issue 4 years ago • 1 comment

Some datasets have been pre-processed with the Moses tokenizer (or some other tokenizer) that handles the halant incorrectly: it treats the halant as punctuation and adds spaces around it. Add functionality to the normalizer to undo this behaviour.
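A minimal sketch of what such an undo step might look like, written as a standalone function rather than as part of the library's normalizer. The name `undo_halant_split` and the regex are hypothetical and are not an existing indic_nlp_library API. It assumes the damage has the form "consonant, space, halant, space, rest of word", and it covers only the scripts from Devanagari through Malayalam, whose Unicode blocks place the halant at offset 0x4D.

```python
import re

# Halant (virama) code points for Devanagari, Bengali, Gurmukhi, Gujarati,
# Oriya, Tamil, Telugu, Kannada and Malayalam. Each of these Unicode blocks
# is 0x80 wide and places the halant at offset 0x4D.
HALANTS = "".join(chr(0x0900 + 0x80 * k + 0x4D) for k in range(9))

# A halant preceded by a space is never legitimate, so use that as the
# signal that a tokenizer detached it. Also swallow any spaces after it.
_DETACHED_HALANT = re.compile(" +([{}]) *".format(HALANTS))


def undo_halant_split(text: str) -> str:
    """Hypothetical helper: re-attach halants that were split off by spaces."""
    return _DETACHED_HALANT.sub(r"\1", text)


# "\u0915 \u094d \u092f\u093e" is KA, space, HALANT, space, YA + AA sign
broken = "\u0915 \u094d \u092f\u093e"
fixed = undo_halant_split(broken)  # "\u0915\u094d\u092f\u093e"
```

A halant that is still attached to its consonant is left alone, so a word-final halant followed by a normal word boundary is not touched. The sketch cannot recover a word boundary that originally followed a detached halant, because the tokenizer's output no longer distinguishes that case.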

anoopkunchukuttan, Dec 29 '20 12:12

Hi @anoopkunchukuttan, can you add an example to it?

tathagata-raha, Aug 20 '21 17:08