indic_nlp_library Undo wrong Moses tokenization

Open anoopkunchukuttan opened this issue 4 years ago • 1 comment

Some datasets have been pre-processed with the Moses tokenizer (or some other tokenizer) that handles the halant (virama) incorrectly: it treats the halant as punctuation and adds spaces around it. Add functionality to the normalizer to undo this behaviour.

Dec 29 '20 12:12 anoopkunchukuttan

Hi @anoopkunchukuttan, can you add an example to it?

Aug 20 '21 17:08 tathagata-raha
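
To make the request concrete, here is a rough sketch of the kind of input and fix being described. It is not part of indic_nlp_library: the function name `undo_halant_tokenization` and the list of scripts covered are assumptions made for illustration. The idea is that a tokenizer that treats the halant as punctuation turns a conjunct such as क्या into क ् या, so the halant ends up as an isolated token. The sketch rejoins an isolated halant with its neighbours and leaves a halant that is already attached to a word alone.

```python
import re

# Halant (virama) code points for several Indic scripts.
# The choice of scripts here is an assumption, not the library's list.
HALANTS = (
    "\u094D"  # Devanagari
    "\u09CD"  # Bengali
    "\u0A4D"  # Gurmukhi
    "\u0ACD"  # Gujarati
    "\u0B4D"  # Odia
    "\u0BCD"  # Tamil
    "\u0C4D"  # Telugu
    "\u0CCD"  # Kannada
    "\u0D4D"  # Malayalam
)

# A halant split off as its own token: whitespace on both sides,
# followed by more text on the same line.
_ISOLATED_HALANT = re.compile(r"[ \t]+([" + HALANTS + r"])[ \t]+(?=\S)")
# A halant can never start a token, so whitespace before it is always spurious.
_SPACE_BEFORE_HALANT = re.compile(r"[ \t]+([" + HALANTS + r"])")


def undo_halant_tokenization(text):
    """Hypothetical helper: remove spaces wrongly inserted around a halant."""
    # First rejoin halants that were isolated on both sides.
    text = _ISOLATED_HALANT.sub(r"\1", text)
    # Then reattach any remaining halant to the character before it.
    return _SPACE_BEFORE_HALANT.sub(r"\1", text)


if __name__ == "__main__":
    broken = "\u0915 \u094D \u092F\u093E"  # KA, space, halant, space, YA + AA sign
    fixed = undo_halant_tokenization(broken)
    assert fixed == "\u0915\u094D\u092F\u093E"  # the conjunct with no spaces
    print(fixed)
```

One caveat with this approach: a word-final halant is legitimate and common (for example in Tamil), so if the tokenizer split a word-final halant, the original word boundary cannot be recovered from the text alone and the sketch will merge the two words. A real implementation would probably need script-aware rules or a word list to decide those cases.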
