indic_nlp_library
Undo wrong Moses tokenization
Some datasets have been pre-processed with the Moses tokenizer (or some other tokenizer), which incorrectly treats the halant (virama) as punctuation and inserts spaces around it. Add functionality to the normalizer to undo this behaviour.
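A minimal sketch of what such a clean-up step might look like, assuming the damage is plain spaces or tabs inserted on either side of the halant. The function name `undo_spaced_halant`, the `join_following` parameter, and the list of halant code points are hypothetical illustrations, not part of the existing indic_nlp_library API.

```python
import re

# Halant (virama) code points for several Indic scripts.
# This list is an assumption for illustration; extend it as needed.
HALANTS = (
    "\u094D"  # Devanagari
    "\u09CD"  # Bengali
    "\u0A4D"  # Gurmukhi
    "\u0ACD"  # Gujarati
    "\u0B4D"  # Oriya
    "\u0BCD"  # Tamil
    "\u0C4D"  # Telugu
    "\u0CCD"  # Kannada
    "\u0D4D"  # Malayalam
)

# Spaces/tabs immediately before a halant.
_BEFORE = re.compile(r"[ \t]+([" + HALANTS + r"])")
# Spaces/tabs immediately after a halant, when more text follows.
_AFTER = re.compile(r"([" + HALANTS + r"])[ \t]+(?=\S)")


def undo_spaced_halant(text, join_following=True):
    """Re-attach a halant that a tokenizer split off as punctuation.

    Hypothetical helper: a halant always belongs to the preceding
    consonant, so whitespace before it is removed unconditionally.
    Whitespace after it is removed only when join_following is True.
    """
    text = _BEFORE.sub(r"\1", text)
    if join_following:
        text = _AFTER.sub(r"\1", text)
    return text


# Devanagari KA, space, halant, space, SSA is rejoined into one conjunct.
assert undo_spaced_halant("\u0915 \u094D \u0937") == "\u0915\u094D\u0937"
```

One limitation of this approach: once spaces have been added on both sides, a conjunct-forming halant and a word-final halant followed by another word look identical, so `join_following=True` can merge two words that were originally separate. Passing `join_following=False` only re-attaches the halant to the preceding consonant and leaves the following space in place.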
Hi @anoopkunchukuttan, can you add an example for this?