
'narrow no-break space' ("\u202f") is not recognized as a word boundary

Open LBeaudoux opened this issue 4 years ago • 0 comments

Unlike the 'no-break space' ("\u00A0"), the 'narrow no-break space' ("\u202f") is not recognized as a word boundary.

tokenize("La vois-tu souvent\u202f?", "fr") returns ['la', 'vois', 'tu', 'souvent\u202f'] instead of ['la', 'vois', 'tu', 'souvent'].

This is a problem because French typography requires a non-breaking space (ideally a narrow one) between certain punctuation marks, such as ; : ! ?, and the word that precedes them.

I suppose one solution would be to modify "TOKEN_RE" in the "tokens" module to handle this case, unless that would cause undesirable effects in other languages. Another solution would be to replace "\u202f" with "\u00A0" when preprocessing French texts.
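As a rough illustration of the second option, here is a minimal sketch of the preprocessing step. It uses only the standard library; the helper name `normalize_narrow_nbsp` is made up for this example and is not part of wordfreq.

```python
# Hypothetical preprocessing helper (not part of wordfreq): swap the
# narrow no-break space for the regular no-break space before tokenizing.
NARROW_NBSP = "\u202f"
NBSP = "\u00a0"


def normalize_narrow_nbsp(text: str) -> str:
    """Return text with every U+202F replaced by U+00A0."""
    return text.replace(NARROW_NBSP, NBSP)


sample = "La vois-tu souvent\u202f?"
cleaned = normalize_narrow_nbsp(sample)
```

The cleaned string could then be passed to tokenize(cleaned, "fr"). Since the no-break space is already treated as a word boundary, I would expect the trailing character on 'souvent' to disappear, but I have only reasoned about this rather than checked it against other languages.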

Thank you anyway for sharing this library, which is essential for me when it comes to identifying the rarest words in a text.

LBeaudoux, May 27 '20 16:05