nlpaug
Character Augmenters remove non-breaking spaces before punctuation and insert spaces around apostrophes
When using Character Augmenters (random and keyboard, specifically) on French utterances, I noticed two things:
- When there is a space before punctuation (non-breaking space, as described here), it is removed by the augmenter.
- The augmenter adds a space before and after an apostrophe.
It seems like both of these would be unwanted behaviors, as ideally the augmenter would only make the change specified in the docs, and not change anything else.
For example, when I run this (with nlpaug.augmenter.char imported):
nlpaug.augmenter.char.KeyboardAug(min_char=4, aug_word_max=1, aug_char_p=0.1).augment("un espace avant le point d'interrogation ?", n=1)
I get this:
"un esoace avant le point d ' interrogation?"
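The two side effects in this example can be checked mechanically. The following is a minimal sketch, not part of nlpaug: the helper name `spacing_side_effects` and both regular expressions are assumptions made for illustration. It only compares an input string with an augmented string, so it runs without nlpaug installed.

```python
import re

# Whitespace (including U+00A0) directly before French "high" punctuation.
SPACE_BEFORE_PUNCT = re.compile(r"\s+(?=[?!;:])")
# An apostrophe (straight or curly) with whitespace on at least one side.
SPACED_APOSTROPHE = re.compile(r"\s+['’]|['’]\s+")


def spacing_side_effects(original: str, augmented: str) -> dict:
    """Hypothetical helper: flag the two spacing changes described above."""
    return {
        # Fewer "space + punctuation" pairs than in the input.
        "lost_space_before_punct": len(SPACE_BEFORE_PUNCT.findall(augmented))
        < len(SPACE_BEFORE_PUNCT.findall(original)),
        # More spaced apostrophes than in the input.
        "gained_space_around_apostrophe": len(SPACED_APOSTROPHE.findall(augmented))
        > len(SPACED_APOSTROPHE.findall(original)),
    }


# The input and output strings quoted in this report.
report = spacing_side_effects(
    "un espace avant le point d'interrogation ?",
    "un esoace avant le point d ' interrogation?",
)
print(report)
# {'lost_space_before_punct': True, 'gained_space_around_apostrophe': True}
```

A check like this could be dropped into a regression test around any character augmenter to confirm that only the intended characters change.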
It seems there is a general problem with the char augmenters whenever certain punctuation characters are provided. Take the following, which is annoying:
string.punctuation
result:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
And this is what happens when applying one of the noted char augs (with string imported and nlpaug.augmenter.char imported as nac):
nac.RandomCharAug(action="insert").augment(string.punctuation)
result:
! " # $% & \' () * +, -. /: ; <= >? @ [\\] ^ _
{|} ~`
Please note, the punctuation list is incomplete.
Yes, this is a huge problem! It needs to be addressed.
This is pretty awful. Makes the whole thing unusable if you need this functionality.
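Until this is addressed, one possible workaround is to keep the augmenter away from whitespace and punctuation entirely. The sketch below is an assumption about how that could be done, not nlpaug's own API: `augment_letters_only` and `substitute_one_char` are hypothetical names, and `substitute_one_char` is only a stand-in for a real character augmenter. The idea is to apply a word-level callable to runs of letters via `re.sub`, so every other character (spaces, non-breaking spaces, apostrophes, punctuation) is copied through verbatim.

```python
import random
import re

# Runs of word characters excluding digits and underscore (roughly: letters).
LETTER_RUN = re.compile(r"[^\W\d_]+")


def substitute_one_char(word: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a character augmenter: replace one letter."""
    i = rng.randrange(len(word))
    return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]


def augment_letters_only(text: str, augment_word, min_char: int = 4) -> str:
    """Apply augment_word to each letter run of at least min_char letters.

    Everything outside the matched letter runs is left untouched.
    """
    def repl(match: re.Match) -> str:
        word = match.group(0)
        return augment_word(word) if len(word) >= min_char else word

    return LETTER_RUN.sub(repl, text)


rng = random.Random(0)  # fixed seed so the run is repeatable
text = "un espace avant le point d'interrogation\u00a0?"
out = augment_letters_only(text, lambda w: substitute_one_char(w, rng))
print(out)
```

Note that this sketch augments every eligible word rather than at most one, so it does not reproduce aug_word_max=1. The word-level callable could in principle wrap a real augmenter, but the return type of augment (string or list) depends on the nlpaug version, so that part is left out here.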