
Character Augmenters remove non-breaking spaces before punctuation and insert spaces around apostrophes

Open lindsaydbrin opened this issue 2 years ago • 3 comments

When using Character Augmenters (random and keyboard, specifically) on French utterances, I noticed two things:

  • When there is a space before punctuation (non-breaking space, as described here), it is removed by the augmenter.
  • The augmenter adds a space before and after an apostrophe.

It seems like both of these would be unwanted behaviors, as ideally the augmenter would only make the change specified in the docs, and not change anything else.

For example, when I run this:

nlpaug.augmenter.char.KeyboardAug(min_char=4, aug_word_max=1, aug_char_p=0.1).augment("un espace avant le point d'interrogation ?", n=1)

I get this:

"un esoace avant le point d ' interrogation?"

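To make the side effects in the example above explicit, here is a minimal sketch that diffs the input against the augmented output and keeps only the edits that touch nothing but whitespace. It uses only the standard library's `difflib`; the helper name `whitespace_only_edits` is hypothetical and not part of nlpaug, and the two strings are copied from the example above.

```python
import difflib


def whitespace_only_edits(original, augmented):
    """Return the edits between two strings that add or remove only whitespace."""
    matcher = difflib.SequenceMatcher(None, original, augmented, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed, added = original[i1:i2], augmented[j1:j2]
        # Treat an edit as a side effect when everything it removes or adds
        # is whitespace (str.isspace() also covers non-breaking spaces).
        if (removed + added).isspace():
            edits.append((tag, removed, added))
    return edits


original = "un espace avant le point d'interrogation ?"
augmented = "un esoace avant le point d ' interrogation?"

# Character substitutions such as "p" -> "o" are the intended augmentation
# and are filtered out; only spacing changes are reported.
for edit in whitespace_only_edits(original, augmented):
    print(edit)
```

A check like this could be run over a sample of augmented utterances to estimate how often spacing is altered. It is only a heuristic, since an augmenter may also legitimately substitute or insert characters next to whitespace.
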
lindsaydbrin avatar Oct 12 '22 16:10 lindsaydbrin

It seems there is a general problem with the char augmenters whenever certain punctuation characters are provided. The following is annoying. The input is string.punctuation:

result: !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~

And this is what happens when applying one of the char augmenters mentioned above:

nac.RandomCharAug(action="insert").augment(string.punctuation)

result: ! " # $% & \' () * +, -. /: ; <= >? @ [\\] ^ _ {|} ~`

Please note, the punctuation list is incomplete.
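
One possible direction, offered as a sketch only: a tokenizer and reverse tokenizer that round-trip the input exactly, so that spacing and punctuation survive when nothing else is changed. The names `ws_tokenizer` and `ws_reverse_tokenizer` are hypothetical. As far as I recall, nlpaug's character augmenters accept `tokenizer` and `reverse_tokenizer` arguments, but that is an assumption here, and whether passing this pair avoids the behavior reported above is untested.

```python
import re
import string


def ws_tokenizer(text):
    """Split on runs of whitespace, keeping each run as its own token."""
    # The capturing group makes re.split() return the separators as well;
    # empty strings at the edges are dropped.
    return [token for token in re.split(r"(\s+)", text) if token != ""]


def ws_reverse_tokenizer(tokens):
    """Concatenate tokens without adding or removing any characters."""
    return "".join(tokens)


# Round trip: the text is reproduced exactly when no token is changed,
# including a non-breaking space before the question mark.
samples = [
    "un espace avant le point d'interrogation ?",
    "un espace avant le point d'interrogation\u00a0?",
    string.punctuation,
]
for sample in samples:
    print(ws_reverse_tokenizer(ws_tokenizer(sample)) == sample)
```

Note that whitespace runs become tokens of their own in this sketch, so an augmenter using it might need to skip tokens for which `token.isspace()` is true.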

maxw1489 avatar Nov 24 '22 12:11 maxw1489

Yes this is a huge problem! It needs to be addressed.

Alec-Stashevsky avatar May 19 '23 21:05 Alec-Stashevsky

This is pretty awful. Makes the whole thing unusable if you need this functionality.

fierval avatar Jan 30 '24 23:01 fierval