Reason for not applying remove_non_printing_characters normalization

Open • JoeyOhman opened this issue 2 years ago • 1 comment

Hi,

We are greatly inspired by this work and are in the process of cleaning our own data. However, if we understand correctly, the remove_non_printing_characters normalization step is not used in the final cleaning. Do you have any thoughts on why it should not be used?

https://github.com/bigscience-workshop/data_tooling/blob/e28064ec7fb38af5143cafc896e9423a8b12392d/ac_dc/normalization.py#L5

There you have this:

non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0,32)) + list(range(127,160))))}]"
)

We modified this to keep newlines (\n) and tabs (\t), and to additionally remove soft hyphens (code point 173), non-breaking spaces (160), and zero-width spaces (8203):

import re

# 160 = non-breaking space, 173 = soft hyphen, 8203 = zero-width space
additional_chars_to_remove = [160, 173, 8203]
non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0,9)) + list(range(11, 32)) + list(range(127,160)) + additional_chars_to_remove))}]"
)

There could of course be more characters that one may want to remove.
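
To make the effect concrete, here is a minimal sketch of how the modified pattern behaves on a small string. The helper name strip_non_printing and the sample text are made up for illustration and are not taken from the repository:

```python
import re

# Code points removed in addition to the control-character ranges:
# 160 = non-breaking space, 173 = soft hyphen, 8203 = zero-width space.
additional_chars_to_remove = [160, 173, 8203]

# Same character set as the modified pattern: everything in 0-31 except
# tab (9) and newline (10), plus 127-159 and the extra code points.
non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0, 9)) + list(range(11, 32)) + list(range(127, 160)) + additional_chars_to_remove))}]"
)


def strip_non_printing(text: str) -> str:
    """Hypothetical helper: delete every character matched by the pattern."""
    return non_printing_characters_re.sub("", text)


# Sample containing a soft hyphen, a zero-width space, a tab, a newline,
# a NUL byte, and a non-breaking space.
sample = "soft\u00adhyphen\u200b and\ttab\nnew\x00line\u00a0end"
cleaned = strip_non_printing(sample)
print(repr(cleaned))  # 'softhyphen and\ttab\nnewlineend'
```

Two things are visible here: tabs and newlines survive, and because matched characters are deleted rather than replaced, the non-breaking space between "line" and "end" disappears and the two words are joined. Carriage returns (13) are also removed, since only 9 and 10 are excluded from the range.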

To be clear, I am writing this here for two reasons:

  1. To get your feedback. Do you think this is a good idea to use for the final data cleaning?
  2. If so, this could be incorporated into this repository to help other people that might be thinking about this.

Thanks for your amazing contributions!

JoeyOhman, May 20 '22 11:05