data_tooling
Reason for not applying remove_non_printing_characters normalization
Hi,
We are much inspired by this great work and are in the process of cleaning our data. However, if we understand correctly, the `remove_non_printing_characters`
normalization step is not used for the final cleaning. Do you have any thoughts on why it should not be used?
https://github.com/bigscience-workshop/data_tooling/blob/e28064ec7fb38af5143cafc896e9423a8b12392d/ac_dc/normalization.py#L5
There you have this:

```python
non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))}]"
)
```
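For context, that pattern matches the C0 control characters (U+0000 to U+001F) plus DEL and the C1 controls (U+007F to U+009F), which includes `\n` and `\t`. A quick sketch of its effect (the sample string is ours for illustration):

```python
import re

# Original pattern: C0 controls (0-31), DEL and C1 controls (127-159)
non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))}]"
)

text = "line one\nline two\tend\x07"
# Newlines and tabs are stripped along with the bell character:
print(non_printing_characters_re.sub("", text))  # "line oneline twoend"
```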
Which we modified to keep newlines (`\n`) and tabs (`\t`), and to also remove soft hyphens, non-breaking spaces, and zero-width spaces:
```python
additional_chars_to_remove = [160, 173, 8203]
non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0, 9)) + list(range(11, 32)) + list(range(127, 160)) + additional_chars_to_remove))}]"
)
```
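As a sanity check, here is the modified pattern in a self-contained snippet. The ranges skip code points 9 (`\t`) and 10 (`\n`), while the extra list adds U+00A0 (non-breaking space), U+00AD (soft hyphen), and U+200B (zero-width space); the sample string below is ours for illustration:

```python
import re

# Modified pattern: keep \t (9) and \n (10), additionally remove
# NBSP (160), soft hyphen (173), and zero-width space (8203)
additional_chars_to_remove = [160, 173, 8203]
non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0, 9)) + list(range(11, 32)) + list(range(127, 160)) + additional_chars_to_remove))}]"
)

text = "soft\u00adhyphen\u00a0and\u200bzero-width\nnew\tline"
# The invisible characters disappear, but \n and \t survive:
print(non_printing_characters_re.sub("", text))  # "softhyphenandzero-width\nnew\tline"
```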
There could of course be more characters that one may want to remove.
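One way to look for further candidates (just a sketch, not something the repository does): the Unicode "format" category (Cf) covers invisible characters such as directional marks and joiners, and in fact already contains the soft hyphen and zero-width space. The helper name below is our own:

```python
import unicodedata

# Unicode category "Cf" (format) marks invisible control-like characters,
# e.g. directional marks (U+200E/U+200F) and the word joiner (U+2060).
def is_invisible_format_char(ch: str) -> bool:
    return unicodedata.category(ch) == "Cf"

sample = "left\u200eto\u200fright\u2060joined"
visible = "".join(ch for ch in sample if not is_invisible_format_char(ch))
print(visible)  # "lefttorightjoined"
```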
To be clear, I am writing this here for two reasons:
- To get your feedback. Do you think this would be a good approach for the final data cleaning?
- If so, this could be incorporated into this repository to help other people that might be thinking about this.
Thanks for your amazing contributions!