Text normalization too aggressive?
The text normalization in Utils.normalize() seems pretty heavy-handed for something that is irreversible and non-optional. It's also not computationally expensive, so downstream consumers could easily apply it themselves if they want that level of normalization.
On the flip side, if one were going to normalize that heavily, you'd probably also want to apply Unicode normalization and output one of the canonical/compatibility forms, such as NFKC.
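For reference, Unicode normalization to a compatibility form is available in Python's standard library; a minimal sketch (the function name is just illustrative):

```python
import unicodedata

def to_nfkc(text: str) -> str:
    """Normalize text to Unicode NFKC (compatibility decomposition,
    followed by canonical composition)."""
    return unicodedata.normalize("NFKC", text)

# NFKC folds compatibility characters, e.g. the ligature U+FB01 ("ﬁ")
# becomes the two letters "fi".
print(to_nfkc("\ufb01le"))  # -> "file"
```

Note that NFKC is itself lossy (ligatures, superscripts, and width variants are folded), so it would belong in the same optional, consumer-side step rather than in the base corpus.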
Perhaps this could all be packaged up into a small set of utility methods which are made available, but not run on the base corpus.
The intention here was to produce quite "clean" text as the output for NLP purposes, where the many variants of dashes, quotation marks, and unusual whitespace characters only cause problems in downstream tasks (based on my experience with Web genres in argumentation mining and annotation for IR).
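To make the kind of cleanup meant here concrete, a hypothetical sketch of mapping such punctuation and whitespace variants to plain ASCII (the character table is illustrative only, not the actual contents of Utils.normalize()):

```python
# Illustrative map of "problematic" Web-text characters to ASCII equivalents;
# a real implementation would cover many more code points.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> apostrophe
    "\u201c": '"', "\u201d": '"',   # curly double quotes -> straight quote
    "\u2013": "-", "\u2014": "-",   # en dash / em dash -> hyphen
    "\u00a0": " ", "\u2009": " ",   # no-break space / thin space -> space
})

def clean_for_nlp(text: str) -> str:
    """Fold dash, quote, and whitespace variants into their ASCII forms."""
    return text.translate(PUNCT_MAP)

print(clean_for_nlp("\u201cfoo\u201d \u2013 bar"))  # -> "foo" - bar
```

Packaging something like this as a standalone utility, as suggested above, would let consumers opt in without baking the loss into the corpus.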
We didn't have any other use case in which the "dirty" content should play an important role.
I agree this is irreversible and could be postponed to a later stage as an optional step. Before deciding, I would wait for input from the LREC community and their potential use cases.