texthero
Add a flag to remove_punctuation to prevent removing punctuation in a token
Overview
During pre-processing, we sometimes want to remove the punctuation around tokens while preserving the punctuation inside a token. Example:
spider-man is powerful, isn't?
In this case, we might expect `remove_punctuation` to return:
spider-man is powerful isn't
Approach
We need to modify `remove_punctuation` and add a new argument named `keep_tokens`, `remove_in_between`, or something similar.
For the implementation, we can either tokenize the text (see `texthero.preprocessing.tokenize`) and remove all "punctuation tokens" or, probably better, add a regex that drops every punctuation symbol that is not between two characters (see the `tokenize` function again for an example of such a regular expression).
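The regex option could look roughly like the sketch below. It works on plain Python strings rather than on a pandas Series. The function name `strip_punctuation` and the `keep_in_token` flag are hypothetical placeholders, not existing texthero names. It also assumes that "between two characters" means "between two letters or digits" and that removed symbols are simply deleted.

```python
import re
import string

# Character class body covering every ASCII punctuation symbol.
_PUNCT = re.escape(string.punctuation)

# Matches any punctuation symbol.
_ALL_PUNCT = re.compile(rf"[{_PUNCT}]")

# Matches a punctuation symbol whose left OR right neighbour is not a
# letter/digit. [^\W_] means "word character except underscore".
_OUTER_PUNCT = re.compile(rf"(?<![^\W_])[{_PUNCT}]|[{_PUNCT}](?![^\W_])")


def strip_punctuation(text: str, keep_in_token: bool = False) -> str:
    """Hypothetical helper: drop punctuation, optionally keeping it inside tokens."""
    pattern = _OUTER_PUNCT if keep_in_token else _ALL_PUNCT
    return pattern.sub("", text)


sentence = "spider-man is powerful, isn't?"
print(strip_punctuation(sentence))                      # removes everything
print(strip_punctuation(sentence, keep_in_token=True))  # keeps "-" and "'"
```

With this pattern, a run of consecutive symbols such as `a--b` is removed entirely, because neither hyphen sits directly between two letters or digits. Whether that is the desired behaviour is part of the same design decision.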
Open question
Decide what the default behaviour should look like. It is probably better to remove all punctuation by default, while making it clear that there is an option to keep punctuation that appears inside tokens.
I'd agree that the default behaviour should be to remove all punctuation. I'll try to submit a pull request with these changes later today.
Hi @AdamHodgson! Thank you. When you submit a PR, please do not forget to test it and add new unit tests. If you haven't read it yet, I encourage you to look at the CONTRIBUTING.md document.