
Add a flag to remove_punctuation to prevent removing punctuation in a token

Open jbesomi opened this issue 4 years ago • 2 comments

Overview

During pre-processing, we sometimes want to remove punctuation while preserving the punctuation that sits inside a token. Example:

spider-man is powerful, isn't?

In this case, we might expect remove_punctuation to return:

spider-man is powerful isn't

Approach

We need to modify remove_punctuation and add a new argument, named keep_tokens, remove_in_between, or something similar.

For the implementation, we can either tokenize the text (see texthero.preprocessing.tokenize) and remove all "punctuation tokens" or, probably better, add a regex that drops all punctuation symbols that are not between two characters (see the tokenize function again for an example of such a regular expression).
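To make the regex option concrete, here is a minimal sketch. It is not texthero's implementation: the function name strip_punctuation and the flag keep_in_token are hypothetical, it works on a plain str rather than a pandas Series, and it deletes punctuation outright instead of replacing it with a space. It assumes "between two characters" means a letter or digit on both sides of the symbol.

```python
import re
import string

# One punctuation character, taken from string.punctuation.
_PUNCT = f"[{re.escape(string.punctuation)}]"
# One letter or digit: a word character that is not an underscore.
_ALNUM = r"[^\W_]"

# Matches every punctuation character.
_ANY_PUNCT = re.compile(_PUNCT)
# Matches punctuation that lacks a letter/digit on at least one side,
# i.e. everything except punctuation enclosed inside a token.
_OUTER_PUNCT = re.compile(f"(?<!{_ALNUM}){_PUNCT}|{_PUNCT}(?!{_ALNUM})")


def strip_punctuation(text: str, keep_in_token: bool = False) -> str:
    """Hypothetical helper: remove punctuation from ``text``.

    With keep_in_token=True, punctuation that has a letter or digit on
    both sides (as in "spider-man" or "isn't") is left untouched.
    """
    pattern = _OUTER_PUNCT if keep_in_token else _ANY_PUNCT
    return pattern.sub("", text)


sentence = "spider-man is powerful, isn't?"
print(strip_punctuation(sentence))                      # spiderman is powerful isnt
print(strip_punctuation(sentence, keep_in_token=True))  # spider-man is powerful isn't
```

The flag defaults to False here so that existing behaviour (remove everything) would stay unchanged unless the caller opts in.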

Open question

Decide what the default behaviour should look like. It is probably better to remove all punctuation by default, but make it clear that there is an option to keep punctuation that appears inside tokens.

jbesomi avatar Jun 04 '20 05:06 jbesomi

I'd agree that the default behaviour should be to remove all punctuation. I'll try to submit a pull request with these changes later today.

AdamHodgson avatar Jul 08 '20 09:07 AdamHodgson

Hi @AdamHodgson! Thank you. When you submit a PR, please do not forget to test it and add new unit tests. If you haven't read it yet, I encourage you to look at the CONTRIBUTING.md document.

jbesomi avatar Jul 08 '20 09:07 jbesomi