better_profanity
Add get censored words & censor middle only features
Provides a solution for issue #34. Sorry, I kind of messed up my branches, so this commit is merged with the other PR I created (#35).
Again, it doesn't break anything: the new behavior only applies when get_censored_words is True. In that case censor returns a tuple of (str, list), where the str is the censored text and the list contains the censored words.
Usage:
from better_profanity import profanity
if __name__ == "__main__":
    profanity.load_censor_words()
    text = "test fucking shit"
    censored_text, censored_words = profanity.censor(text, get_censored_words=True)
    print(censored_words)
    # ['fucking', 'shit']
Separated the functions as you said, but while writing unit tests and testing edge cases I realized that the current _hide_swear_words function merges multiple swear words into one when they're next to each other. Both features work well for single, unseparated swear words, but they don't behave well in these kinds of situations.
Example with get_censored_words:
bad_text = "Dude, I hate shit. Fuck bullshit."
profanity.get_censored_words(bad_text)
>>>['shit', 'bullshit']
# It completely ignored "Fuck" since they're merged
bad_text = "That wh0re gave m3 a very good H@nD j0b."
profanity.get_censored_words(bad_text)
>>>['wh0re', 'H@nD']
# It didn't include "j0b" since they're separated by a space
Example with middle_only (same issues):
bad_text = "Dude, I hate shit. Fuck bullshit."
profanity.censor(bad_text, middle_only=True)
>>>"Dude, I hate s**t b******t."
# It completely ignored "Fuck" since they're merged
bad_text = "That wh0re gave m3 a very good H@nD j0b."
profanity.censor(bad_text, middle_only=True)
>>>"That w***e gave m3 a very good H**D."
# It didn't include "j0b" since they're separated by a space
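For reference, the masking rule that the middle_only outputs above suggest can be sketched on its own. This is a standalone illustration, not the code in this PR: mask_middle is a hypothetical helper name, and fully masking words of one or two characters is an assumption.

```python
def mask_middle(word: str, censor_char: str = "*") -> str:
    """Keep the first and last character and mask everything in between."""
    if len(word) <= 2:
        # Assumption: a word this short has no middle, so mask all of it.
        return censor_char * len(word)
    return word[0] + censor_char * (len(word) - 2) + word[-1]


print(mask_middle("hello"))  # h***o
print(mask_middle("H@nD"))   # H**D
```

The mask keeps the original word length, which matches outputs such as "H**D" for "H@nD" above.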
To solve that, I simply added a check before merging swear words: only merge if not (get_censored_words or middle_only). The results are better and all the other unit tests still pass, but some of the coverage your method provided is lost, which means it will detect fewer swear words and might lead to inconsistencies.
Example with get_censored_words:
bad_text = "Dude, I hate shit. Fuck bullshit."
profanity.get_censored_words(bad_text)
>>>['shit', 'Fuck', 'bullshit']
bad_text = "That wh0re gave m3 a very good H@nD j0b."
profanity.get_censored_words(bad_text)
>>>['wh0re']
# It didn't include "H@nD j0b"
Example with middle_only (same issues):
bad_text = "Dude, I hate shit. Fuck bullshit."
profanity.censor(bad_text, middle_only=True)
>>>"Dude, I hate s**t. F**k b******t."
bad_text = "That wh0re gave m3 a very good H@nD j0b."
profanity.censor(bad_text, middle_only=True)
>>>"That w***e gave m3 a very good H@nD j0b."
# It didn't include "H@nD j0b"
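To make the trade-off concrete, here is a rough, self-contained sketch of the guard described above. It is a toy matcher and does not reproduce _hide_swear_words: CENSOR_WORDS, should_merge and find_matches are hypothetical names made up for the illustration, and the word list is a harmless stand-in.

```python
import re
from typing import List

# Hypothetical stand-in word list; the library loads its own.
CENSOR_WORDS = {"darn", "heck"}


def should_merge(get_censored_words: bool = False, middle_only: bool = False) -> bool:
    """The guard: merge only when neither new option is set."""
    return not (get_censored_words or middle_only)


def find_matches(text: str, merge_adjacent: bool) -> List[str]:
    """Toy matcher: list the matched words, optionally merging adjacent ones."""
    groups: List[List[str]] = []
    previous_was_match = False
    for token in re.findall(r"\w+", text):
        is_match = token.lower() in CENSOR_WORDS
        if is_match and merge_adjacent and previous_was_match:
            groups[-1].append(token)  # extend the current run of matches
        elif is_match:
            groups.append([token])    # start a new, separate match
        previous_was_match = is_match
    return [" ".join(group) for group in groups]


text = "oh darn heck, not again"
print(find_matches(text, should_merge()))                         # ['darn heck']
print(find_matches(text, should_merge(get_censored_words=True)))  # ['darn', 'heck']
```

With the guard off, adjacent matches are reported one by one, which is what the new options need. The cost is the one shown in the examples above: phrases that are only caught as a merged unit are no longer detected.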
Maybe we should just follow this method and warn users of these possible issues? I think it's a pretty mild edge case anyway, but it's up to you.