better_profanity
better_profanity copied to clipboard
Consider a trie based approach to maybe increase the overall performance of this package
Hello,
In the context of my application, I developed a simple profanity checker. I compared it to yours and mine runs 10,000X faster.
Note: What I do not take into consideration are varying swear words (sex, s3x, etc.). However, building the Trie
can incorporate these variations. In my context, I do not do it because I check well written book titles and descriptions.
Note 2: I only use a contains-based approach. However, this approach can also use a censor method.
Here is the implementation:
import re
class TrieNode:
def __init__(self):
self.children = {}
self.is_end_of_word = False
class ProfanityChecker:
def __init__(self):
with open("profanity.txt", "r") as f:
self.root = self.build_trie(f.read().splitlines())
# Building the trie: O(m * k),
# where m is the number of words in the dictionary and k is the average length of words.
def build_trie(self, dictionary):
root = TrieNode()
for word in dictionary:
node = root
for char in word:
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
node.is_end_of_word = True
return root
# Searching each word in the text: O(n * k), where n is the number of words in the text.
def contains_profanity(self, text):
def search(remaining_text, node=self.root):
for i, char in enumerate(remaining_text):
if char in node.children:
node = node.children[char]
if node.is_end_of_word and (i == len(remaining_text) - 1 or remaining_text[i + 1] in (' ', '-')):
return True
else:
break # Stop searching if the character is not in the trie
return False
# Remove punctuation and convert to lowercase before tokenizing
text = re.sub(r'[^\w\s]', '', text)
words = text.split()
for word in words:
if search(word.lower()):
return True
return False
if __name__ == '__main__':
pc = ProfanityChecker()
print(pc.contains_profanity("my assessment")) # Output: False
print(pc.contains_profanity("my ass essment")) # Output: True
print(pc.contains_profanity("my ass-essment")) # Output: True