better_profanity

Consider a trie-based approach to improve the overall performance of this package

Open mourad1081 opened this issue 1 year ago • 0 comments

Hello,

For my application, I developed a simple profanity checker. When I compared it to yours, mine ran 10,000x faster.

Note: I do not account for spelling variations of swear words (sex, s3x, etc.). However, these variations can be added when the trie is built. I skip this step because I only check well-written book titles and descriptions.
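
The variation handling described in the note above could be sketched as follows. This is a minimal sketch, not part of the posted implementation: it uses a plain nested-dict trie, and `SUBSTITUTIONS`, `expand_variants`, `build_trie`, `contains_word`, and the substitution table itself are names and values assumed for illustration. Each dictionary word is expanded into its digit-substituted spellings before insertion, so a lookup stays a single trie walk.

```python
from itertools import product

END = ""  # sentinel key marking a complete word; never equal to a single character

# Hypothetical substitution table: each letter maps to the characters
# that may stand in for it. Only digits are used here, so the variants
# survive a punctuation-stripping step.
SUBSTITUTIONS = {
    "a": "a4",
    "e": "e3",
    "i": "i1",
    "o": "o0",
    "s": "s5",
}


def expand_variants(word):
    """Yield every spelling of `word` allowed by SUBSTITUTIONS."""
    choices = [SUBSTITUTIONS.get(char, char) for char in word.lower()]
    for combo in product(*choices):
        yield "".join(combo)


def build_trie(words):
    """Insert every variant of every word into a nested-dict trie."""
    root = {}
    for word in words:
        for variant in expand_variants(word):
            node = root
            for char in variant:
                node = node.setdefault(char, {})
            node[END] = True
    return root


def contains_word(root, token):
    """Return True if `token` (case-insensitive) is a complete word in the trie."""
    node = root
    for char in token.lower():
        node = node.get(char)
        if node is None:
            return False
    return END in node


if __name__ == "__main__":
    trie = build_trie(["sex"])
    print(contains_word(trie, "s3x"))   # True
    print(contains_word(trie, "sexy"))  # False
```

Every substitutable letter multiplies the number of stored spellings, so the trie grows quickly for long words. Matching the alternates at lookup time instead of at build time is another option.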

Note 2: I only implement a contains-style check. However, the same approach could also support a censor method.

Here is the implementation:

import re

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False


class ProfanityChecker:
    def __init__(self):
        with open("profanity.txt", "r") as f:
            self.root = self.build_trie(f.read().splitlines())

    # Building the trie: O(m * k),
    # where m is the number of words in the dictionary and k is the average length of words.
    def build_trie(self, dictionary):
        root = TrieNode()
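        for word in dictionary:
            word = word.strip().lower()  # lookups are lowercased, so store entries lowercased too
            if not word:
                continue  # skip blank lines in the word list
            node = root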
            for char in word:
                if char not in node.children:
                    node.children[char] = TrieNode()
                node = node.children[char]
            node.is_end_of_word = True
        return root

    # Searching each word in the text: O(n * k), where n is the number of words in the text.
    def contains_profanity(self, text):
        def search(remaining_text, node=self.root):
            for i, char in enumerate(remaining_text):
                if char in node.children:
                    node = node.children[char]
                    if node.is_end_of_word and (i == len(remaining_text) - 1 or remaining_text[i + 1] in (' ', '-')):
                        return True
                else:
                    break  # Stop searching if the character is not in the trie
            return False

        # Remove punctuation except hyphens, which search() treats as a word boundary.
        # Lowercasing happens per word in the loop below.
        text = re.sub(r'[^\w\s-]', '', text)
        words = text.split()

        for word in words:
            if search(word.lower()):
                return True
        return False


if __name__ == '__main__':
    pc = ProfanityChecker()
    print(pc.contains_profanity("my assessment"))  # Output: False
    print(pc.contains_profanity("my ass essment"))  # Output: True
    print(pc.contains_profanity("my ass-essment"))  # Output: True
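
Note 2 mentions that the same approach could support a censor method. A minimal sketch of what that might look like is shown below. It is an illustration under assumptions rather than the posted code or the package's API: `censor`, `in_trie`, the `mask` parameter, and the nested-dict trie are hypothetical, and only whole tokens matched by `\w+` are masked.

```python
import re

END = ""  # sentinel key marking a complete word; never equal to a single character


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for char in word.lower():
            node = node.setdefault(char, {})
        node[END] = True
    return root


def in_trie(root, token):
    """Return True if `token` (case-insensitive) is a complete word in the trie."""
    node = root
    for char in token.lower():
        node = node.get(char)
        if node is None:
            return False
    return END in node


def censor(text, root, mask="*"):
    """Replace each token found in the trie with `mask` repeated to the token's length."""
    def replace(match):
        token = match.group(0)
        return mask * len(token) if in_trie(root, token) else token

    # \w+ matches runs of word characters, so punctuation and spacing are preserved.
    return re.sub(r"\w+", replace, text)


if __name__ == "__main__":
    trie = build_trie(["darn"])
    print(censor("Well, darn it. Darn-it!", trie))  # Well, **** it. ****-it!
```

Masking token by token keeps the original punctuation and spacing intact, and each token still costs one trie walk.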

mourad1081, Aug 19 '23 09:08