
Be case_sensitive w.r.t. whitespaces / blank spaces

Open sambaPython24 opened this issue 2 years ago • 2 comments

Hey, consider the following example:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
# keyword_processor.add_keyword(<unclean name>, <standardised name>)
keyword_processor.add_keyword('Big Apple')  # , 'New York')
keyword_processor.add_keyword('Bay Area')

text1_keywords = 'I love Big Apple and Bay Area.'
text2_keywords = 'I love Big  Apple and Bay Area.'

keywords_found_1 = keyword_processor.extract_keywords(text1_keywords)
keywords_found_2 = keyword_processor.extract_keywords(text2_keywords)

print(keywords_found_1, "vs.", keywords_found_2)

For everyday use it would be beneficial if the algorithm had an option not to distinguish between a single whitespace and multiple whitespaces. (A variant with a configurable maximum number of consecutive whitespaces to treat as one would be even better.)

For text extraction, we can easily preprocess the text and collapse every run of whitespace to a single space beforehand. For text replacement the task is much more complicated, since we might want to collapse the whitespace only inside the matched expression, not in the complete text.
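
A minimal sketch of the preprocessing step for the extraction case, using only the standard library. The helper name `collapse_whitespace` is made up for this illustration and is not part of flashtext. Because `\s` also matches `\n`, line breaks are collapsed as well.

```python
import re

# Matches any run of whitespace: spaces, tabs, and line breaks alike.
_WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with a single space (hypothetical helper)."""
    return _WHITESPACE_RUN.sub(" ", text)


text2_keywords = "I love Big  Apple and Bay\nArea."
normalized = collapse_whitespace(text2_keywords)
print(normalized)  # I love Big Apple and Bay Area.
```

The normalized string can then be passed to `extract_keywords`. Since the whole text is rewritten, this does not help with replacement, where the spacing outside the keyword should stay untouched.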

My question is hence: is it possible to build this whitespace insensitivity into the algorithm (even if regex patterns are not supported in general)?


The same question arises for line breaks ("\n").

sambaPython24 • Jun 03 '22 17:06

I was just looking to implement this as well.

guy4261 • Aug 29 '22 18:08

@sambaPython24 I managed to do this manually like this:

from flashtext import KeywordProcessor

proc = KeywordProcessor()
keyword = "dog cat"
proc.add_keyword(keyword)

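# The trie is a nested dict keyed by character.
proc.keyword_trie_dict
# Walk to the node reached after reading "dog ".
yy = proc.keyword_trie_dict['d']['o']['g'][' ']
# Self-loop: every additional space keeps the matcher on this same node.
yy[' '] = yy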

assert keyword in proc.extract_keywords(keyword)

spaced = "dog cat".replace(" ", "         ")
extracted = proc.extract_keywords(spaced)
assert keyword in extracted, extracted

If you want to modify the library code:

    def __setitem__(self, keyword, clean_name=None):
        """To add keyword to the dictionary
        pass the keyword and the clean name it maps to.

        Args:
            keyword : string
                keyword that you want to identify

            clean_name : string
                clean term for that keyword that you would want to get back in return or replace
                if not provided, keyword will be used as the clean name also.

        Examples:
            >>> keyword_processor['Big Apple'] = 'New York'
        """
        status = False
        if not clean_name and keyword:
            clean_name = keyword

        if keyword and clean_name:
            if not self.case_sensitive:
                keyword = keyword.lower()
            current_dict = self.keyword_trie_dict
            for letter in keyword:
                current_dict = current_dict.setdefault(letter, {})
+                if letter == " ":
+                    # self-loop on the node reached via the space, as in the manual example
+                    current_dict[" "] = current_dict
            if self._keyword not in current_dict:
                status = True
                self._terms_in_trie += 1
            current_dict[self._keyword] = clean_name
        return status

guy4261 • Aug 29 '22 18:08

Could you explain what exactly you did here:

proc.keyword_trie_dict
yy = proc.keyword_trie_dict['d']['o']['g'][' ']
yy[' '] = yy

If I want to use a big dictionary of keywords, do I have to apply this method to every single word? E.g. dog -> proc.keyword_trie_dict['d']['o']['g'][' '], cat -> proc.keyword_trie_dict['c']['a']['t'][' ']

sambaPython24 • Nov 20 '22 10:11

@sambaPython24 this is equivalent to treating every whitespace in a keyword as a potential run of whitespaces. That is what you see in the dog example.
The code change achieves this without your having to do it for every word in your corpus.
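
To make the self-loop idea concrete, here is a small stand-alone sketch on a plain nested dict. It does not use flashtext; `add_keyword`, `matches`, and `KEYWORD_END` are names invented for this illustration, and the matcher is deliberately simplistic (it reads the whole string from the root and ignores word boundaries).

```python
KEYWORD_END = "_keyword_"  # illustrative end-of-keyword sentinel key


def add_keyword(trie: dict, keyword: str) -> None:
    """Insert keyword into a nested-dict trie, adding a self-loop after each space."""
    node = trie
    for letter in keyword:
        node = node.setdefault(letter, {})
        if letter == " ":
            # Reading another space from this node leads back to the same node.
            node[" "] = node
    node[KEYWORD_END] = keyword


def matches(trie: dict, text: str) -> bool:
    """Return True if text, read from the root, ends exactly on a keyword."""
    node = trie
    for ch in text:
        if ch not in node:
            return False
        node = node[ch]
    return KEYWORD_END in node


trie = {}
add_keyword(trie, "dog cat")
print(matches(trie, "dog cat"))       # True
print(matches(trie, "dog  cat"))      # True
print(matches(trie, "dog      cat"))  # True
print(matches(trie, "dogcat"))        # False
```

At least one space is still required, because the loop sits on the node that is only reached after the first space has been read.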

guy4261 • Nov 20 '22 11:11

I.e. making the change above

                current_dict = current_dict.setdefault(letter, {})
+                if letter == " ":
+                    current_dict[" "] = current_dict

saves you from doing that for every word:

dog -> proc.keyword_trie_dict['d']['o']['g'][' '], cat -> proc.keyword_trie_dict['c']['a']['t'][' ']
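
If patching the library is not an option, the same effect can probably be had by walking the finished trie once. The sketch below assumes the trie is a nested dict in which end-of-keyword entries map to non-dict values; `add_space_self_loops` is a hypothetical helper, not a flashtext function, and it is demonstrated on a hand-built dict.

```python
def add_space_self_loops(trie: dict) -> int:
    """Add a self-loop to every node reached via a space edge.

    Returns the number of nodes patched. Nodes that already have an outgoing
    space edge (for example keywords that contain a double space) are left alone.
    """
    patched = 0
    seen = set()    # ids of visited nodes, so existing loops do not trap us
    stack = [trie]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        for key, child in node.items():
            if not isinstance(child, dict):
                continue  # end-of-keyword entry, not a trie node
            if key == " " and " " not in child:
                child[" "] = child
                patched += 1
            stack.append(child)
    return patched


# Hand-built trie for the single keyword "dog cat".
trie = {"d": {"o": {"g": {" ": {"c": {"a": {"t": {"_keyword_": "dog cat"}}}}}}}}
print(add_space_self_loops(trie))  # 1
print(add_space_self_loops(trie))  # 0, the second run finds nothing left to patch
```

With flashtext this would presumably be called as `add_space_self_loops(proc.keyword_trie_dict)` after all keywords have been added, and again after adding more. One caveat, which I have not tested: the trie becomes cyclic, so anything that walks the whole trie recursively (for example `get_all_keywords`) may no longer terminate.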

guy4261 • Nov 20 '22 11:11