flashtext
Be case_sensitive w.r.t. whitespaces /blank spaces
Hey, consider the following example:
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
# keyword_processor.add_keyword(<unclean name>, <standardised name>)
keyword_processor.add_keyword('Big Apple')#, 'New York')
keyword_processor.add_keyword('Bay Area')
text1_keywords = 'I love Big Apple and Bay Area.'
# Same sentence, but with several spaces inside each keyword
text2_keywords = 'I love Big   Apple and Bay   Area.'
keywords_found_1 = keyword_processor.extract_keywords(text1_keywords)
keywords_found_2 = keyword_processor.extract_keywords(text2_keywords)
print(keywords_found_1, " vs. ", keywords_found_2)
For everyday use it would be beneficial if the algorithm had an option to not distinguish between a single whitespace and multiple whitespaces. (A version where you can set the maximum number of whitespaces to be treated as one would be even better.)
For text extraction, we can easily preprocess the text and reduce multiple whitespaces to just one beforehand. For text replacement the task is much more complicated, since we might want to reduce the whitespaces only inside the matched expression, not in the complete text.
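The preprocessing step for extraction could look roughly like the sketch below. It only uses the standard `re` module; `collapse_whitespace` is a made-up helper name, not part of flashtext, and `\s+` also swallows tabs and line breaks, which may or may not be what you want.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, line breaks) with one space."""
    return re.sub(r"\s+", " ", text)

text = "I love Big   Apple and Bay\nArea."
# The normalized string can then be passed to extract_keywords as usual.
print(collapse_whitespace(text))  # I love Big Apple and Bay Area.
```

This changes the whole text, which is fine for extraction but is exactly the problem for replacement.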
My question is hence: Is it possible to build this whitespace insensitivity into the algorithm (even if regex patterns are not supported in general)?
The same question arises for line breaks ("\n").
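Until something like this exists in the library, one possible fallback for replacement is a plain regex, sketched below. This is not flashtext and will not have its speed on large keyword sets; `replace_keyword` is a hypothetical helper, and the sketch assumes keywords start and end with word characters so that `\b` works.

```python
import re

def replace_keyword(text: str, keyword: str, clean_name: str) -> str:
    """Replace `keyword` in `text`; each gap between its words matches any whitespace run."""
    # Escape each word, then join the words with "one or more whitespace characters".
    words = [re.escape(word) for word in keyword.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    # A function as replacement avoids backslash processing in clean_name.
    return re.sub(pattern, lambda match: clean_name, text)

text = "I love Big \n  Apple and  the   Bay Area."
print(replace_keyword(text, "Big Apple", "New York"))
# I love New York and  the   Bay Area.
```

Only the whitespace inside the matched expression is touched; the extra spaces elsewhere in the text stay as they are, and line breaks inside the keyword are covered by `\s+` as well.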
I was just looking to implement this as well.
@sambaPython24 I managed to do this manually like this:
from flashtext import KeywordProcessor
proc = KeywordProcessor()
keyword = "dog cat"
proc.add_keyword(keyword)
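# Inspect the trie: a nested dict with one level per character
print(proc.keyword_trie_dict)
# Node reached after "dog "; let its " " entry point to the node itself,
# so any further spaces are consumed before "cat" continues
yy = proc.keyword_trie_dict['d']['o']['g'][' ']
yy[' '] = yy
assert keyword in proc.extract_keywords(keyword)
# The same keyword with the single space doubled
spaced = "dog cat".replace(" ", "  ")
extracted = proc.extract_keywords(spaced)
assert keyword in extracted, extracted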
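To see why the self-reference helps, here is a small sketch on a plain nested dict, without flashtext. The `"_end_"` key and the `walk` helper are made up for illustration; flashtext uses its own marker key and a more involved matching loop.

```python
# Build the nested-dict trie for "dog cat" by hand.
trie = {}
node = trie
for ch in "dog cat":
    node = node.setdefault(ch, {})
node["_end_"] = "dog cat"  # hypothetical end-of-keyword marker

# Same trick as above: the node reached after the space points to itself on " ".
after_space = trie["d"]["o"]["g"][" "]
after_space[" "] = after_space

def walk(trie, text):
    """Follow `text` character by character; return the stored keyword or None."""
    node = trie
    for ch in text:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("_end_")

print(walk(trie, "dog cat"))     # dog cat
print(walk(trie, "dog    cat"))  # dog cat
print(walk(trie, "dogcat"))      # None
```

Every extra space just lands on the same node again, so the walk is still at "dog " when the "c" arrives.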
If you want to modify the library code:
  def __setitem__(self, keyword, clean_name=None):
      """To add keyword to the dictionary
      pass the keyword and the clean name it maps to.
      Args:
          keyword : string
              keyword that you want to identify
          clean_name : string
              clean term for that keyword that you would want to get back in return or replace
              if not provided, keyword will be used as the clean name also.
      Examples:
          >>> keyword_processor['Big Apple'] = 'New York'
      """
      status = False
      if not clean_name and keyword:
          clean_name = keyword
      if keyword and clean_name:
          if not self.case_sensitive:
              keyword = keyword.lower()
          current_dict = self.keyword_trie_dict
          for letter in keyword:
              current_dict = current_dict.setdefault(letter, {})
+             if letter == " ":
+                 # self-loop on the node after the space, same as yy[' '] = yy
+                 current_dict[" "] = current_dict
          if self._keyword not in current_dict:
              status = True
              self._terms_in_trie += 1
          current_dict[self._keyword] = clean_name
      return status
Could you explain what exactly happens when you do
proc.keyword_trie_dict
yy = proc.keyword_trie_dict['d']['o']['g'][' ']
yy[' '] = yy
If I want to use a big dictionary of keywords, do I have to apply this method to every single word?
dog -> proc.keyword_trie_dict['d']['o']['g'][' '], cat -> proc.keyword_trie_dict['c']['a']['t'][' ']
@sambaPython24 This is the equivalent of treating every whitespace as a potential sequence of whitespaces. This is what you see in the dog example.
The code change permits this without having to do it for every word in your corpus.
i.e. doing the change above
      current_dict = current_dict.setdefault(letter, {})
+     if letter == " ":
+         # self-loop on the node after the space, same as yy[' '] = yy
+         current_dict[" "] = current_dict
Saves you from doing that for every word:
dog -> proc.keyword_trie_dict['d']['o']['g'][' '], cat -> proc.keyword_trie_dict['c']['a']['t'][' ']
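As a rough illustration of the "no per-word patching" point, the sketch below adds the self-loop during insertion on a plain dict trie. It mirrors the manual `yy[' '] = yy` step; `END`, `add_keyword` and `lookup` are hypothetical names for this toy version, not flashtext's API.

```python
END = "_end_"  # hypothetical end-of-keyword marker, not flashtext's own key

def add_keyword(trie: dict, keyword: str) -> None:
    """Insert `keyword`; the node after every space loops to itself on " "."""
    node = trie
    for letter in keyword:
        node = node.setdefault(letter, {})
        if letter == " ":
            node[" "] = node  # extra spaces stay on the same node
    node[END] = keyword

def lookup(trie: dict, text: str):
    """Return the keyword that `text` spells out in the trie, else None."""
    node = trie
    for letter in text:
        node = node.get(letter)
        if node is None:
            return None
    return node.get(END)

trie = {}
for kw in ("Big Apple", "Bay Area", "dog cat"):
    add_keyword(trie, kw)

print(lookup(trie, "Big    Apple"))  # Big Apple
print(lookup(trie, "Bay  Area"))     # Bay Area
print(lookup(trie, "BigApple"))      # None
```

One thing to keep in mind: the trie is now cyclic, so any code that recursively traverses the whole structure needs a guard against revisiting nodes.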