fast-autocomplete
                                
                            
                            
                            
Adding an underscore character to valid characters ignores underscore
Describe the bug
Adding a special character (an underscore) to valid_chars_for_string does not exclude results that lack that character until the search string contains two mismatches.
To Reproduce
Initialized with
import string
from fast_autocomplete import AutoComplete

_valid_chars = '_' + string.ascii_lowercase
words = {'i_love_code': {'count': 5}, 'island': {'count': 2}, 'ironman': {'count': 2}, 'i_love_coding': {'count': 2}, 'i_love_machine_learning': {'count': 3}}
autocomplete = AutoComplete(words=words, synonyms={}, valid_chars_for_string=_valid_chars)
# search_string takes each of the simulated inputs listed below
autocomplete.search(word=search_string, max_cost=1, size=10)
Formatted output with simulated input:
Valid Characters: _abcdefghijklmnopqrstuvwxyz
Search Input 'i' : i_love_code, i_love_machine_learning, island, ironman, i_love_coding
Search Input 'ir' : ironman
Search Input 'iro' : ironman
Search Input 'iron' : ironman
Search Input 'iron_' : ironman
Search Input 'iron_m' : ironman
Search Input 'iron_ma' : 
Search Input 'iron_mai' : 
Search Input 'iron_maid' : 
Search Input 'iron_maide' : 
Search Input 'iron_maiden' : 
Search Input 'i' : i_love_code, i_love_machine_learning, island, ironman, i_love_coding, iron_maiden
Expected behavior
Much like the input 'ir' excludes 'i_love_code', I would expect 'iron_' to exclude 'ironman', and so forth. From this output, it looks like 'ironman' is only excluded once the input reaches 'iron_ma'.
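To make the expectation concrete, here is a minimal sketch of strict prefix matching over the same words dictionary. It is not how fast-autocomplete searches; `strict_prefix_matches` is a hypothetical helper written only to show which results the report expects for each input.

```python
def strict_prefix_matches(words, query):
    """Return the words that start with `query`, highest count first.

    Hypothetical helper for illustration only; this is plain prefix
    matching, not fast-autocomplete's fuzzy search.
    """
    hits = [word for word in words if word.startswith(query)]
    # sorted() is stable, so words with equal counts keep insertion order.
    return sorted(hits, key=lambda word: words[word]["count"], reverse=True)


words = {
    "i_love_code": {"count": 5},
    "island": {"count": 2},
    "ironman": {"count": 2},
    "i_love_coding": {"count": 2},
    "i_love_machine_learning": {"count": 3},
}

for query in ("i", "ir", "iron", "iron_", "iron_m"):
    print(f"Search Input {query!r} :", ", ".join(strict_prefix_matches(words, query)))
```

Under this strict interpretation, 'iron' still returns 'ironman', while 'iron_' and every longer input return nothing, because no word in the dictionary starts with 'iron_'.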
OS, DeepDiff version and Python version (please complete the following information):
- OS: Ubuntu 21.04 + pyenv
- Python 3.9.7
Additional context
This seems to have something to do with the max_cost parameter. If I raise it above 2, it matches even more unexpected results.
Hi @lazzarello, the fuzzy matching logic still sees enough similarity between the search string and 'ironman' to include it in the results. You are right that the underscore character is treated differently: internally, we convert all spaces into underscores. Maybe we should switch from the underscore to a rarely used Unicode character for that purpose.
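The idea of swapping the internal space marker could look roughly like the sketch below. This is an assumption about how such a change might be shaped, not the library's actual code: `SPACE_PLACEHOLDER`, `normalize`, and `denormalize` are hypothetical names, and the private-use code point U+E000 is just one possible choice of a rarely used character.

```python
# Hypothetical placeholder: a Unicode private-use code point that is
# unlikely to appear in real user input.
SPACE_PLACEHOLDER = "\ue000"


def normalize(text):
    """Lowercase the text and replace spaces with the placeholder.

    Literal underscores are left untouched, so they stay distinct
    from spaces after normalization.
    """
    return text.lower().replace(" ", SPACE_PLACEHOLDER)


def denormalize(text):
    """Turn placeholders back into spaces for display."""
    return text.replace(SPACE_PLACEHOLDER, " ")


# With an underscore as the space marker, these two inputs would collide.
print(normalize("iron man") == normalize("iron_man"))
print(denormalize(normalize("I love code")))
```

With a dedicated placeholder, 'iron man' and 'iron_man' normalize to different strings, so an underscore listed in valid_chars_for_string would no longer share a representation with a space.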