confusables
Also care about invisible characters
A few characters are displayed as an empty string, for example:
\u200b
\u200c
\u200d
\u200e
\u200f
They can be mixed into any string and thus bypass the confusable detection: because they are not visible, the strings look the same.
>>> print("f\u200boo")
foo
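A quick way to confirm that the two strings really differ, even though they print the same. This is only an illustrative check; the variable names are made up for the example.

```python
visible = "foo"
hidden = "f\u200boo"  # zero-width space hidden after the "f"

print(visible == hidden)           # False
print(len(visible), len(hidden))   # 3 4
print(ascii(hidden))               # 'f\u200boo'
```

Printing with ascii() (or repr()) is a handy way to spot the hidden character while debugging.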
You can use the unicodedata.category function to go through a string, look up the category of each character, and take out the ones you don't want.
For example, all the characters you listed are part of the "Cf" category ("Format"). I've found the following categories are a good starting point: Cc, Cf, Cs, Co, Cn, Sk. Admittedly, I haven't checked whether every single character in those categories is truly invisible, but I think they effectively are.
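Since the list above is described as unchecked, it may be worth looking up a few characters before relying on it. The sketch below prints the category of the five characters from the original post plus a handful of ordinary ASCII characters; the choice of sample characters is mine, not from the thread.

```python
import unicodedata

# The five characters from the original post, then a few everyday ones.
sample_chars = ["\u200b", "\u200c", "\u200d", "\u200e", "\u200f", "\n", "\t", "^", "`"]

categories = {ascii(char): unicodedata.category(char) for char in sample_chars}

for name, category in categories.items():
    print(name, category)
```

With the Unicode data shipped in current Python versions, the five zero-width and directional characters report Cf, while "\n" and "\t" report Cc, and "^" and "`" report Sk. So stripping Cc also removes line breaks and tabs, and stripping Sk removes some visible ASCII characters, which may or may not be what you want.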
Another good one to add is the "Mn" category, which contains all those Ź̴̭͉̻Ă̶̘͖L̴̡̘͓̓̈́̉G̶̗͊O̷̺͖͌̕ characters, so after removing them, the string can be searched as normal.
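As a small illustration of the "Mn" point, here is a sketch that strips combining marks from a hand-built Zalgo-style string. The helper name strip_marks and the sample string are hypothetical, chosen only for this example.

```python
import unicodedata

# "ZALGO" with combining marks from the U+0300-U+036F block after each letter.
zalgo = "Z\u0341\u0334A\u0306\u0336L\u0334\u0343G\u034a\u0336O\u034c\u0337"

def strip_marks(text):
    # Drop every character whose category is Mn (nonspacing mark).
    return "".join(char for char in text if unicodedata.category(char) != "Mn")

clean = strip_marks(zalgo)
print(clean)                            # ZALGO
print("ALG" in zalgo, "ALG" in clean)   # False True
```

One caveat: Mn also covers legitimate accents in decomposed text (for example "e\u0301"), so this will strip those too.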
Here's the code I use:
import unicodedata

def remove_unicode_categories(string):
    # Categories to strip: nonspacing marks, control, format, surrogate, private use, unassigned, modifier symbols
    unicodeCategoriesStrip = ["Mn", "Cc", "Cf", "Cs", "Co", "Cn", "Sk"]
    return "".join(char for char in string if unicodedata.category(char) not in unicodeCategoriesStrip)

textNormalized = remove_unicode_categories(commentText)
You can probably move the category list outside the function to make it more efficient, but you get the idea.
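For what it's worth, moving the list out might look something like this. It is only a sketch: the constant name UNICODE_CATEGORIES_STRIP and the switch to a frozenset are my own choices, not something from the project.

```python
import unicodedata

# Built once at import time; a frozenset keeps membership checks cheap.
UNICODE_CATEGORIES_STRIP = frozenset({"Mn", "Cc", "Cf", "Cs", "Co", "Cn", "Sk"})

def remove_unicode_categories(string):
    # Keep only characters whose category is not in the strip set.
    return "".join(
        char for char in string
        if unicodedata.category(char) not in UNICODE_CATEGORIES_STRIP
    )

print(remove_unicode_categories("f\u200boo"))  # foo
```

With only seven entries the speed difference is probably small, so this is mostly about not rebuilding the list on every call.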
Sounds good. But could this be added to the project?