Also care about invisible characters

Open ChillerDragon opened this issue 3 years ago • 2 comments

There are a few characters that are displayed as an empty string, for example:

\u200b
\u200c
\u200d
\u200e
\u200f

They can be mixed into any string and thus bypass the confusable detection: since they are not visible, the strings look the same.

>>> print("f\u200boo")
f​oo
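
To make the problem concrete, here is a minimal sketch (the variable names are only illustrative) showing that the two strings render the same but are different as far as Python is concerned:

```python
plain = "foo"
sneaky = "f\u200boo"  # ZERO WIDTH SPACE hidden between "f" and "oo"

# The strings look identical when printed, but they do not compare equal.
print(plain == sneaky)           # False
print(len(plain), len(sneaky))   # 3 4

# ascii() escapes non-ASCII characters, which makes the hidden one visible.
print(ascii(sneaky))             # 'f\u200boo'
```

Any check that compares the raw strings, or looks them up character by character, will therefore treat them as different inputs.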

ChillerDragon, Jul 25 '21 19:07

You can use the unicodedata.category function to go through a string and look up the 'category' of each character, then take them out.

For example, all the characters you listed are part of the "Cf" category ('Format'). I've found the following categories to be a good starting point: Cc, Cf, Cs, Co, Cn, Sk. Admittedly, I haven't checked whether every single character in those categories is truly invisible, but I think they effectively are.

Another good one to add is the "Mn" category, which contains all those Ź̴̭͉̻Ă̶̘͖L̴̡̘͓̓̈́̉G̶̗͊O̷̺͖͌̕ characters, so after removing them, the string can be searched as normal.
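
A quick way to check these claims is to print the category of each character. This is only a sketch using the standard unicodedata module, and the sample characters in the second loop are my own picks rather than anything from the thread. It also shows a caveat worth knowing before stripping whole categories: "Sk" contains the visible ASCII characters ^ and `, and "Cc" contains newline and tab.

```python
import unicodedata

# The five characters from the original report.
invisible = ["\u200b", "\u200c", "\u200d", "\u200e", "\u200f"]
for ch in invisible:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# A combining mark of the kind stacked up in "zalgo" text.
print(unicodedata.category("\u0301"))  # Mn (COMBINING ACUTE ACCENT)

# Caveat: these ordinary characters are in the strip list's categories too.
for ch in ["^", "`", "\n", "\t"]:
    print(repr(ch), unicodedata.category(ch))
```

If carets, backticks, or line breaks matter for the text being checked, it may be safer to drop "Sk" from the list or to special-case whitespace.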

Here's the code I use:

import unicodedata

def remove_unicode_categories(string):
    # Keep only characters whose Unicode general category is not in the strip list.
    unicodeCategoriesStrip = ["Mn", "Cc", "Cf", "Cs", "Co", "Cn", "Sk"]
    return "".join(char for char in string if unicodedata.category(char) not in unicodeCategoriesStrip)

textNormalized = remove_unicode_categories(commentText)

You can probably move the category list variable outside the function to make it more efficient, but you get the idea.
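
As a rough sketch of that variation: the list becomes a module-level frozenset, so it is built once and membership tests are constant time. The name STRIP_CATEGORIES and the optional categories parameter are my own additions for illustration, not part of the code above or of the confusables project.

```python
import unicodedata

# Built once at import time instead of on every call.
STRIP_CATEGORIES = frozenset({"Mn", "Cc", "Cf", "Cs", "Co", "Cn", "Sk"})

def remove_unicode_categories(string, categories=STRIP_CATEGORIES):
    """Return string without characters whose general category is in categories."""
    return "".join(
        char for char in string
        if unicodedata.category(char) not in categories
    )

print(remove_unicode_categories("f\u200boo") == "foo")  # True
```

Passing a smaller set, for example STRIP_CATEGORIES - {"Sk", "Cc"}, keeps carets, backticks, and line breaks while still removing the zero-width characters.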

ThioJoe, Feb 17 '22 15:02

Sounds good. But could this be added to the project?

ChillerDragon, Feb 18 '22 08:02