homoglyph
API request: String Homoglyph.toASCII(String)
Please provide a toASCII API that maps each character to its closest look-alike in the ASCII range and returns the resulting string. For example, the following should hold:
Homoglyph homoglyph = HomoglyphBuilder.build();
assertEquals("The quick brown fox jumps over the lazy dog",
homoglyph.toASCII("Τһе ԛυіϲκ Ьгоѡɴ ғох јυⅿрѕ оⅴег τһе ⅼаzу ԁоɡ"));
This is useful when we want to run complex regex rules against an (approximate) ASCII representation of the text. Building an equivalent regex tree on top of the Homoglyph.search() API is not convenient, at least in certain cases.
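To make the request concrete, here is a minimal sketch of the folding step, assuming a small hand-written lookup table. AsciiFoldSketch, TO_ASCII and toAscii are hypothetical names for illustration and are not part of the library; a real table would be far larger.

```java
import java.util.Map;
import java.util.regex.Pattern;

class AsciiFoldSketch {
    // Hypothetical, deliberately tiny table: confusable code point -> ASCII look-alike.
    static final Map<Integer, Character> TO_ASCII = Map.of(
            0x0430, 'a',  // CYRILLIC SMALL LETTER A
            0x0435, 'e',  // CYRILLIC SMALL LETTER IE
            0x043E, 'o',  // CYRILLIC SMALL LETTER O
            0x0440, 'p',  // CYRILLIC SMALL LETTER ER
            0x0441, 'c',  // CYRILLIC SMALL LETTER ES
            0x0445, 'x',  // CYRILLIC SMALL LETTER HA
            0x0456, 'i',  // CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
            0x04BB, 'h',  // CYRILLIC SMALL LETTER SHHA
            0x03A4, 'T'); // GREEK CAPITAL LETTER TAU

    // Replaces every code point found in the table; everything else passes through unchanged.
    static String toAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            Character ascii = TO_ASCII.get(cp);
            if (ascii != null) {
                out.append(ascii.charValue());
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The first two letters are Cyrillic look-alikes written as escapes.
        String folded = toAscii("a \u0441\u0430t sat");
        // An ordinary regex can now be applied to the folded text.
        System.out.println(Pattern.compile("\\bcat\\b").matcher(folded).find());
    }
}
```

With something like this in place, existing regex rules could run unchanged on the folded string instead of being rebuilt around search().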
This would be extremely useful to me. I was looking for a canonicalize function that would do essentially the same thing. I think that, perhaps, canonicalize is a better name than toASCII, since the "base" characters may not be strictly ASCII.
This change is probably more complicated than it first appears. I don't think a single 'canonical' set of characters could be defined that would make sense for everyone. It would vary depending on the language of the user, and also on the expected content of the text (for example, should the digit '1' be replaced with the letter 'l' or left as it is?). I think the library would have to allow the user to specify what they consider canonical. In addition, it isn't obvious what the correct behaviour should be if letters within the canonical set are homoglyphs of each other. For example, if we just say that the 26 letters of the English alphabet are canonical, do we change the digit '1' to lower-case 'l' or to capital 'I'?
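One possible shape for a user-specified canonical set, as a rough sketch only: the caller passes the canonical characters in priority order, and the first one that belongs to the same confusable group wins. CanonicalizerSketch, GROUPS, canonicalize and pick are hypothetical names, the two groups are illustrative, and the code handles only BMP characters for brevity.

```java
import java.util.List;
import java.util.Set;

class CanonicalizerSketch {
    // Hypothetical confusable groups; a real implementation would load a full table.
    static final List<Set<Character>> GROUPS = List.of(
            Set.of('1', 'l', 'I', '|'),
            Set.of('0', 'O', 'o'));

    // canonicalByPriority lists the caller's canonical characters, most preferred first.
    static String canonicalize(String text, String canonicalByPriority) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            out.append(pick(c, canonicalByPriority));
        }
        return out.toString();
    }

    static char pick(char c, String canonicalByPriority) {
        // A character that is already canonical is never rewritten.
        if (canonicalByPriority.indexOf(c) >= 0) {
            return c;
        }
        for (Set<Character> group : GROUPS) {
            if (!group.contains(c)) {
                continue;
            }
            // Ties inside a group are broken by the caller's priority order.
            for (char candidate : canonicalByPriority.toCharArray()) {
                if (group.contains(candidate)) {
                    return candidate;
                }
            }
        }
        return c; // no canonical look-alike known, leave unchanged
    }

    public static void main(String[] args) {
        String lower = "abcdefghijklmnopqrstuvwxyz";
        System.out.println(canonicalize("he1lo", lower));                // '1' becomes 'l'
        System.out.println(canonicalize("he1lo", "0123456789" + lower)); // '1' stays a digit
    }
}
```

Under this scheme the '1' question is answered by the caller rather than the library: lower-case letters first gives 'l', upper-case letters first gives 'I', and including digits in the canonical set leaves '1' alone.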
I welcome any suggestions regarding a good way to handle this.
Running into this - we'd find it very useful to be able to regex-match including homoglyphs, and normalisation is definitely the only way to handle this.
My suggestion would be to prioritise normalising to letters - the main use-case for a library like this is automated chat moderation; it's unlikely for numbers to be useful matches for problematic content (in my opinion).
Of course, this doesn't solve the latter part of your question - I think the only real solution there is to support generating permutations instead; then they can all be tested.
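As a sketch of what I mean by permutations: expand every ambiguous character into all of its letter alternatives, then test each candidate string against the regex. PermutationSketch, ALTERNATIVES, candidates and anyMatch are made-up names, the table is illustrative, and nothing here is existing library API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class PermutationSketch {
    // Hypothetical table: ambiguous character -> every letter it could stand for.
    static final Map<Character, String> ALTERNATIVES = Map.of(
            '1', "lI",
            '0', "oO");

    // Builds every candidate string by taking the cartesian product of the alternatives.
    static List<String> candidates(String text) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char c : text.toCharArray()) {
            // Unambiguous characters have exactly one option: themselves.
            String options = ALTERNATIVES.getOrDefault(c, String.valueOf(c));
            List<String> next = new ArrayList<>(results.size() * options.length());
            for (String prefix : results) {
                for (char option : options.toCharArray()) {
                    next.add(prefix + option);
                }
            }
            results = next;
        }
        return results;
    }

    // True if at least one candidate contains a match for the pattern.
    static boolean anyMatch(String text, Pattern pattern) {
        for (String candidate : candidates(text)) {
            if (pattern.matcher(candidate).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(candidates("he11o")); // four candidates, "hello" among them
        System.out.println(anyMatch("he11o", Pattern.compile("hello")));
    }
}
```

The number of candidates grows exponentially with the number of ambiguous characters, so in practice this would need a cap on input length or on the number of expansions.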