unicode-confusable
Canonical versions?
It would be cool to be able to use this to prevent similar-looking usernames in a DB. What would it take to make that happen? I imagine one helpful thing would be a function that converts a string into some identifier or canonical form.
Unicode::Confusable.clarify("ℜ𝘂ᖯʏ") # => "Ruby"
# or
Unicode::Confusable.id_for("ℜ𝘂ᖯʏ") # => "123abc"
Unicode::Confusable.id_for("Ruby") # => "123abc"
The comparison process creates a "skeleton" of each string, which is then compared directly (= same bytes) with the other string's skeleton. I haven't mentioned in the README how to generate it, but it is available as a public method:
Unicode::Confusable.skeleton("ℜ𝘂ᖯʏ") #=> "Ruby"
However, it is not a unique representation, quoting the standard:
Note: The strings skeleton(X) and skeleton(Y) are not intended for display, storage or
transmission. They should be thought of as an intermediate processing form, similar to a
hashcode. The characters in skeleton(X) and skeleton(Y) are not guaranteed to be identifier
characters.
So it cannot act as an identifier on its own, but personally, I cannot see what would be wrong with storing it in an additional database column that you then use for skeleton lookups.
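Here is a rough sketch of the extra-column idea, with an in-memory hash standing in for a database table that has a unique index on the skeleton column. `UsernameRegistry`, `register`, `taken_by` and `TOY_SKELETON` are hypothetical names made up for illustration. `TOY_SKELETON` only folds the four look-alike characters from the example above so the snippet runs without the gem installed; it is not the real skeleton algorithm.

```ruby
# Hypothetical stand-in for Unicode::Confusable.skeleton. It maps only the
# four look-alike characters from the example above and nothing else.
TOY_SKELETON = lambda do |string|
  string.tr("ℜ𝘂ᖯʏ", "Ruby")
end

# Hypothetical registry: the hash plays the role of a table with a
# unique index on a "skeleton" column next to the original username.
class UsernameRegistry
  def initialize(skeletonizer)
    @skeletonizer = skeletonizer
    @names_by_skeleton = {}
  end

  # Stores the name and returns true, or returns false if a name with the
  # same skeleton is already registered.
  def register(name)
    key = @skeletonizer.call(name)
    return false if @names_by_skeleton.key?(key)

    @names_by_skeleton[key] = name # keep the original for display
    true
  end

  # Returns the already registered name that collides with `name`, or nil.
  def taken_by(name)
    @names_by_skeleton[@skeletonizer.call(name)]
  end
end

registry = UsernameRegistry.new(TOY_SKELETON)
registry.register("Ruby")  # accepted
registry.register("ℜ𝘂ᖯʏ")  # rejected, same skeleton as "Ruby"
```

With the gem loaded, passing `Unicode::Confusable.method(:skeleton)` instead of `TOY_SKELETON` should work the same way. The original username is what gets displayed; the skeleton is used only for lookups, in line with the note from the standard.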
Related specifications (unicode identifiers/security):
- http://www.unicode.org/reports/tr31/
- http://www.unicode.org/reports/tr39/
- http://www.unicode.org/reports/tr36/
Awesome. Any idea why the spec says you can't use them for identification purposes? It seems pretty accurate to me, based on some of the tests I ran.