architecture
Internationalization: Using Unicode normalization when uniqueness/exact match matters
Context
I was working on a WebAuthn authentication provider and noticed that the standard authentication provider does not apply Unicode normalization when comparing values that must match exactly or be unique (usernames and passwords).
Normalization is a NIST recommendation that prevents errors when the same Unicode string can be represented in multiple ways.
I implemented Unicode normalization in my authentication provider, but I feel that it is an anti-pattern to duplicate password-hashing code. I have tests for my implementation.
Proposal
Apply NFKC normalization to strings that need to be checked for uniqueness or used in hashes.
At a minimum this includes user-entered names, usernames, and passwords.
This follows section 5.1.1.2 of NIST SP 800-63B.
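A minimal sketch of the proposal, assuming Python's standard `unicodedata` module. The helper names `normalize_credential` and `hash_password` are hypothetical, and PBKDF2 is only a stand-in for whatever hashing scheme the existing provider uses:

```python
import hashlib
import os
import unicodedata


def normalize_credential(value: str) -> str:
    """Apply NFKC so equivalent Unicode strings compare and hash identically."""
    return unicodedata.normalize("NFKC", value)


def hash_password(password: str, salt: bytes) -> bytes:
    """Normalize first, then hash the UTF-8 bytes (PBKDF2 is a stand-in)."""
    normalized = normalize_credential(password).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", normalized, salt, 100_000)


salt = os.urandom(16)
composed = "caf\u00e9"     # "é" as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

# The raw strings differ, but their normalized hashes agree.
assert composed != decomposed
assert hash_password(composed, salt) == hash_password(decomposed, salt)
```

Normalizing in one shared helper, rather than in each provider, would avoid the duplication of password-hashing code mentioned above.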
Consequences
In rare cases, two usernames that look distinct but normalize to the same string may conflict. In rare cases, a password that has more than one Unicode representation may fail to match. In that case, the exact Unicode encoding currently depends on the browser.
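The username conflict can be illustrated with a small sketch. The in-memory `registered` dictionary and the `register` and `username_key` helpers are hypothetical and only stand in for a real user store:

```python
import unicodedata


def username_key(username: str) -> str:
    """Return the NFKC form used as the uniqueness key."""
    return unicodedata.normalize("NFKC", username)


# Maps normalized key -> username as originally entered.
registered = {}


def register(username: str) -> bool:
    """Register a username; return False if its normalized form is taken."""
    key = username_key(username)
    if key in registered:
        return False
    registered[key] = username
    return True


first = register("\uFB01ona")  # begins with the "fi" ligature, U+FB01
second = register("fiona")     # plain ASCII letters

# Both spellings normalize to "fiona", so the second registration conflicts.
assert first is True
assert second is False
```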
Example from unicode.org:
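import unicodedata
def unicode_normalize(s):  # applies NFKC normalization
    return unicodedata.normalize("NFKC", s)
# NFKC yields the composed form: "A" + U+0308 becomes U+00C4, and the U+FB03 ligature becomes "ffi"
assert "\u00C4ffin" == unicode_normalize("A\u0308ffin")
assert "\u00C4ffin" == unicode_normalize("\u00C4\uFB03n")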