Unicode normalization could change the structure of a URL
The current URL standard is subject to the HostSplit security attack and other similar ones, in which URL strings that are compatible in the sense of the Unicode standard can be parsed into significantly different URL records.
Problematic URL Records
The following URLs are shown together with their parsing results for convenience; I used NFKC for the sake of uniformity. NFD has a similar but different issue (shown in the appendix below), and NFKD is arguably the most dangerous because it inherits all the problems from both NFKC and NFD.
Examples with Changing Hostnames
- Before NFKC: https://whatwg.org＃@evil.com (＃ is U+FF03, the fullwidth number sign)
  - username: whatwg.org%EF%BC%83
  - hostname: evil.com

  After NFKC: https://whatwg.org#@evil.com
  - hostname: whatwg.org (changed from evil.com)
  - hash: #@evil.com
  - username: (empty)
- Before NFKC: https://whatwg%2Eorg／@evil.com (／ is U+FF0F, the fullwidth solidus)
  - username: whatwg%2Eorg%EF%BC%8F
  - hostname: evil.com

  After NFKC: https://whatwg%2Eorg/@evil.com
  - hostname: whatwg.org (changed from evil.com)
  - pathname: /@evil.com
  - username: (empty)
Examples with Changing Usernames
- Before NFKC: https://user：evil:pass@example.com (： is U+FF1A, the fullwidth colon)
  - username: user%EF%BC%9Aevil
  - password: pass

  After NFKC: https://user:evil:pass@example.com
  - username: user (changed from user%EF%BC%9Aevil)
  - password: evil:pass
Examples with Changing Pathnames
- Before NFKC: https://example.com/﹖a=b (﹖ is U+FE56, the small question mark)
  - pathname: /%EF%B9%96a=b

  After NFKC: https://example.com/?a=b
  - pathname: / (changed from /%EF%B9%96a=b)
  - search: ?a=b
Security Risks
All these examples show that the structure of a URL can be significantly changed by Unicode normalization. Applications that process Unicode strings could unintentionally change the structure of URLs; such discrepancies can then be used to confuse end users or even bypass security checks. The attacks are not limited to the above examples, and multiple URL components can be changed at the same time.
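For concreteness, here is a minimal illustrative sketch of the first hostname example using the WHATWG URL API (available in browsers and Node.js); the variable names are mine and the expected outputs are shown in comments.

```ts
// Sketch: parse the same string before and after NFKC normalization.
// "\uFF03" is U+FF03 FULLWIDTH NUMBER SIGN (＃).
const raw = "https://whatwg.org\uFF03@evil.com";

const before = new URL(raw);
console.log(before.hostname); // "evil.com"  (＃ is not a fragment delimiter)
console.log(before.username); // "whatwg.org%EF%BC%83"

const after = new URL(raw.normalize("NFKC"));
console.log(after.hostname);  // "whatwg.org"  (the ASCII # now starts the fragment)
console.log(after.hash);      // "#@evil.com"
```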
Current Prevention Mechanisms
UTS 46 and related standards specify the STD3 Rules, which disallow or map problematic characters in host names. Unfortunately, they also forbid the underscore, which creates tension with some existing hosts. Instead of directly using the STD3 Rules, the current URL standard checks for forbidden host code points after mapping those problematic characters, which in a sense could be seen as a weaker version of the STD3 Rules.
However, all these checks focus only on host names, not other parts of URLs which are also vulnerable. Therefore, they are not sufficient to prevent the attacks shown above. To prevent such attacks, a similar restriction needs to be imposed on all parts of URL records, not just host names.
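To illustrate the asymmetry, here is a small sketch (illustrative only, not part of the standard): a forbidden host code point such as < makes parsing fail outright, while the same character in another component is simply percent-encoded and accepted.

```ts
// Sketch: "<" is a forbidden host code point, so it is rejected in the host,
// but in the path it is merely percent-encoded.
try {
  new URL("https://exam<ple.com/"); // "<" in the host: parsing fails
} catch (e) {
  console.log("rejected:", (e as Error).message);
}

console.log(new URL("https://example.com/a<b").pathname); // "/a%3Cb"
```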
Suggested Changes to the Standard
A proper fix would probably be similar to the checks we have done to host names. I am not an expert in the URL and Unicode standards, merely a concerned user after reading those documents. If this is an actual security concern, other people should step in and propose a fix to prevent such attacks.
Appendix 1: Problems with NFD (and also NFKD)
NFD and NFKD will normalize the following three characters to generate potentially forbidden code points:
- \u2260 (≠ as one code point) to =\u0338 (≠ as two code points)
- \u226E (≮ as one code point) to <\u0338 (≮ as two code points)
- \u226F (≯ as one code point) to >\u0338 (≯ as two code points)
The generated <, > and = code points could potentially change the URL structure or even the document structure containing URLs (e.g., HTML).
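This is easy to observe directly; the following is an illustrative sketch using the built-in normalize method.

```ts
// Sketch: NFD canonically decomposes these characters into an ASCII
// character followed by U+0338 COMBINING LONG SOLIDUS OVERLAY.
console.log("\u2260".normalize("NFD") === "=\u0338"); // true (≠ → "=" + U+0338)
console.log("\u226E".normalize("NFD") === "<\u0338"); // true (≮ → "<" + U+0338)
console.log("\u226F".normalize("NFD") === ">\u0338"); // true (≯ → ">" + U+0338)
```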
Interestingly, while UTS 46 demands that a checker banning <, >, or = should also ban their negation forms, the current URL standard does not impose this restriction and would allow their negation forms in host names (potentially Punycode-encoded).
Appendix 2: Further Information about NFC
All the above discussions assumed that NFC is safe. However, my exchanges with WHATWG gave me the impression that even NFC could be regarded as problematic. If the URL standard intends to use arbitrary code points and expect that all code points should be preserved as they are, then NFC could also pose a threat to the semantics of URL records.
Edits
- 8/4: "sanitization" was changed to "checks"
- 8/5: an appendix was added to elaborate on the danger of NFD and NFKD.
- 8/6: yet another appendix was added to discuss NFC.
I have to say that I don't see the problem here. ＃ is not #. Are you suggesting that someone who looks at a URL, but does not use a parser to obtain the host, might be misled? Because that's not a thing we try to protect against.
I have to say that I don't see the problem here.
＃ is not #. Are you suggesting that someone who looks at a URL, but does not use a parser to obtain the host, might be misled? Because that's not a thing we try to protect against.
No, a standard-compliant parser is always used before and after the Unicode normalization. It's not (just) someone checking the URLs using their human eyes and feeling confused---one can already craft long, heavily percent-encoded URLs to baffle any human being. The security issue here is that things can still go wrong even for standard-compliant parsers.
I did not elaborate on the actual attacks because I intended to just post a summary here. Let me elaborate on one attack on OAuth2 that exploits this. In CVE-2019-0654, for example, Microsoft IE/Edge dangerously normalized the content of the HTTP Location header (used in a typical OAuth2 flow to redirect users back to the requesting website), and thus you could use a URL to steal the OAuth2 token. More specifically, when the authorization server sees
https://evil.com＃@google.com
(with U+FF03 FULLWIDTH NUMBER SIGN), it would think the request is from Google and maybe approve it because of that. But when the redirect URL (with the token appended) is sent back to a user running an older version of Microsoft IE/Edge, it will be normalized to
https://evil.com#@google.com?code=xxxxx&session_state=...
and thus the OAuth2 token xxxxx---the most important part of OAuth2---will be sent to evil.com instead of google.com. At this point, the attack has succeeded.
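To make the mismatch concrete, here is an illustrative sketch based on the URLs above (the appended query is a placeholder, not taken from the CVE details):

```ts
// The redirect URL submitted by the attacker; "\uFF03" is the fullwidth ＃.
const redirectUrl = "https://evil.com\uFF03@google.com";

// 1. The authorization server parses it and sees google.com as the host:
console.log(new URL(redirectUrl).hostname); // "google.com"

// 2. A client that NFKC-normalizes the string (as described above for older
//    IE/Edge and the Location header) ends up sending the token to evil.com:
const delivered = redirectUrl.normalize("NFKC") + "?code=xxxxx";
console.log(new URL(delivered).hostname); // "evil.com"
```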
My main point is that this is more serious than an inconvenience. Here are a few other similar CVEs already listed in the HostSplit presentation:
- CVE-2019-0657: .NET and Visual Studio
- CVE-2019-9636 and CVE-2019-10160: Python
- CVE-2019-2816: Java
I am not a security expert so there might be more creative ways to exploit it. Given how widespread URLs are in various standards, some directly maintained by WHATWG, I believe this is a serious concern the working group should address. I personally prefer forbidding these characters, but if the working group decides to continue allowing these dangerous characters in URLs, I urge the working group to at least include a vivid security warning in the standard.
That seems like an issue with the authorization server (and potentially older versions of IE/Edge). Why would it not send back the serialization of what it parsed? Also, who does the normalization from ＃ to #? That's not allowed.
It's also not clear how this relates to STD3 as that is purely about ASCII code points and doesn't do normalization of non-ASCII code points as far as I know. (And you cannot outlaw https://evil.com#@google.com as that's a perfectly valid URL string; identical to https://evil.com/#@google.com.)
Why would it not send back the serialization of what it parsed?
Because that's how OAuth2 works. It seems you are proposing a new standard to replace OAuth2.
Also, who does the normalization from ＃ to #? That's not allowed.
Perhaps you assumed that normalization should never happen, but Unicode normalization is necessary to implement the Unicode standard efficiently and correctly. One cannot avoid it. It is just that certain normalization algorithms could mess up URL structures. ＃ and # are "compatible" according to the Unicode standard, and NFKC and NFKD would bring ＃ to #. (NFD could also have problems, but it is probably safe in this case.)
I am fully aware of W3C's general recommendation that NFC (and perhaps percent-encoding related stuff) should be used, but most people are not aware of the subtle differences between normalization forms and security bugs have been reported for many major frameworks as shown above.
It's also not clear how this relates to STD3 as that is purely about ASCII code points and doesn't do normalization of non-ASCII code points as far as I know.
"STD3 Rules" and "STD3" are technically different. I will review and revise my original proposal in case my wording was confusing. STD3 is a standard about ASCII code points. STD3 Rules (not STD3 itself) are to prevent dangerous non-ASCII code points which could generate ASCII code points disallowed by the STD3 after certain normalization. Please take a look at https://unicode.org/reports/tr46/#STD3_Rules and also the IDNA mapping table.
(And you cannot outlaw https://evil.com#@google.com as that's a perfectly valid URL string; identical to https://evil.com/#@google.com.)
I only proposed to outlaw https://whatwg.org＃@evil.com on the basis that ＃ could be normalized to #. I am sorry for the wrong impression that I intended to outlaw both. By "dangerous/problematic" I was only referring to ＃ or other characters with similar properties.
So OAuth2 accepts a URL string with Unicode in it and then:
- Parses that URL string into a URL for its own purposes. (And it might terminate here if the URL is not adequate for some reason.)
- Sends back that URL string for another consumer, but now NFKC/D normalized?
How is that not a bug in OAuth2?
You cannot apply Unicode normalization to all inputs, certainly not URL strings. They should only go into the URL parser.
I could see trying to disallow ＃ and similar code points, but pipelines that do this kind of (bogus) normalization on URL strings would still be susceptible to attacks, depending on when they perform the (bogus) normalization.
So OAuth2 accepts a URL string with Unicode in it and then: 1. Parses that URL string into a URL for its own purposes. (And it might terminate here if the URL is not adequate for some reason.) 2. Sends back that URL string for another consumer, but now NFKC/D normalized? How is that not a bug in OAuth2?
The OAuth2 standard, as far as I can see, is silent on Unicode normalization forms. I prefer leaving the judgment whether OAuth2 is at fault to others.
You cannot apply Unicode normalization to all inputs, certainly not URL strings. They should only go into the URL parser.
I18nWG currently recommends applying NFC to everything, which should include URLs. For example, they say the entire HTML and CSS files should be in NFC. If WHATWG disagrees with this, I suggest WHATWG work with I18nWG to reach a consensus.
I could see trying to disallow ＃ and similar code points, but pipelines that do this kind of (bogus) normalization on URL strings would still be susceptible to attacks, depending on when they perform the (bogus) normalization.
That's correct. Personally, I can be satisfied with a mechanism to detect URLs that would break incorrect pipelines; that is, a standardized, step-by-step procedure for security-sensitive applications (e.g., OAuth2 servers) to follow in case they want to rule out all problematic URLs out of an abundance of caution.
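For instance, such a procedure could look roughly like the following sketch (illustrative only, not a standardized check): reject any URL string whose parse result changes under some normalization form.

```ts
// Sketch: a URL string is "normalization stable" if parsing it after any
// Unicode normalization form yields the same serialization as parsing the
// original string.
function isNormalizationStable(input: string): boolean {
  let original: string;
  try {
    original = new URL(input).href;
  } catch {
    return false; // not parseable at all
  }
  for (const form of ["NFC", "NFD", "NFKC", "NFKD"] as const) {
    let normalized: string;
    try {
      normalized = new URL(input.normalize(form)).href;
    } catch {
      return false; // normalization made the string unparseable
    }
    if (normalized !== original) return false;
  }
  return true;
}

console.log(isNormalizationStable("https://example.com/?a=b"));           // true
console.log(isNormalizationStable("https://whatwg.org\uFF03@evil.com"));  // false
```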
Any kind of Unicode normalization that changes the scalar values of the URL string might change its meaning. A robust setup would eagerly parse a URL string and then serialize the resulting URL record before doing any kind of normalization. At that point the normalization would no-op as a serialized URL is pure ASCII.
I don't think this is in conflict with the recommendation from the i18n WG, but there are some tricky nuances. (As in, if you have a URL string whereby applying NFC would change it and its meaning, you probably better not share it in that form.)
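A minimal illustrative sketch of that ordering, using the WHATWG URL API:

```ts
// Parse and serialize first; the serialization is ASCII-only, so any later
// Unicode normalization is a no-op and cannot change the URL's structure.
const input = "https://whatwg.org\uFF03@evil.com"; // ＃ is U+FF03

const serialized = new URL(input).href;
console.log(serialized); // "https://whatwg.org%EF%BC%83@evil.com/" (pure ASCII)

console.log(serialized.normalize("NFKC") === serialized); // true: no-op
console.log(new URL(serialized).hostname);                // still "evil.com"
```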
In particular, I think the i18n WG recommendation is that documents should be written in NFC. It's not that you should take an arbitrary document, apply NFC to it, and then assume the URLs (or other semantic contents) are unchanged.
If you are converting a document to NFC you need to be careful, e.g. by using a process like the one @annevk describes.
At this point I do not really have anything technical to add. As for the I18n WG, the webpage at https://www.w3.org/International/questions/qa-html-css-normalization says:
If you do have problems, you should find an editor or conversion tool that allows you to specify the normalization form, and use that to re-save your page.
I interpreted this suggestion as "it is okay to apply NFC to any HTML and CSS". However, I am not planning to argue strongly about this (in my opinion) minor point.
I'll reiterate my suggestion I gave elsewhere about raising this concern on https://lists.w3.org/Archives/Public/www-international/.
I'll reiterate my suggestion I gave elsewhere about raising this concern on https://lists.w3.org/Archives/Public/www-international/.
@annevk Thanks for your suggestion, but I am sorry I feel quite discouraged at this point. If you, @domenic, or anyone else thinks this is a genuinely interesting issue, please feel free to bring up the issue in the mailing list by yourself. I also assume that you two have much better connections and more social power to make changes happen. It is not always easy for a complete outsider like me to participate.
My personal belief is still that the URL standard is the best place to include warnings about the danger of Unicode normalization, because most people would not separately check the details of the IDNA and Unicode standards or i18n documents unless they are already aware of their complexity. Therefore, I am much less motivated to (only) work with the i18n WG to promote safer practice. I simply don't think it's an effective way to achieve my goal. My viewpoint, however, does not seem to be popular within WHATWG, and I respect that, but it does not make sense for me to act against my belief. I encourage you or anyone holding your view to use the mailing list themselves if they believe that is the best place to handle this issue.
Despite our disagreements, I thank you and @domenic again for your time and efforts in handling the communication.
Fair enough. I guess I might as well reuse this issue.
@whatwg/i18n given that URL strings can use "arbitrary" code points, they (and the resulting URL record) are subject to change if folks use Unicode normalization. Is this something that should be called out more prominently on https://www.w3.org/International/questions/qa-html-css-normalization and elsewhere?
Thanks. I updated the main text with two new appendixes that discuss NFC, NFD, and NFKD, in case they are helpful. (Originally, I only focused on the most interesting case NFKC, and now all normalization forms are mentioned.)
W3C-I18N discussed this in our teleconference of 2022-08-18 and I drew an action item to reply to this thread.
In general we agree with @annevk and @domenic's statements on this thread. If one applies Unicode normalization to a text file representing an HTML or CSS document, that can spoil code point sequences that were deliberately not normalized to start with.
Our Character Model document on string matching and our WG do not blindly recommend NFC (and definitely do not recommend any of the K forms) for Web content. We do recommend that content authors choose to use NFC wherever practical for their language, since this promotes interoperability (and since this is what most--but not all--keyboards produce). Our recommendations today are subtly different than they were e.g. 10 years ago: we think NFC is good, but we tell spec writers and implementers not to change the normalization form of content unless the user specifically asks them to do so.
As a result of this issue, we intend to revise our article https://www.w3.org/International/questions/qa-html-css-normalization in the near future.
FWIW, i uploaded a proposed rewrite of https://www.w3.org/International/questions/qa-html-css-normalization to https://w3c.github.io/i18n-drafts/questions/qa-html-css-normalization.en.html. If you have comments please send quickly so that we can proceed to publish.
@r12a Thank you! I really like the improvement. Some nitpicking:
- For the question "What are normalization forms?", the text does not say much about the K forms. Personally I feel we could explain what they are and why they are dangerous, or outright advise against them (as in the answer to the other question, maybe with a link to the Unicode standards for interested readers).
- In the question "Converting the normalization form of a page", I personally want to emphasize that the structures of URLs, HTML documents, and many others (not just individual components in them) could change. Notably, the following three cases are problematic for conversion between NFC and NFD:
- \u2260 (≠ as one code point in NFC) and =\u0338 (≠ as two code points in NFD)
- \u226E (≮ as one code point in NFC) and <\u0338 (≮ as two code points in NFD)
- \u226F (≯ as one code point in NFC) and >\u0338 (≯ as two code points in NFD)
These cases mean that <, >, and = could appear or disappear due to the conversion between NFC and NFD, and they play an important role in too many web standards. An application that does any automatic conversion is subject to injection attacks, and such attacks have happened before. This is why I started this GitHub thread, and I personally still believe the standards should ban these problematic sequences instead of telling application authors to be careful. Given that we decided to update the FAQ, I hope at least the FAQ could emphasize the dire consequences of not following the advice.
In any case, thanks for your great work!
@favonia what name should i use for acknowledgements?
I use "Kuen-Bang Hou (Favonia)" in my papers, so maybe just that. I'm also fine with no acknowledgement at all! Thanks. @r12a
Closing this issue because I believe there's nothing to do. :slightly_smiling_face: