Validating internationalized mail addresses in <input type="email">

This is more or less the same issue as https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489 but I think it's worth another look since a lot of things have changed.

The issue is that the e-mail address validation pattern in sec 4.10.5.1.5 only accepts ASCII addresses, not EAI addresses. Since last time, large hosted mail systems including Gmail, Hotmail/Outlook, Yahoo/AOL (soon if not yet), and Coremail handle EAI mail. On smaller systems Postfix and Exim have EAI support enabled by a configuration flag.

On the other side, writing a JavaScript pattern to validate EAI addresses has gotten a lot easier since JS now has Unicode character class patterns like /(\p{L}|\p{N})+/u, which matches a string of letters and digits using the Unicode definitions of letters and digits.
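
For illustration, a minimal sketch of what such a pattern does compared with an ASCII-only one. The sample strings are made up for this example, and this is not meant as a proposed validation pattern.

```javascript
// Unicode property escapes only work with the `u` flag.
// \p{L} matches any letter and \p{N} any number, in any script.
const lettersAndDigits = /^(\p{L}|\p{N})+$/u;

// ASCII-only equivalent, for comparison.
const asciiLettersAndDigits = /^[A-Za-z0-9]+$/;

// Hypothetical sample inputs.
const samples = ["user42", "пример", "fuß", "user name", "a@b"];
for (const s of samples) {
  console.log(s, lettersAndDigits.test(s), asciiLettersAndDigits.test(s));
}
```

The first three samples match the Unicode pattern, but only "user42" matches the ASCII one; the last two match neither because of the space and the "@".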

Last time around the consensus seemed to be that EAI input fields should be marked as unicode or eai or the like, since it'll be a while before all mail systems handle EAI.

For the avoidance of doubt, when I say EAI, I mean both Unicode local parts and Unicode domain names, since that's what EAI mail systems handle. There is no benefit to translating IDNs to A-labels (the ones with punycode) since that's all handled deep inside the mail system.

Apr 24 '19 02:04 jrlevine

Hey, coming here from this chrome bug. If I understand correctly, this means that we would send the email user@ß.com to the server as user@ß.com instead of the punycoded version [email protected] like we do today, and we would also allow ß@ß.com to pass validation and send it as ß@ß.com. After reading the concern in this comment, I have a hard time believing that we wouldn't break some servers somewhere. Just because mail servers tend to accept more unicode doesn't mean that every mail server everywhere does now, right?

Jun 06 '20 07:06 josepharhar

@josepharhar I agree that some servers can break (old ones, but e.g. in Poland the most popular e-mail providers are ... not working as they should), but please remember that we are still talking about client-side e-mail field validation.

RFC 6532 was not supported for a long time in many software apps (e.g. Thunderbird does really strange things when it receives non-encoded UTF-8 mail compliant with RFC 6532; it's still open in Bugzilla), but up-to-date mail servers allow creating such accounts and sending such mail (Postfix has supported it since ~2015). It's a complex problem, as e.g. delivery of UTF-8 mail to an old mailbox can lead to problems, but what else can we do other than progressively upgrade the technologies in use to support it? :)

Anyway, I don't think that it's the browser's responsibility to "protect the backend from problematic e-mail addresses", so if the RFC allows it and up-to-date software supports it, we should allow it.

Jun 06 '20 07:06 rutek

It's more complex than that, and it's not about ß which is an odd special case.

EAI (internationalized) mail can handle addresses like пример@Бориса.РФ. While the domain part can turn into ASCII A-labels xn--80abvxkh.xn--p1ai (sometimes called punycode), the mailbox cannot, and only an EAI mail system can handle that address. Common MTAs like postfix and exim have EAI support but it's not turned on by default, and there is no way a browser can tell what kind of MTA a remote server has or how it is configured. That's why we need a new input type="eaimail" that accepts EAI addresses, which web sites can use if their MTA handles EAI.
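
To make the asymmetry concrete, here is a rough sketch that converts only the domain part to A-labels. It assumes a WHATWG URL implementation (Node.js or a browser) and leans on its host parser, which follows UTS #46 rather than strict IDNA2008; `splitAddress` and `domainToALabels` are hypothetical helper names.

```javascript
// Split on the last "@": the domain part cannot contain one.
function splitAddress(address) {
  const at = address.lastIndexOf("@");
  if (at < 1 || at === address.length - 1) {
    throw new Error("expected mailbox@domain");
  }
  return { mailbox: address.slice(0, at), domain: address.slice(at + 1) };
}

// Let the URL parser do the IDNA processing; it returns A-labels.
// Sketch only: a domain string containing "/" or ":" would need extra checks.
function domainToALabels(domain) {
  return new URL(`http://${domain}`).hostname;
}

const { mailbox, domain } = splitAddress("пример@Бориса.РФ");
console.log(domainToALabels(domain));        // ASCII only, labels start with "xn--"
console.log(/^[\x00-\x7F]*$/.test(mailbox)); // false: no ASCII form exists for the mailbox
```

There is no equivalent of `domainToALabels` for the mailbox, which is why an address like this one needs an EAI-capable mail system end to end.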

The treatment of ß has nothing to do with this. The obsolete IDN2003 and current IDN2008 internationalized domain names are almost the same but one of the few differences is that 2003 normalizes (not punycodes) ß to ss while 2008 makes it a valid character. An address with an ASCII mailbox like user@ß.com could turn into [email protected] but ß@ß.com is EAI only. This turns out to matter because there are German domain names with ß in them that your browser cannot reach if it uses the obsolete rules. See my page https://fuß.standcore.com to see what your browser does.

Jun 06 '20 14:06 jrlevine

A few tiny additions and clarifications to John Levine's note (we do not disagree about the situation in any important way; the issues are just a bit more complex, with potentially broader implications, than one might infer from his message, and they may call part of his suggestion into question). In particular, "eaimail" or something like it may be the wrong solution to the problem and may dig us in even deeper. For those who lack the time or inclination to read a fairly long analysis and explanation, skip to the last paragraph.

First, while his explanation of the difficulty with ß is correct, it is perhaps useful to also note that the ß -> ss transformation is often brought about by the improper or premature application of NFKC, which may have been the source of the recent dust-up about phishing attacks using Mathematical special characters. In the latter case, IDNA2008 imposes a requirement on "lookup applications" (including browsers) to check for and reject such things, but they obviously cannot do so if the characters the IDNA interface sees are already transformed to something valid. The current version of Charmod-norm discusses, and recommends against, general application of compatibility mappings. It is perhaps also worth noting that UTS #46 is still recommending the use of NFKC (as part of NFKC_Casefold and its associated tables; see Section 5 of that document) but also calls out the problem of reaching some IDNA2008-conformant domain names if the IDNA2003 rules are followed. Because, from observation, some (perhaps many or most) browsers look to UTS #46 for authority in interpreting domain names in, e.g., URLs, while most or all SMTPUTF8 implementations (incorrectly, but commonly, known as "EAI") are strictly conformant to IDNA2008, the differences between the two introduce additional complications.

John mentions that a browser cannot tell what MTA and configuration a remote server might have, but it is even worse than that. In general, the browser is unlikely to know very much about the precise capabilities of the local MTA or Message Submission Server (MSA) unless those functions are actually built into the browser. The web page designer is even less likely to know and is in big trouble if different browsers behave differently. If the browser does not know, or cannot be configured to know, the distinction between an input type="email" and one of "eaimail" (which I hope would be called something else, perhaps "i18nemail") would not be as useful as his message implies.
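
A small sketch of why premature compatibility mapping hides things from later checks. It uses U+1D400 MATHEMATICAL BOLD CAPITAL A as a stand-in for the "Mathematical special characters" mentioned above; `looksLikePlainAscii` is a hypothetical check, not anything from IDNA or UTS #46.

```javascript
// U+1D400 MATHEMATICAL BOLD CAPITAL A has a compatibility mapping to "A".
const mathBoldA = "\u{1D400}";

// NFC leaves it alone; NFKC folds it into the plain ASCII letter.
console.log(mathBoldA.normalize("NFC") === mathBoldA); // true
console.log(mathBoldA.normalize("NFKC") === "A");      // true

// Hypothetical later check that is supposed to reject such characters.
function looksLikePlainAscii(label) {
  return /^[A-Za-z0-9-]+$/.test(label);
}

console.log(looksLikePlainAscii(mathBoldA));                   // false: rejected
console.log(looksLikePlainAscii(mathBoldA.normalize("NFKC"))); // true: the evidence is gone
```

Once NFKC has run, the check sees only "A" and has no way to know what the user actually typed.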

Thinking about these issues in terms of what mail systems do with the addresses may miss an important issue. In many cases, web pages are trying to accept and validate something that looks like an email address but is not headed immediately into a mail system. Instead, it is destined for insertion into a database or comparison with something already there, validation by some other process entirely, or is actually an email address (or something that looks like one) used as a personal identifier such as a user ID. For the latter case, conversion of the part of the string following the "@" via the Punycode algorithm may not produce a useful result whether IDNA2008, IDNA2003, or UTS #46 rules are used. I would think it would be dumb, but if someone wanted to allow 3!!!@#$%^&.ØØØ as a user ID and some system wants to allow that, we should probably stay out of their way (perhaps by insisting they use a type that does not imply an email address).

However, the other side of that example is probably relevant to the discussion. The operator or administration of a mail server, or the administrator of a system that uses email addresses as IDs, gets to pick the addresses they will allow. Especially in the ID case, if they use a set of rules narrower than what RFC 5321 allows (and that are allowed in addresses on many mail systems), then they open themselves up to many frustrations and complaints from users whose email addresses are valid according to the standards and work perfectly well on most of the Internet but that are rejected by their systems.

Internationalized addresses open up a different problem. As an example, I don't know how many mail servers identified by domains subsidiary to the 公益 TLD have allowed registration of local parts in Tamil or Syriac scripts, but I suspect that "zero" wouldn't be a bad guess. Someone designing a web site for users in China might know that and, for the best quality user experience, might want to reject or produce messages about non-Chinese local parts for that domain or perhaps even for any Chinese-script and China-based TLD. Similar rules might be applied in other places to tie the syntax of the local part to the script of the TLD but, for example in countries where multiple scripts are in use and "official", such rules might be a disaster. And, because almost anyone can set up an email server and there are clearly people on the Internet who prioritize being clever or cute or exhibiting a maximum of their freedom of expression over what others might consider sensible or rational, most of us who have been around email for many years have seen some truly bizarre (but valid) local parts of all-ASCII addresses and see no reason to believe we won't see even worse excesses as the Internet becomes increasingly internationalized.

This leads me to a conclusion that is a bit different from when this was discussed at length over a year ago. As we have seen when web sites reject legitimate ASCII local parts because people somehow got it into their heads that most non-alphanumeric characters were forbidden or were stand-ins for something else and, more broadly, because it is generally impossible to know what a remote MTA with email accounts on it will allow in those accounts, trying to validate email addresses by syntax alone is hard and may not be productive. When one starts considering email addresses (or things that look like them) that contain non-ASCII characters, things get much more difficult. IDNA2008, IDNA2003, and UTS#46 (in either profile) each have slightly different ideas about what they consider valid. Whatever any of them allow is going to be a superset of what any sensible domain or mail administrator will allow in practice. In general, a browser does not know what conventions back-end systems or a mail system at the far end of the Internet are following, much less whether they will be doing the same thing next month.

So my suggestion would be that input type="email" be interpreted and tested only as "sort of looks like an all-ASCII email address", that a new input type="i18nmail" be introduced as "looks like 'email' but with some non-ASCII characters strewn around", and that the notion of validating beyond those really general rules be left to the back-end systems, the remote "delivery" MTAs, and so on. In addition, to the extent to which one cares about the quality of the user experience, it may be time to start redesigning the APIs associated with various libraries and interfaces so that they can report back real information, more precise than "failed" or "invalid address", about why putative email addresses didn't work for them.

good luck to us all, john

Jun 07 '20 22:06 klensin

FYI, new installs of Postfix get EAI enabled by default.

My take is that a new input type is not required. An attribute by which to reject EAI is fair (e.g., because the site's MTAs don't support EAI on outbound).

Jun 10 '20 21:06 nicowilliams

s/reject/accept/ and I agree

Jun 10 '20 21:06 jrlevine

Validation on the front-end creates more ways to lose rather than more ways to win, and doesn't really protect the backend from vulnerabilities.

So I'm just not very keen on the browser doing much validation here. If the site operator has / does not have a limitation as to outbound email, I'm fine with stating it, but I'm also fine with allowing whatever, and making it the backend's job (or any scripts' on the page) to do any validation.

My take is that the default should be permissive. This should be how it is in general. Consider what happens otherwise. You might have a page and site that can handle EAI just fine but a developer forgot to update their email inputs on their pages to say so: now you have a latent bug to be found by the first user who tries to enter an internationalized address. This might mean losing user engagement, and you might never find out because why would the users tell you? But, really, why do we need the input to do so much validation? The input has to be plausibly an email address -- a subset of RFC5322, [email protected] is plenty good enough for 99.999% of users, and there is no good validation to apply to the mailbox part. This is how users get upset that they can't have [email protected]. We should stop that kind of foot self-shooting.

Jun 10 '20 22:06 nicowilliams

The user should be able to enter an email address verbatim, with no second-guessing by input forms. If that address is known a priori to be unworkable by the server's backend system, it can be rejected with an appropriate error message on the initial POST. Otherwise, if the address vaguely resembles mailbox syntax, it should be accepted and used verbatim. It may not be deliverable, but that's also true of many addresses that are syntactically boring: [email protected] may bounce while виктор1βετα@духовный.org may well be deliverable...

Jun 10 '20 22:06 vdukhovni

https://html.spec.whatwg.org/multipage/input.html#e-mail-state-(type=email) says: The value attribute, if specified and not empty, must have a value that is a single valid e-mail address.

The value sanitization algorithm is as follows: Strip newlines from the value, then strip leading and trailing ASCII whitespace from the value. This should be retained if not expanded (other whitespace?) NFC shouldn't be necessary for user typed data, but wouldn't hurt.
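
The sanitization steps quoted above can be sketched as follows. `sanitizeEmailValue` is a hypothetical name, the sample address is made up, and the set of characters treated as ASCII whitespace (TAB, LF, FF, CR, SPACE) follows the definition used by the HTML Standard.

```javascript
function sanitizeEmailValue(value) {
  // Step 1: strip newlines, i.e. remove every U+000A LF and U+000D CR.
  const withoutNewlines = value.replace(/[\n\r]/g, "");
  // Step 2: strip leading and trailing ASCII whitespace.
  return withoutNewlines.replace(/^[\t\n\f\r ]+|[\t\n\f\r ]+$/g, "");
}

console.log(JSON.stringify(sanitizeEmailValue("  user@example.com\r\n")));
// U+00A0 NO-BREAK SPACE is not ASCII whitespace, so it survives sanitization.
console.log(JSON.stringify(sanitizeEmailValue("\u00A0user@example.com")));
```

The second call shows the "other whitespace?" question: anything outside the ASCII set is left in place by the current algorithm.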

Jun 11 '20 01:06 masinter

https://html.spec.whatwg.org/multipage/input.html#e-mail-state-(type=email) says: The value attribute, if specified and not empty, must have a value that is a single valid e-mail address.

The value sanitization algorithm is as follows: Strip newlines from the value, then strip leading and trailing ASCII whitespace from the value. This should be retained if not expanded (other whitespace?) NFC shouldn't be necessary for user typed data, but wouldn't hurt.

Keep reading and in another paragraph or two you'll find the Javascript pattern they tell you to use to validate e-mail addresses.

Jun 11 '20 01:06 jrlevine

https://html.spec.whatwg.org/multipage/input.html#e-mail-state-(type=email) says: The value attribute, if specified and not empty, must have a value that is a single valid e-mail address. The value sanitization algorithm is as follows: Strip newlines from the value, then strip leading and trailing ASCII whitespace from the value. This should be retained if not expanded (other whitespace?) NFC shouldn't be necessary for user typed data, but wouldn't hurt.

Keep reading and in another paragraph or two you'll find the Javascript pattern they tell you to use to validate e-mail addresses.

The PCRE pattern behind the link is rather busted. It fails to properly validate dot-atoms, allowing multiple consecutive periods in unquoted local-parts (invalid addresses), while disallowing quoted local-parts (valid addresses). EAI-aside, this sort of fuzzy approximation of the actual requirements is harmful.
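
A quick way to see both failure modes is to run the pattern from the HTML Standard next to a stricter dot-atom check. The `dotAtom` pattern below is an illustrative, ASCII-only sketch of the RFC 5322 dot-atom shape (atoms separated by single periods), not a complete local-part validator, and the sample addresses are made up.

```javascript
// The pattern the HTML Standard gives for a "valid e-mail address".
const htmlSpecPattern = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Illustrative stricter check: one or more atoms joined by single periods.
const atext = "[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+";
const dotAtom = new RegExp("^" + atext + "(?:\\." + atext + ")*$");

// Consecutive or leading periods: accepted by the spec pattern, not a dot-atom.
console.log(htmlSpecPattern.test("a..b@example.com"), dotAtom.test("a..b"));
console.log(htmlSpecPattern.test(".a@example.com"), dotAtom.test(".a"));

// A quoted local-part is valid mail syntax but the spec pattern rejects it.
console.log(htmlSpecPattern.test('"john doe"@example.com'));
```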

Jun 11 '20 03:06 vdukhovni

Hi. Maybe it would be helpful to back up a little bit and look at this from the perspective of a fairly common use case. Suppose I have a web site that sets up or uses user accounts and that I've decided to use email addresses as user IDs (there are lots of reasons why that isn't a good idea, but the horse has left the barn and vanished over the horizon). Now, while it would probably not be a good practice, there is no inherent requirement that my system ever send email to that address -- it can be, as far as I'm concerned, just a funny-looking user ID. On the other hand, if I tell a user who has been successfully using a particular email address for a long time that their address is invalid, I am going to have one very annoyed user on my hands. If I am operating in an environment in which "user" is spelled "customer", and I don't have a better reason for rejecting that address than "W3C and WHATWG said it was ok to reject it", I may also be able to have various sales types, managers, and executives in my face.

The fact that the email address is being used as a user ID probably answers another question. Suppose the user registers with an email address using native Unicode characters in both the local part and the domain part. Now suppose they come back a few weeks later and try to sign in using the same local part but a domain part that contains A-labels. Should the two be considered to match? Remembering that this is a user ID that has the syntax of an email address, not something that is going to be used exclusively in an email context, I'd say that is a business decision and not something HTML (or browsers, or similar tools) should get into the middle of. There is one exception. One of the key differences between IDNA2003 and IDNA2008 is that, in the latter, U-labels and A-labels are guaranteed to be duals of each other. If the browser or the back-end database system are stuck in IDNA2003 or most interpretations of UTR#46, then the fact that multiple source labels can map to a single punycode-encoded form opens the door to a variety of attacks, and anyone deciding that the two are interchangeable in that environment had best be quite careful about what user names they allow and how they are treated.

It may also be a reasonable business decision in some cases for a site to say "we don't accept non-ASCII email addresses as user IDs/account identifiers" or even "we accept addresses that use these characters, or characters from a particular set of scripts, and not others". But nothing in the HTML rules about the valid syntax for email addresses should be in the middle of that decision.

Beyond that, as others have suggested, one just can't know whether an email address is valid without somehow asking the server that hosts the relevant mailbox (or its front end). It may not be possible to ask that question in real time and, even if it is, doing so is likely to require significantly more time (user-visible delay) than browser implementers have typically wanted to invest. So let's stick to syntax.

That scenario by itself argues strongly for what I think John, Nico, and others are suggesting: the only validation HTML should be performing on something that is claimed to be an email address is conformity to the syntax restrictions in RFC 6531. Could one be even more liberal than that? Yes, but why bother.

Jul 07 '20 02:07 klensin

I was actioned by the W3C I18N WG with replying to this thread with a sense of the group.

Generally, we concur with @klensin's comment just above ⬆️.

We think that type=email should accept non-ASCII addresses, the better to permit adoption of EAI and IDNA. One reason for the low adoption of these is the barriers to using them across the Web/Internet. Removing these types of artificial barriers will not only encourage adoption, but will support those users who are already using these.

Users of this feature in HTML expect that the input value follow the structural requirements of an email address but don't expect the value to be validated to be an actual valid address. At best this amounts to ensuring that there is an @ sign and maybe some other structure that can be found with a regex. Users who want to impose an ASCII restriction or do additional validation are free to do so and mostly have to do this anyway. In our opinion, HTML would thus be best off to provide minimal validation. User agents can use type=email as a hint for additional features (such as prompting the user with their own email address or providing access to the user's address book), but this is outside the realm of HTML itself.

Jul 15 '20 14:07 aphillips

I played with this a bit and it seems the current state is rather subpar, though that also leaves more room for changes. Example input: x@ñ. Firefox submits as-is (percent-encoded). Chrome submits x@xn--ida. Safari rejects (asks me to enter an email address). If you use ñ before the @ all reject (as expected).

One thing that would help here is a precise definition of the validation browsers would be expected to perform if we changed the current definition, as well as tests for that. I can't really commit for Mozilla, though if we can make this a bit more concrete I'd be happy to advocate for change.

Jul 15 '20 15:07 annevk

@aphillips @annevk just about the only thing worth validating here is the RHS of the @ -- everything else should be left to either the backend (which does or does not support internationalized mailbox names) or the MXes ultimately identified by the RHS of the @, or any MTAs in the path (which might not support internationalized mailbox names, but damn it, should).

What is the most minimal mailbox validation? Certainly: that it's not empty. Validating that the mailbox is not some garbage like just ASCII periods, and so on, might help, but getting that right is probably difficult.

So that's my advice: validate that the given address is of any RFC 5322 form that is ultimately of the form ${lhs}@${rhs}, that the RHS is a domainname, supporting U-labels because this is a UI element, as well as A-labels, and validate that the LHS is not empty, and keep any further LHS validation to the utter minimum, in particular not rejecting non-ASCII Unicode.

Jul 15 '20 17:07 nicowilliams

@annevk, I think your examples actually point out the problem. In order: it would be rare, but not impossible (details on request but I want to keep this relatively short) to see on on the RHS of the "@", and % is prohibited by the syntax in RFC 5321, but I'd generally recommend the use of percent-encoding in any part of email addresses. Pushing a domain-part through Punycode is prohibited by IDNA unless the labels it contains are validated to be U-labels. I can't tell from your example but if, e.g., the domain-part of the mailbox was \u1D7AA\u1D7C2 then it should be rejected, not encoded with punycode: doing otherwise invites errors down the line, errors for which the user gets obscure and/or misleading messages.

The problem is that email addresses with non-ASCII characters in the local-part and/or domain part are now valid and increasing numbers of people who can use them for email are expecting to use them through web interfaces.
Keeping in mind that a browser cannot ever fully "validate" an email address (something that would require knowing that the mailbox [email protected] exists but [email protected] does not) I suggest:

(1) If a mailbox consists of a string of between 1 and 64 octets, an "@", and at least 2 and up to 255 more octets, treat it as acceptable and move on, understanding that all sorts of things may apply additional restrictions in actual email handling.

(2) In addition, if you wanted to and the domain-part contained non-ASCII characters, you could verify that any labels were valid IDNA2008 U-labels and reject the name if they were not ("invalid domain name in email address:" would be a much better message than "invalid email address") AND, optionally iff the local-part was entirely ASCII, convert those U-labels to A-labels. The SMTPUTF8 ("EAI") specs strongly recommend against making that conversion if the local-part is all-ASCII. When the local part is all-ASCII, the conversion will allow some valid cases to go through but, over time, it seems likely that those cases will become, percentagewise, less frequent so whether it is worth the effort is somewhat questionable.

FWIW, the above was written in parallel with @nicowilliams's comment rather than after studying it, but his recommendation and mine are not significantly different except for that one marginal case of an ASCII local-part and a non-ASCII (but IDNA2008-valid) domain part.
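
Rule (1) is simple enough to sketch directly. The version below assumes that "octets" means UTF-8 octets and that the address is split at the last "@"; `passesLengthCheck` is a hypothetical name and the sample addresses are made up.

```javascript
const utf8 = new TextEncoder();

// Length of a string in UTF-8 octets, not in characters.
function octets(s) {
  return utf8.encode(s).length;
}

function passesLengthCheck(address) {
  // Assumption: split on the last "@", since a domain cannot contain one.
  const at = address.lastIndexOf("@");
  if (at === -1) return false;
  const local = octets(address.slice(0, at));
  const domain = octets(address.slice(at + 1));
  return local >= 1 && local <= 64 && domain >= 2 && domain <= 255;
}

console.log(passesLengthCheck("пример@example.com")); // true: 12 octets before the "@"
console.log(passesLengthCheck("x@ñ"));                // true: "ñ" is two octets in UTF-8
console.log(passesLengthCheck("@example.com"));       // false: empty local-part
```

Counting octets rather than characters matters for non-ASCII mailboxes: 33 Cyrillic letters already exceed the 64-octet limit.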

Jul 15 '20 17:07 klensin

I should have added, as @vdukhovni more or less points out, that if one is going to try to validate the syntax of the local-part (even all-ASCII local-parts), it is important to actually get it right. As he shows, getting it right is a moderately complicated process, perhaps best left to email systems that are doing those checks anyway (which is what @nicowilliams and I essentially suggest above). But, if one is going to try to do it, it should be done right because halfway attempts (fuzzy approximations) are harmful, including letting some local-parts with invalid syntax through and prohibiting some valid ones.

Jul 15 '20 18:07 klensin

@klensin I'm not sure what you're trying to convince me of. I was offering to help. (Percent-encoding is just part of the MIME type form submission uses by default, it's immaterial. Chrome's Punycode handling is what is encouraged by HTML today. That browsers do incompatible things suggests it might be possible to change the current handling.)
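
To illustrate the point about percent-encoding being a property of the form encoding rather than of the address: `URLSearchParams` uses the same application/x-www-form-urlencoded serialization as a default form submission from a UTF-8 page. The field name `email` is made up for this example.

```javascript
// Serialize the way a default (urlencoded) form submission would.
const body = new URLSearchParams({ email: "x@ñ" }).toString();
console.log(body); // email=x%40%C3%B1

// The receiving side decodes it back to the original characters.
console.log(new URLSearchParams(body).get("email")); // x@ñ
```

The "@" is percent-encoded too, so the encoding says nothing about whether the address itself is ASCII or not.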

Jul 16 '20 06:07 annevk

@annevk I drew an action item (during part of I18N's meeting when @klensin was not available) to propose changes and I'd appreciate your thoughts on how to approach this. Looking at the current text, I guess a question is whether we should attempt to preserve the current behavior for ASCII email addresses (or their LHS/RHS parts) while simultaneously allowing labels that use non-ASCII Unicode? I18N WG participants seem to agree that we don't want to get into deep validation of the address's validity and limit ourselves to "structurally valid" addresses.

Jul 16 '20 15:07 aphillips

Right, e.g., at a minimum we should probably require that the string contains a @ and no surrogates. But currently we also prohibit various types of ASCII labels, e.g., quoted ones, and allowing those to now go through might not be great either.

Jul 16 '20 15:07 annevk

It certainly has to be valid Unicode (e.g., no unpaired UTF-16 surrogates, no invalid UTF-8 bytes), and follow the rules like no unpaired quotes. Restricting it more than that is not likely to help.

Jul 16 '20 16:07 jrlevine

Even if people are just using things that look like email addresses for purposes other than sending email, do you really want to allow unnormalized Unicode or leading or trailing white space in the LHS? For sites that use email addresses as user IDs, changing HTML validation to allow entry of different sequences that are visually identical opens up new security concerns.

Jul 16 '20 18:07 masinter

@masinter Absolutely this must allow unnormalized Unicode because users cannot be counted on to produce normalized Unicode. Regarding whitespace, trimming it is fine. I don't think there are any security concerns regarding client-side validation -- if there is a site where relaxing client-side validation of email addresses creates a security concern, then the site is already vulnerable.
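
For readers who have not run into it, this is what "different sequences that are visually identical" looks like in practice. The strings are made up, and whether a site compares identifiers in normalized form is its own decision; the sketch only shows the difference.

```javascript
const composed = "jos\u00E9";    // "é" as a single code point (NFC form)
const decomposed = "jose\u0301"; // "e" followed by U+0301 COMBINING ACUTE ACCENT

// Both render as "josé" but are different strings of different lengths.
console.log(composed === decomposed);            // false
console.log(composed.length, decomposed.length); // 4 5

// Normalizing both sides to the same form makes them compare equal.
console.log(composed.normalize("NFC") === decomposed.normalize("NFC")); // true
```

A user's keyboard or input method decides which form gets typed, which is why the input element cannot assume normalized text.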

Jul 16 '20 18:07 nicowilliams

Mailbox names are pretty much arbitrary UTF-8. They don't have to be normalized; for that matter, one can be a sequence of ZWJ and Arabic combining marks. While I agree that no sensible mail provider would use names like that, we don't get to tell people to be sensible. White space has to be quoted, so unquoted trailing whitespace isn't valid, although unquoted NBSP and NNBSP are.

Jul 16 '20 18:07 jrlevine

@aphillips @annevk See above. Do less validation. Validate only:

Unicode well-formedness
balanced quotes
that there's an @
that the RHS of the @ is a ~~domainname~~ hostname, allowing both U-labels and A-labels (well, that's hard enough to do -- basically, there has to be at least one .)
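
The four checks in this list can be sketched as follows. This is only an illustration under stated assumptions: all function names are hypothetical, the split is on the last "@", the quote check just counts unescaped double quotes, and the hostname check is the crude "at least one dot, no empty labels" version rather than real U-label validation.

```javascript
function isWellFormedUnicode(s) {
  // for...of iterates by code point; a lone surrogate shows up as a single
  // UTF-16 code unit in the range U+D800..U+DFFF.
  for (const ch of s) {
    if (ch.length === 1) {
      const cu = ch.charCodeAt(0);
      if (cu >= 0xd800 && cu <= 0xdfff) return false;
    }
  }
  return true;
}

function hasBalancedQuotes(s) {
  // Count double quotes that are not escaped with a backslash.
  let quotes = 0;
  for (let i = 0; i < s.length; i++) {
    if (s[i] === "\\") { i++; continue; }
    if (s[i] === '"') quotes++;
  }
  return quotes % 2 === 0;
}

function looksLikeAddress(input) {
  // Trim ASCII whitespace only; other characters are left alone.
  const value = input.replace(/^[\t\n\f\r ]+|[\t\n\f\r ]+$/g, "");
  if (!isWellFormedUnicode(value)) return false;
  const at = value.lastIndexOf("@");
  if (at < 1) return false; // no "@", or empty LHS
  const lhs = value.slice(0, at);
  const rhs = value.slice(at + 1);
  if (!hasBalancedQuotes(lhs)) return false;
  // Crude hostname check: at least two non-empty labels without
  // whitespace, "@" or quotes. U-labels and A-labels both pass.
  const labels = rhs.split(".");
  return labels.length >= 2 && labels.every((l) => /^[^\s@"]+$/u.test(l));
}

console.log(looksLikeAddress("пример@Бориса.РФ"));         // true
console.log(looksLikeAddress('"unbalanced@example.com'));  // false
console.log(looksLikeAddress("a\uD800@example.com"));      // false
```

Nothing here inspects which characters the LHS contains beyond quotes, which is the point of the list: Unicode is allowed throughout.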

In all cases allow Unicode throughout.

Trim whitespace, sure.

Anything else?

Jul 16 '20 18:07 nicowilliams

the RHS has to be a hostname, which limits the characters to the ones valid in U-labels

Jul 16 '20 18:07 jrlevine

Last time around the consensus seemed to be that EAI input fields should be marked as unicode or eai or the like, since it'll be a while before all mail systems handle EAI.

I think there are likely a large number of sites that use <input type="email"> and aren't prepared to deal with spoofing, normalization or untypable addresses injected. Rather than introduce that kind of vulnerability by changing what type="email" means for them, make adding EAI support an explicit step.

Jul 16 '20 19:07 masinter

@masinter Users don't distinguish between entering [email protected] and персон@еџампле.ру when using email. If we create indistinguishable input boxes for this, users and content authors will be confused by the difference. It creates another barrier to more-widespread adoption of IDN and SMTPUTF8. The end-to-end folks have been pestering us (I18N) for years about this. Since browsers are inconsistent anyway and users need to process the values they are sent (which already have spoofing or other garbage injection possibilities), this is an opportunity to be done with the problem.

Would an alternative be to add a "legacy" attribute?

@nicowilliams foo@localhost doesn't have a dot. That's one reason (among several) that the current regex makes * ('.' label) on RHS optional.

Jul 16 '20 20:07 aphillips

@aphillips Really, users input foo@localhost into these elements? Fine.

I agree with you regarding not wanting to type EAI vs. not-EAI. Users don't and shouldn't have to know.

@masinter

I think there are likely a large number of sites that use <input type="email"> and aren't prepared to deal with spoofing, normalization or untypable addresses injected. Rather than introduce that kind of vulnerability by changing what type="email" means for them, make adding EAI support an explicit step.

Again, if relaxing client-side validation "causes" a security problem, then the security problem already exists. Relaxing client-side validation cannot cause a security problem on the server side!

Also, the server-side that gets a form with email address inputs should NOT normalize the mailbox part. Leave that to mail software, specifically the last hop MTA should normalize the mailbox part if at all (it could use form-insensitive matching of mailbox names). The mailbox part is for all intents and purposes opaque to all relays.

Jul 16 '20 20:07 nicowilliams

I forgot -- form fields (including those with type="email") are encoded using the charset of the form, not necessarily UTF-8, so anyone trying to enter an EAI address into an input field in a non-UTF-8 form will have trouble because there is no way to represent the characters.

Jul 16 '20 22:07 masinter
