standards-and-practices
Create a standard email field verification Regular Expression (or find and verify one)
https://en.wikipedia.org/wiki/Email_address
There are some crazy email addresses allowed in RFC 5321 and RFC 5322. Here is the above article's set of rules, along with examples of valid and invalid addresses.
Local-part
The local-part of the email address may use any of these [[ASCII]] characters:
- uppercase and lowercase [[Basic Latin (Unicode block)|Latin]] letters A to Z and a to z;
- digits 0 to 9;
- special characters !#$%&'*+-/=?^_`{|}~;
- dot ., provided that it is not the first or last character unless quoted, and provided also that it does not appear consecutively unless quoted (e.g. [email protected] is not allowed but "John..Doe"@example.com is allowed);
Note that some mail servers wildcard local parts, typically the characters following a plus and less often the characters following a minus, so fred+bah@domain and fred+foo@domain might end up in the same inbox as fred+@domain or even as fred@domain. This can be useful for tagging emails for sorting, see below, and for spam control. Braces { and } are also used in that fashion, although less often.
- space and "(),:;<>@[] characters are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash);
- comments are allowed with parentheses at either end of the local-part; e.g. john.smith(comment)@example.com and (comment)[email protected] are both equivalent to [email protected].
In addition to the above ASCII characters, international characters above U+007F, encoded as [[UTF-8]], are permitted by RFC 6531, though even mail systems that support SMTPUTF8 and 8BITMIME may restrict which characters to use when assigning local-parts.
Domain
The [[domain name]] part of an email address has to conform to strict guidelines: it must match the requirements for a [[hostname]], a list of dot-separated [[DNS]] labels, each label being limited to a length of 63 characters and consisting of:{{rp|§2}}
- uppercase and lowercase [[Basic Latin (Unicode block)|Latin]] letters A to Z and a to z;
- digits 0 to 9, provided that top-level domain names are not all-numeric;
- hyphen -, provided that it is not the first or last character.
This rule is known as the ''LDH rule'' (letters, digits, hyphen). In addition, the domain may be an [[IP address]] literal, surrounded by square brackets [], such as jsmith@[192.168.2.1] or jsmith@[IPv6:2001:db8::1], although this is rarely seen except in [[email spam]]. [[Internationalized domain name]]s (which are encoded to comply with the requirements for a [[hostname]]) allow for presentation of non-ASCII domains. In mail systems compliant with RFC 6531 and RFC 6532 an email address may be encoded as [[UTF-8]], both a local-part as well as a domain name.
Comments are allowed in the domain as well as in the local-part; for example, john.smith@(comment)example.com and [email protected](comment) are equivalent to [email protected].
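The domain rules above can be checked without a regular expression for the whole address. The following is a minimal TypeScript sketch, not a full RFC parser: the function name `isLdhHostname` is hypothetical, and it deliberately ignores IP address literals, comments, and internationalized names.

```typescript
// Hypothetical helper: checks a domain against the LDH rule described above.
// Assumption: IP literals, comments, and non-ASCII names are out of scope.
function isLdhHostname(domain: string): boolean {
  const labels = domain.split(".");

  const labelsOk = labels.every(
    (label) =>
      // Each DNS label is limited to 63 characters and may not be empty.
      label.length >= 1 &&
      label.length <= 63 &&
      // Letters, digits, hyphen; hyphen may not be first or last.
      /^[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?$/.test(label)
  );

  // Top-level domain names may not be all-numeric.
  const tld = labels[labels.length - 1];
  return labelsOk && !/^[0-9]+$/.test(tld);
}

console.log(isLdhHostname("example.com"));
console.log(isLdhHostname("-example.com"));
console.log(isLdhHostname("example.123"));
```

Note that a dotless name such as `mailserver1` satisfies these rules, which matches the `admin@mailserver1` case discussed in the examples.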
Examples
Valid email addresses
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected] (may go to [email protected] inbox depending on mail server)
[email protected] (one-letter local-part)
"very.(),:;<>[]".VERY."very@\ "very".unusual"@strange.example.com
[email protected]
admin@mailserver1 (local domain name with no [[Top-level domain|TLD]], although ICANN [https://www.icann.org/news/announcement-2013-08-30-en highly discourages] dotless email addresses)
#!$%&'*+-/=?^_`{}|[email protected]
"()<>[]:,;@\"!#$%&'-/=?^_`{}| ~.a"@example.org
[email protected] (see the [[List of Internet top-level domains]])
user@[2001:DB8::1]
" "@example.org (space between the quotes)
Invalid email addresses
Abc.example.com (no @ character)
A@b@[email protected] (only one @ is allowed outside quotation marks)
a"b(c)d,e:f;g<h>i[j\k][email protected] (none of the special characters in this local-part are allowed outside quotation marks)
just"not"[email protected] (quoted strings must be dot separated or the only element making up the local-part)
this is"not\[email protected] (spaces, quotes, and backslashes may only exist when within quoted strings and preceded by a backslash)
this\ still\"not\\[email protected] (even if escaped (preceded by a backslash), spaces, quotes, and backslashes must still be contained by quotes)
1234567890123456789012345678901234567890123456789012345678901234+x@example.com (local part is longer than 64 characters)
[email protected] (double dot before @)
[email protected] (double dot after @)
This is a promising solution if we could figure out a way to standardize it for our projects. https://github.com/django/django/blob/master/django/core/validators.py#L164
Plus lots of examples and resources here: http://emailregex.com/
Stackoverflow answer has a pretty awesome regexp pattern with a state machine diagram: https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression
But considering that different languages have different regexp syntaxes it might be better to designate a validation library for each language we use. For nodejs isemail looks pretty robust: https://github.com/hapijs/isemail/blob/master/test/tests.json
I would like to humbly propose a solution which performs as well as the RFC5322 Official Standard (in my particular test set) but is much easier to understand and verify.
^(?!\.)(?!.*?\.(\.|@))[\w\d.!#$%&'*+\-\/=?^_`{|}~]+@[\w\d.-]+\.[\w\d]{2,}$
^ - start of line
(?!\.) - don't allow the line to start with .
(?!.*?\.(\.|@)) - don't allow consecutive periods, ex. ([email protected]). Also don't allow a period at the end of the local part, ex. ([email protected])
[\w\d.!#$%&'*+\-\/=?^_`{|}~]+ - match one or more letters, numbers, and these special characters: .!#$%&'*+-/=?^_`{|}~
@ - match the literal character @
[\w\d.-]+ - match one or more letter, digit, period (.), or hyphen (-)
\. - match a period (.)
[\w\d]{2,} - match 2 or more letters and numbers
$ - end of line
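To make the breakdown concrete, here is a small TypeScript sketch that wraps the proposed pattern and runs it against a few addresses. The function name `passesProposedPattern` and the sample addresses are illustrative assumptions, not part of the original test set.

```typescript
// The pattern proposed above, copied verbatim into a regex literal.
const proposedEmailPattern =
  /^(?!\.)(?!.*?\.(\.|@))[\w\d.!#$%&'*+\-\/=?^_`{|}~]+@[\w\d.-]+\.[\w\d]{2,}$/;

// Hypothetical wrapper name, used only for this illustration.
function passesProposedPattern(value: string): boolean {
  return proposedEmailPattern.test(value);
}

// Illustrative addresses (assumed for this sketch).
const proposedSamples: string[] = [
  "jane.doe+tag@example.com", // ordinary address with a plus tag
  "john..doe@example.com", // consecutive periods
  ".jane@example.com", // leading period
  "jane.@example.com", // period directly before @
  "admin@mailserver1", // no period in the domain
  "user@[2001:DB8::1]", // IP literal, not supported by this pattern
];

proposedSamples.forEach((address) => {
  console.log(address + " -> " + passesProposedPattern(address));
});
```

Only the first sample passes. The pattern has no length check, so an over-long local part is accepted as long as its characters are allowed.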
This regex can be tested here: https://regex101.com/r/A9jZZ4/4
This is not meant to be a perfect solution, but it should cover 99% of the email addresses Shift3 would expect to deal with, while catching some basic mistakes for user convenience. It does NOT handle extended ASCII / international characters, which RFC 6531 permits.
The following email addresses expectedly pass this validation:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
1234567890123456789012345678901234567890123456789012345678901234+x@example.com
The following email addresses expectedly fail this validation:
[email protected]
@test.com
admin@mailserver1
"()<>[]:,;@\\\"!#$%&'-/=?^_`{}| ~.a"@example.org
user@[2001:DB8::1]
" "@example.org
[email protected]
"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com
Abc.example.com
A@b@[email protected]
a"b(c)d,e:f;g<h>i[j\k][email protected]
just"not"[email protected]
this is"not\[email protected]
this\ still\"not\\[email protected]
[email protected]
[email protected]
[email protected].
I would appreciate it if others would throw some other test cases against this regex and try to break it.
For reference, here is the RFC 5322 Standard I am comparing against.
^(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])$
Found at http://emailregex.com/

But for reals, reading this issue is fantastic. I like the solution proposed at the end, and the amount of testing done against it. 👍
Running through the validation examples from isemail against
^(?!\.)(?!.*?\.(\.|@))[\w\d.!#$%&'*+\-\/=?^_`{|}~]+@[\w\d.-]+\.[\w\d]{2,}$
Most notable is the lack of UTF8 support and hyphen handling.
False positives:
[email protected]
a@abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefg.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefg.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefg.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefg.abcdefghijkl.hijk
[email protected]
[email protected]
[email protected]
False negatives:
ê[email protected]
ñoñó[email protected]
test@\uD800\uD800ñoñó郵件ñoñó郵件.ñoñó郵件ñoñoñó郵件ñoñó.郵件ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.oñó郵件ñoñó郵件ñoñó郵件.商務
Hyphen support I'm not as concerned with, in terms of hitting that balance of simplicity vs. complete accuracy to RFC 5322. A false positive is not a big deal, vs a false negative which would stop a valid user from accessing a service. With that in mind, the false negatives do seem like a problem. How common is UTF8 support with the major email providers? And what percentage of users would hit that use case? If we're talking < 1 %, I would rather just tell a user to use a different email address.
Let me know what you guys think.
Personally, I've known people from multiple people groups in various parts of the world, and as far as I recall, almost all of them used plain ASCII characters in their email addresses, web addresses, and IM'ing. So I don't think UTF-8 support is a big deal.
Frankly, I think it is more important to adopt a library for this concern than to bless a regex to be copied for all projects. Having a small clever regex pattern to stamp out is cool, but it runs afoul of the DRY principle: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself. The argument for simplicity makes more sense if we're the ones maintaining the code, but for something as common as email validation, can we not?
Emoji is another reason to support UTF8: https://medium.com/@zackbloom/i-have-a-unicode-email-address-fbecd630ec12
If we're good at our jobs, our software should live to see a day when UTF-8 is more common in email addresses. Since we're here to address email validation, let's do it so we don't have to again.
I don't disagree. My goal in this particular task was to discover a good front-end validation for email which gives a user immediate feedback to avoid typos, not necessarily to vet and validate all possible correct email addresses (we can leave that to the 3rd party email service).
The issue I see with using someone else's library for this is that we support and develop for many frontend frameworks (Ionic, React, .NET MVC, NativeScript, Xamarin...). One library would not work across all of those. A regex line would.
I imagine this being the beginning of a Shift3 internal library of common functions, which we could build out for all of our primary development. If these things were rolled into our own libraries, we'd be respecting DRY way more than we do nowadays (across projects, not necessarily per individual project).
@zbyte64 I'm definitely open for other suggestions as well. Let me know if there was a particular library you had in mind, or if there is something you're already doing on your projects that you really like.
@michaelachrisco 3 years later and this is still a recurring issue in projects. Now that we have boilerplates to implement a standard, I think this is a good time to resurface this.
Now that we're supporting locale translation in the boilerplates, I think the UTF-8 argument has some more strength behind it.
I suspect for client-side validation we will still be served best by simple and permissive validation, as opposed to strict and technical. What do you think?
Adding that I agree with Justin Schiff's assessment here:
@coreyshuman I would normally agree, but what I'm trying to make clear here is that a complicated email regex is not the preferred pattern for signup or email validation anyway. Attempting to send an email to the address specified is. Providing a permissive regex, or none at all (or just asking the user to enter their email twice), while sending a confirmation email is a 100% reliable way to ensure you end up with a valid email address, and a 100% reliable way to make sure you have no false negatives.
When you run into an "edge case" in your complicated regular expression you have to do the following: find the fix, hope you don't introduce a regression in other untested parts of the regex -> backport to all running applications using the old regex -> make sure all old versions of applications are updated -> etc. etc. etc.
I think that having an email regex may be valuable for things other than sign up fields, but I want it to be clear that in my opinion for sign in/sign up this is not the preferred pattern of validation, nor does it enhance security.
Originally posted by @DropsOfSerenity in https://github.com/Shift3/standards-and-practices/issues/130#issuecomment-541257822
I noticed we do have an example documented in best practices here: https://github.com/Shift3/standards-and-practices/tree/main/best-practices/development-tools/validation#code-example
For this to be a completed standard, we should include a definition for our goal on what should and shouldn't pass this validation. It should also include a set of unit tests to verify that goal.
The current RegEx in the Angular boilerplate is the following:
/^[a-z0-9!#$%&'*+\/=?^_\`{|}~.-]+@[a-z0-9]([a-z0-9-])+(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$/i
For the test sets you provided above, all of the ones that should match do. Of the ones that should fail, the commented-out values below pass instead.
const failingValues: string[] = [
  // '[email protected]', //
  '@test.com',
  // 'admin@mailserver1', //
  `"()<>[]:,;@\\\"!#$%&'-/=?^_\`{}| ~.a"@example.org`,
  'user@[2001:DB8::1]',
  '" "@example.org',
  '[email protected]',
  '"very.(),:;<>[]".VERY."very@\\ "very".unusual"@strange.example.com',
  'Abc.example.com ',
  'A@b@[email protected]',
  'a"b(c)d,e:f;g<h>i[jk][email protected]',
  'just"not"[email protected]',
  'this is"[email protected]',
  'this still"not\\[email protected]',
  // '[email protected]', //
  '[email protected]',
  '[email protected].',
];
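As a rough illustration of why some of those values slip through, the following TypeScript sketch runs the boilerplate pattern quoted above against a few addresses. The function name `passesBoilerplatePattern` and the sample addresses other than `admin@mailserver1`, `@test.com`, and `user@[2001:DB8::1]` are assumptions made for this example.

```typescript
// The Angular boilerplate pattern quoted above, copied verbatim.
const boilerplateEmailPattern =
  /^[a-z0-9!#$%&'*+\/=?^_\`{|}~.-]+@[a-z0-9]([a-z0-9-])+(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$/i;

// Hypothetical wrapper name, used only for this illustration.
function passesBoilerplatePattern(value: string): boolean {
  return boilerplateEmailPattern.test(value);
}

// Nothing in the domain part requires a period, so a dotless domain passes.
console.log(passesBoilerplatePattern("admin@mailserver1")); // true

// The local-part character class includes ".", with no rule against
// consecutive periods (illustrative address).
console.log(passesBoilerplatePattern("john..doe@example.com")); // true

// The first domain label needs at least two characters in this pattern,
// so a one-letter first label is rejected (illustrative address).
console.log(passesBoilerplatePattern("a@b.com")); // false

// These are rejected as expected.
console.log(passesBoilerplatePattern("@test.com")); // false
console.log(passesBoilerplatePattern("user@[2001:DB8::1]")); // false
```

Whether the dotless-domain and double-period cases should be rejected is exactly the kind of goal the standard would need to define.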
I do have unit tests for the validator using the regular expression, but I can add the test sets as follows:
import { FormControl } from '@angular/forms';
// EmailValidation is the boilerplate's validator class; its import path is project-specific.

describe('[Unit] EmailValidation validEmail() Required', () => {
  const emailValidator = EmailValidation.validEmail(true);
  const emailControl = new FormControl('');

  // Addresses the validator is expected to accept.
  const matchingValues: string[] = [
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '[email protected]',
    '1234567890123456789012345678901234567890123456789012345678901234+x@example.com',
  ];

  // Addresses the validator is expected to reject.
  const failingValues: string[] = [
    '@test.com',
    `"()<>[]:,;@\\\"!#$%&'-/=?^_\`{}| ~.a"@example.org`,
    'user@[2001:DB8::1]',
    '" "@example.org',
    '[email protected]',
    '"very.(),:;<>[]".VERY."very@\\ "very".unusual"@strange.example.com',
    'Abc.example.com ',
    'A@b@[email protected]',
    'a"b(c)d,e:f;g<h>i[jk][email protected]',
    'just"not"[email protected]',
    'this is"[email protected]',
    'this still"not\\[email protected]',
    '[email protected]',
    '[email protected].',
  ];

  it(`should return null if value matches a list of values that should work`, () => {
    matchingValues.forEach((value) => {
      emailControl.setValue(value);
      expect(emailValidator(emailControl)).toEqual(null);
    });
  });

  it(`should return { invalidEmail: 'Please enter a valid email.' } if value matches a list of values that should fail`, () => {
    failingValues.forEach((value) => {
      emailControl.setValue(value);
      const expectedValue = {
        invalidEmail: 'Please enter a valid email.',
      };
      expect(emailValidator(emailControl)).toEqual(expectedValue);
    });
  });
});
We can decide if we want to keep the current RegEx, change it, and add the above test values.
Either way, the boilerplate also follows the recommendations that @DropsOfSerenity posted above: it requires confirming the email address and sends an activation email to that account.
@coreyshuman I agree with making validation simple and permissive as you stated. If we get too strict with the REGEX/standard, we may get quite a few false negatives (I remember a few horror projects I worked on in the EDI world). Emojis are now valid in email addresses. It's a strange world we live in.
I also like the example @Karvel shows by adding real email addresses to the unit tests for each of the valid/invalid emails. As time goes on, this list will naturally expand as we find a user with some strange valid email address that we will need to accommodate and we can just add that to the unit test/fix.
Most of the projects I have worked on in the past have stolen or used the default MDN example here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/input/email and called it a day.
/^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
This, of course, leaves in bugs (like https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489) but it does seem to be "good enough" for most.
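For comparison, here is a short TypeScript sketch of how the MDN pattern behaves. The function name `passesMdnPattern` and the sample addresses are assumptions for illustration only. The `{0,61}` quantifier between two required alphanumerics is what caps each domain label at 63 characters.

```typescript
// The MDN example pattern quoted above, joined onto one line.
const mdnEmailPattern =
  /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Hypothetical wrapper name, used only for this illustration.
function passesMdnPattern(value: string): boolean {
  return mdnEmailPattern.test(value);
}

// Builds a domain label of the given length, e.g. label(3) === "aaa".
function label(length: number): string {
  return new Array(length + 1).join("a");
}

console.log(passesMdnPattern("user@" + label(63) + ".com")); // label at the limit
console.log(passesMdnPattern("user@" + label(64) + ".com")); // label over the limit
console.log(passesMdnPattern("user@example-.com")); // label ends with a hyphen
console.log(passesMdnPattern("admin@mailserver1")); // dotless domain
```

The first and last calls return true and the middle two return false, so this pattern is permissive about dotless domains but strict about label shape.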
I feel like we could add unit tests to the examples https://github.com/Shift3/standards-and-practices/tree/main/best-practices/development-tools/validation#code-example but a better place would probably be in the boilerplate projects.
FWIW, I also agree with making validation simple and permissive, and with requiring confirmation emails. I think something like @Karvel's regex or the MDN one @michaelachrisco mentioned would probably work well.