protovalidate [Feature Request] Rigorously specify e-mail address validation

Open jchadwick-buf opened this issue 7 months ago • 0 comments

Feature description: E-mail address validation is underspecified and underdocumented. protovalidate implementations in different languages use very different e-mail parsing code paths, which leads to different validation results in edge cases. E-mail validation should be rigorously specified and implemented so that a given address validates the same way in every language.

Furthermore, e-mail validation should be as unsurprising as possible, so we should lean on existing industry standards wherever we can, particularly ones that reflect real-world usage and don't hinder, for example, internationalization.

Also, the conformance test suite should be expanded to ensure that the edge cases are consistent across implementations.

Proposed implementation or solution: I suggest we use the e-mail validation specified in the WHATWG HTML standard, for the following reasons:

It is the validation format that web browsers have adopted for <input type="email">.
RFC 5322, the standard that authoritatively defines e-mail address formatting, is woefully out of touch with real-world implementations.
Standards that build on RFC 5322, like RFC 6531, which adds support for internationalized e-mail addresses, are often incomplete and ambiguous, and are often not fully standardized themselves.
We can lean on regex engines to implement it if we want. Chrome uses it this way, and it is a simple enough regex that it should work fine in more restrictive engines like re2. Since the grammar is very simple and has few productions, hand-written parsers should also be very easy to implement.
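As a rough illustration of the regex route, the sketch below wraps the pattern that the HTML standard publishes for <input type="email"> in a small Python helper. The helper name is_valid_email_whatwg is hypothetical and not part of protovalidate, and the pattern should be double-checked against the current text of the standard before anyone relies on it.

```python
import re

# Pattern modeled on the one given in the WHATWG HTML standard for
# <input type="email">. It is ASCII-only and uses no backreferences or
# lookarounds, so it should also be expressible in engines such as re2.
_WHATWG_EMAIL = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"                  # local-part: atext or "."
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"    # first label, at most 63 chars
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"  # further labels
)


def is_valid_email_whatwg(value: str) -> bool:
    """Hypothetical helper: True if value matches the HTML-standard pattern."""
    # fullmatch anchors both ends, so a trailing newline is rejected too.
    return _WHATWG_EMAIL.fullmatch(value) is not None
```

Note that this pattern accepts leading, trailing, and consecutive dots in the local-part and rejects quoted strings, which is one of the places where it deliberately departs from RFC 5322.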

I did some exploration into what it would look like to implement RFC 5322-based e-mail address validation, which I will provide here:

Exploring RFC 5322 for e-mail address validation

RFC 5322 rules

Here is a summary of the grammar productions relevant to the local-part of an e-mail address, according to RFC 5322. Per our current validation, productions beginning with 'obs-' should probably be disallowed, as well as productions allowing folding whitespace within e-mail addresses.

We'll ignore the domain part, since protovalidate already has an approach to validating hostnames anyway.

; rfc5234 rules
ALPHA           =   %x41-5A / %x61-7A  ; A-Z / a-z
CR              =   %x0D               ; carriage return
LF              =   %x0A               ; linefeed
CRLF            =   CR LF              ; Internet standard newline
DIGIT           =   %x30-39            ; 0-9
DQUOTE          =   %x22               ; " (Double Quote)
HTAB            =   %x09               ; horizontal tab
SP              =   %x20
VCHAR           =   %x21-7E            ; visible (printing) characters
WSP             =   SP / HTAB          ; white space
; folding whitespace
obs-FWS         =   1*WSP *(CRLF 1*WSP)
FWS             =   ([*WSP CRLF] 1*WSP) /  obs-FWS
ctext           =   %d33-39 /          ; Printable US-ASCII
                    %d42-91 /          ;  characters not including
                    %d93-126 /         ;  "(", ")", or "\"
                    obs-ctext
ccontent        =   ctext / quoted-pair / comment
comment         =   "(" *([FWS] ccontent) [FWS] ")"
CFWS            =   (1*([FWS] comment) [FWS]) / FWS
; atom
atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                    "!" / "#" /        ;  characters not including
                    "$" / "%" /        ;  specials.  Used for atoms.
                    "&" / "'" /
                    "*" / "+" /
                    "-" / "/" /
                    "=" / "?" /
                    "^" / "_" /
                    "`" / "{" /
                    "|" / "}" /
                    "~"
atom            =   [CFWS] 1*atext [CFWS]
; quoted string
qtext           =   %d33 /             ; Printable US-ASCII
                    %d35-91 /          ;  characters not including
                    %d93-126 /         ;  "\" or the quote character
                    obs-qtext
quoted-pair     =   ("\" (VCHAR / WSP)) / obs-qp
qcontent        =   qtext / quoted-pair
quoted-string   =   [CFWS]
                    DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                    [CFWS]
word            =   atom / quoted-string
; obsolete productions
obs-NO-WS-CTL   =   %d1-8 /            ; US-ASCII control
                    %d11 /             ;  characters that do not
                    %d12 /             ;  include the carriage
                    %d14-31 /          ;  return, line feed, and
                    %d127              ;  white space characters
obs-ctext       =   obs-NO-WS-CTL
obs-qtext       =   obs-NO-WS-CTL
obs-qp          =   "\" (%d0 / obs-NO-WS-CTL / LF / CR)
obs-local-part  =   word *("." word)
; dot-atom
dot-atom-text   =   1*atext *("." 1*atext)
dot-atom        =   [CFWS] dot-atom-text [CFWS]
; local part
local-part      =   dot-atom / quoted-string / obs-local-part

Simplified RFC 5322 Rules

Here's a version of the above rules with whitespace disallowed outside of quotes and escapes and with obsolete productions removed.

; rfc5234 rules
ALPHA           =   %x41-5A / %x61-7A  ; A-Z / a-z
CR              =   %x0D               ; carriage return
LF              =   %x0A               ; linefeed
CRLF            =   CR LF              ; Internet standard newline
DIGIT           =   %x30-39            ; 0-9
DQUOTE          =   %x22               ; " (Double Quote)
HTAB            =   %x09               ; horizontal tab
SP              =   %x20
VCHAR           =   %x21-7E            ; visible (printing) characters
WSP             =   SP / HTAB          ; white space
; folding whitespace
FWS             =   ([*WSP CRLF] 1*WSP)
ctext           =   %d33-39 /          ; Printable US-ASCII
                    %d42-91 /          ;  characters not including
                    %d93-126           ;  "(", ")", or "\"
ccontent        =   ctext / quoted-pair / comment
comment         =   "(" *([FWS] ccontent) [FWS] ")"
CFWS            =   (1*([FWS] comment) [FWS]) / FWS
; atom
atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                    "!" / "#" /        ;  characters not including
                    "$" / "%" /        ;  specials.  Used for atoms.
                    "&" / "'" /
                    "*" / "+" /
                    "-" / "/" /
                    "=" / "?" /
                    "^" / "_" /
                    "`" / "{" /
                    "|" / "}" /
                    "~"
; quoted string
qtext           =   %d33 /             ; Printable US-ASCII
                    %d35-91 /          ;  characters not including
                    %d93-126           ;  "\" or the quote character
quoted-pair     =   ("\" (VCHAR / WSP))
qcontent        =   qtext / quoted-pair
quoted-string   =   DQUOTE *([FWS] qcontent) [FWS] DQUOTE
; dot-atom
dot-atom        =   1*atext *("." 1*atext)
; local part
local-part      =   dot-atom / quoted-string
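
To make the simplified dot-atom production concrete, here is a minimal Python sketch. It relies on the observation that 1*atext *("." 1*atext) is equivalent to "split on '.', and every piece is a non-empty run of atext". The names ATEXT and is_dot_atom are illustrative only.

```python
import string

# atext from the grammar above: ALPHA / DIGIT / the listed punctuation.
ATEXT = frozenset(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")


def is_dot_atom(local_part: str) -> bool:
    """True if local_part matches dot-atom = 1*atext *("." 1*atext)."""
    return all(
        len(piece) > 0 and all(ch in ATEXT for ch in piece)
        for piece in local_part.split(".")
    )
```

For example, is_dot_atom("john.doe+tag") holds, while ".john", "john..doe", and "john doe" are all rejected because they contain an empty piece or a non-atext character.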

Regular expression translation

It is possible to express this entire grammar using regular expressions, since it doesn't need backtracking or recursion.

; quoted string
qtext           =   /[\x21\x23-\x5b\x5d-\x7e]/
quoted-pair     =   /\\[ \t\x21-\x7e]/
qcontent        =   /[\x21\x23-\x5b\x5d-\x7e]|\\[ \t\x21-\x7e]/
quoted-string   =   /"((([ \t]*\r\n)?[ \t]+)?([\x21\x23-\x5b\x5d-\x7e]|\\[ \t\x21-\x7e]))*(([ \t]*\r\n)?[ \t]+)?"/
; dot-atom (the "-" is placed last in the class so it is a literal, not a range)
atext           =   /[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]/
dot-atom        =   /[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*/
; local part
local-part      =   /[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"((([ \t]*\r\n)?[ \t]+)?([\x21\x23-\x5b\x5d-\x7e]|\\[ \t\x21-\x7e]))*(([ \t]*\r\n)?[ \t]+)?"/
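
A translation like this is easier to check when it is assembled from named pieces. The following Python sketch builds a local-part pattern from the simplified grammar; the constant names are made up for illustration. Two details are easy to get wrong: the hyphen has to be last in the atext class (otherwise "+-/" is read as a range that also admits "," and "."), and CRLF has to be spelled \r\n rather than [\r\n].

```python
import re

# Building blocks, named after the grammar productions they translate.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"            # "-" last, so it is literal
FWS = r"(?:(?:[ \t]*\r\n)?[ \t]+)"                    # [*WSP CRLF] 1*WSP
QCONTENT = r"(?:[\x21\x23-\x5b\x5d-\x7e]|\\[ \t\x21-\x7e])"  # qtext / quoted-pair

DOT_ATOM = rf"{ATEXT}+(?:\.{ATEXT}+)*"
QUOTED_STRING = rf'"(?:{FWS}?{QCONTENT})*{FWS}?"'

# local-part = dot-atom / quoted-string
LOCAL_PART = re.compile(rf"(?:{DOT_ATOM}|{QUOTED_STRING})")


def is_local_part(value: str) -> bool:
    """True if the whole of value is a valid local-part."""
    return LOCAL_PART.fullmatch(value) is not None
```

With this pattern, is_local_part('"john doe"') and is_local_part("john.doe") hold, while "john,doe" and "john..doe" are rejected.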

Pseudo-code form

The above regular expression is unreadable and probably pretty slow. Here is the same grammar parsed with Go-like pseudo-code.

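matchLocalPart returns the remainder of the email address after the '@' if the local-part is valid, or an empty string if it is not. (An address with nothing after the '@' also yields an empty string, which is acceptable here because an empty domain is invalid anyway.)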

Note that RFC 5322 does not allow the local-part to contain non-US-ASCII characters. RFC 6531 extends the syntax to allow non-ASCII characters, but it is still only a Proposed Standard. Either way, we can work at the byte level since we do not care about codepoints above 0x7F. (If we want to adopt the RFC 6531 behavior at any point, I believe we just need to allow bytes >= 0x80 in qtext and atext.)

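import "strings"

// matchLocalPart validates the local-part at the start of email and returns
// the text after the '@', or "" if the local-part is invalid.
func matchLocalPart(email string) string {
	if len(email) == 0 {
		return ""
	}
	switch {
	case email[0] == '"':
		email = matchQuotedString(email)
	case isAText(email[0]):
		email = matchDotAtom(email)
	default:
		// Neither production can start here; this also rejects an empty local-part.
		return ""
	}
	if len(email) == 0 || email[0] != '@' {
		return ""
	}
	return email[1:]
}

// matchQuotedString consumes a quoted-string and returns what follows the
// closing quote, or "" on failure.
func matchQuotedString(email string) string {
	email = email[1:] // skip the opening DQUOTE
	for {
		// [FWS] may precede each qcontent and the closing DQUOTE.
		var ok bool
		if email, ok = skipFWS(email); !ok || len(email) == 0 {
			return ""
		}
		switch email[0] {
		case '"':
			return email[1:]
		case '\\':
			// quoted-pair: the backslash must be followed by VCHAR or WSP.
			if email = email[1:]; len(email) == 0 || !isQuotedPair(email[0]) {
				return ""
			}
		default:
			if !isQText(email[0]) {
				return ""
			}
		}
		email = email[1:]
	}
}

// skipFWS consumes an optional FWS = [*WSP CRLF] 1*WSP. It reports false if
// a CRLF is not followed by at least one WSP. A bare CR or LF is left in
// place, so the caller rejects it as a non-qtext byte.
func skipFWS(s string) (string, bool) {
	s = strings.TrimLeft(s, " \t")
	if !strings.HasPrefix(s, "\r\n") {
		return s, true
	}
	if s = s[2:]; len(s) == 0 || !isWSP(s[0]) {
		return "", false
	}
	return strings.TrimLeft(s, " \t"), true
}

// matchDotAtom consumes a dot-atom and returns the input starting at the '@',
// or "" on failure.
func matchDotAtom(email string) string {
	for {
		if len(email) == 0 {
			return ""
		}
		switch email[0] {
		case '@':
			return email
		case '.':
			// A dot must be followed by at least one atext byte.
			if email = email[1:]; len(email) == 0 {
				return ""
			}
			fallthrough
		default:
			if !isAText(email[0]) {
				return ""
			}
			email = email[1:]
		}
	}
}

func isAText(b byte) bool {
	return (b >= 'a' && b <= 'z') ||
		(b >= 'A' && b <= 'Z') ||
		(b >= '0' && b <= '9') ||
		b == '!' || b == '#' || b == '$' || b == '%' ||
		b == '&' || b == '*' || b == '+' || b == '-' ||
		b == '/' || b == '=' || b == '?' || b == '^' ||
		b == '_' || b == '`' || b == '{' || b == '|' ||
		b == '}' || b == '~' || b == '\''
}

func isQText(b byte) bool {
	return b == '!' || (b >= '#' && b <= '[') || (b >= ']' && b <= '~')
}

func isQuotedPair(b byte) bool {
	return b == ' ' || b == '\t' || (b >= 0x21 && b <= 0x7e)
}

// isWSP matches WSP = SP / HTAB only; CR and LF are handled by skipFWS.
func isWSP(b byte) bool {
	return b == ' ' || b == '\t'
}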

Here is a similar implementation in Python. It is written to operate on a memoryview over the encoded bytes of the address, since slicing a memoryview does not copy the underlying data the way slicing a str or bytes object does. Unlike the Go version, this version reports errors by raising exceptions.

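from typing import Sequence

_AT = ord('@')
_DQUOTE = ord('"')
_BACKSLASH = ord('\\')
_PERIOD = ord('.')
_CR = ord('\r')
_LF = ord('\n')

def _match_local_part(email: Sequence[int]) -> Sequence[int]:
    """Validate the local-part and return everything after the '@'."""
    if len(email) == 0:
        raise ValueError('Empty address')
    if email[0] == _DQUOTE:
        email = _match_quoted_string(email)
    elif _is_atext(email[0]):
        email = _match_dot_atom(email)
    else:
        # Neither production can start here; also rejects an empty local-part.
        raise ValueError('Invalid local part')
    if len(email) == 0 or email[0] != _AT:
        raise ValueError('Invalid address')
    return email[1:]

def _match_quoted_string(email: Sequence[int]) -> Sequence[int]:
    """Consume a quoted-string and return what follows the closing quote."""
    email = email[1:]  # skip the opening DQUOTE
    while True:
        # [FWS] may precede each qcontent and the closing DQUOTE.
        email = _skip_fws(email)
        if len(email) == 0:
            raise ValueError('Unexpected end of address')
        elif email[0] == _DQUOTE:
            return email[1:]
        elif email[0] == _BACKSLASH:
            email = email[1:]
            if len(email) == 0:
                raise ValueError('Unexpected end of address')
            if not _is_quoted_pair(email[0]):
                raise ValueError('Invalid quoted pair')
        elif not _is_qtext(email[0]):
            raise ValueError('Invalid local part')
        email = email[1:]

def _skip_fws(email: Sequence[int]) -> Sequence[int]:
    """Consume an optional FWS = [*WSP CRLF] 1*WSP."""
    i = 0
    while i < len(email) and _is_wsp(email[i]):
        i += 1
    if i + 1 < len(email) and email[i] == _CR and email[i + 1] == _LF:
        i += 2
        if i >= len(email) or not _is_wsp(email[i]):
            raise ValueError('Line fold not followed by whitespace')
        while i < len(email) and _is_wsp(email[i]):
            i += 1
    # A bare CR or LF is left in place and rejected by the caller.
    return email[i:]

def _match_dot_atom(email: Sequence[int]) -> Sequence[int]:
    """Consume a dot-atom and return the input starting at the '@'."""
    while True:
        if len(email) == 0:
            raise ValueError('Unexpected end of address')
        if email[0] == _AT:
            return email
        elif email[0] == _PERIOD:
            # A dot must be followed by at least one atext byte.
            email = email[1:]
            if len(email) == 0:
                raise ValueError('Unexpected end of address')
        if not _is_atext(email[0]):
            raise ValueError('Invalid character')
        email = email[1:]

def _is_atext(b: int) -> bool:
    return (
        (b >= 0x61 and b <= 0x7a) or
        (b >= 0x41 and b <= 0x5a) or
        (b >= 0x30 and b <= 0x39) or
        b == 0x21 or b == 0x23 or b == 0x24 or b == 0x25 or
        b == 0x26 or b == 0x27 or b == 0x2a or b == 0x2b or
        b == 0x2d or b == 0x2f or b == 0x3d or b == 0x3f or
        b == 0x5e or b == 0x5f or b == 0x60 or b == 0x7b or
        b == 0x7c or b == 0x7d or b == 0x7e
    )

def _is_qtext(b: int) -> bool:
    return b == 0x21 or (b >= 0x23 and b <= 0x5b) or (b >= 0x5d and b <= 0x7e)

def _is_quoted_pair(b: int) -> bool:
    return b == 0x20 or b == 0x09 or (b >= 0x21 and b <= 0x7e)

def _is_wsp(b: int) -> bool:
    # WSP = SP / HTAB only; CR and LF are handled by _skip_fws.
    return b == 0x20 or b == 0x09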
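
Returning to the RFC 6531 note above, the relaxation could be as small as one extra condition in the byte classifier. The sketch below is a standalone illustration, not a change to the parsers above: the allow_non_ascii flag and the function names are hypothetical.

```python
import string

# ASCII atext bytes, as in RFC 5322.
_ATEXT_ASCII = frozenset(
    (string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~").encode("ascii")
)


def is_atext(b: int, allow_non_ascii: bool = False) -> bool:
    """Byte-level atext check; optionally treat bytes >= 0x80 as atext."""
    return b in _ATEXT_ASCII or (allow_non_ascii and b >= 0x80)


def is_dot_atom_bytes(local_part: bytes, allow_non_ascii: bool = False) -> bool:
    """dot-atom over raw bytes, with the optional non-ASCII relaxation."""
    return all(
        len(piece) > 0 and all(is_atext(b, allow_non_ascii) for b in piece)
        for piece in local_part.split(b".")
    )
```

With this, "jörg".encode("utf-8") is rejected by default and accepted with allow_non_ascii=True. Accepting any byte >= 0x80 does not check that the input is well-formed UTF-8, so a real implementation would presumably validate the encoding separately.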

Summary

Implementing RFC 5322 rules in a readable fashion is doable in most target languages using a hand-written parser. It can be done in under 100 lines.

However, while this parser is strict enough to adhere to RFC 5322, it has the caveat that it may be both more strict and more lenient than some real world mail servers in some situations, so it is far from ideal.

An implementation of the e-mail rule in the WHATWG HTML standard would be trivial. The local-part of the HTML version is close to the RFC 5322 dot-atom-text production: it drops quoted strings, comments, and whitespace, although unlike dot-atom-text it also accepts leading, trailing, and consecutive dots. The matchDotAtom/_match_dot_atom pseudo-code examples should therefore be a near match (after relaxing the dot handling and allowing codepoints above 0x7f in atext). Meanwhile, the hostname portion of the e-mail in the WHATWG HTML standard also seems to be a near-exact match for the existing hostname validation that we already use for e-mail.
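
To give a sense of how small a hand-written version could be, here is a Python sketch of the grammar in the HTML standard, 1*( atext / "." ) "@" label *( "." label ), where each label is at most 63 characters of letters, digits, and hyphens and neither starts nor ends with a hyphen. The function names are hypothetical, and this checks the domain directly rather than calling protovalidate's existing hostname validation.

```python
import string

_LET_DIG = frozenset(string.ascii_letters + string.digits)
_ATEXT = _LET_DIG | frozenset("!#$%&'*+-/=?^_`{|}~")


def _is_label(label: str) -> bool:
    """A label is 1-63 chars of let-dig or '-', not starting or ending with '-'."""
    return (
        1 <= len(label) <= 63
        and label[0] in _LET_DIG
        and label[-1] in _LET_DIG
        and all(ch in _LET_DIG or ch == "-" for ch in label)
    )


def is_valid_email_html(value: str) -> bool:
    """Hypothetical hand-written check for the HTML-standard e-mail grammar."""
    # The local-part cannot contain '@', so the first '@' is the separator.
    local, at, domain = value.partition("@")
    if not at or not local:
        return False
    if not all(ch in _ATEXT or ch == "." for ch in local):
        return False
    return all(_is_label(label) for label in domain.split("."))
```

For instance, is_valid_email_html("user@example.com") holds, while "user@example..com", "user@-example.com", and "@example.com" are rejected.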

Aug 05 '24 23:08 jchadwick-buf
