OpaqueMail icon indicating copy to clipboard operation
OpaqueMail copied to clipboard

"Found the next address, delimited by a comma"

Open Kermer opened this issue 7 years ago • 10 comments

MailAddressCollection.Parse fails to parse address when it contains comma in the display name. It splits single address into multiple ones which leads to having some invalid address(es) (first part of display name) and valid address with incorrect displayname. image image In the example above it's From address, so when I use MailMessage.From I get the first, invalid, address from the 1st screenshot.

Kermer avatar Jan 03 '18 12:01 Kermer

I can add that the raw "From" header looks like this image

Kermer avatar Jan 03 '18 12:01 Kermer

Seems like the parser is parsing mail addresses after decoding the header which is causing this sort of problem (unquoted commas in address headers are meant to split addresses).

The parser should be fixed to parse the headers and then decode the names.

jstedfast avatar Jan 03 '18 12:01 jstedfast

The patch is wrong :-\

Don't decode the address until after you have completely parsed it and determined which part is the name and which part is the address and then you only want to decode the name portion.

Also: yikes, that address parser code is very... brute force. It should really be token-based like the parser I wrote in MimeKit.

Main loop for parsing address lists: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddressList.cs#L546

Logic for parsing an individual address (used by above loop): https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddress.cs#L555

You'll also note that MimeKit's parser handles unquoted commas in the name portion of the address as well (see https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddress.cs#L623).

jstedfast avatar Jan 04 '18 09:01 jstedfast

Keep in mind that the rfc2047 encoded-word token, once decoded, might contain <>'s or @'s or any other character that will screw up the logic in the current MailAddressCollection.Parse() method.

jstedfast avatar Jan 04 '18 09:01 jstedfast

For anyone that wants to try a different approach to my parser in MimeKit, you could always try parsing the address list in reverse. This approach probably helps simplify the parser logic a bit because parsing forward makes it difficult to know what the tokens belong to (is it the name token? or is it the local-part of an addr-spec? hard to know until I consume a few more tokens...).

For example, consider the following BNF grammar:

address         =       mailbox / group
mailbox         =       name-addr / addr-spec
name-addr       =       [display-name] angle-addr
angle-addr      =       [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr
group           =       display-name ":" [mailbox-list / CFWS] ";"
                        [CFWS]
display-name    =       phrase
word            =       atom / quoted-string
phrase          =       1*word / obs-phrase
addr-spec       =       local-part "@" domain
local-part      =       dot-atom / quoted-string / obs-local-part
domain          =       dot-atom / domain-literal / obs-domain
obs-local-part  =       word *("." word)

Now consider the following email address: "Joe Example" <[email protected]>

The first token you read will be "Joe Example" and you might think that that token indicates that it is the display name, but it doesn't. All you know is that you've got a quoted-string token. A quoted-string can be part of a phrase or it can be (a part of) the local-part of the address itself. You must read at least 1 more token before you'll be able to figure out what it actually is (obs-local-part makes things slightly more difficult). In this case, you'll get a '<' which indicates the start of an angle-addr, allowing you to assume that the quoted-string you just got is indeed the display-name.

If, however, you parse the address in reverse, things become a little simpler because you know immediately what to expect the next token to be a part of.

It's been an approach that I've been meaning to try, but since my current parser in MimeKit works well, I haven't been able to summon the enthusiasm to rewrite it (as the old saying goes, "if it ain't broke, don't fix it").

jstedfast avatar Jan 04 '18 10:01 jstedfast

It may be a good idea to test against all of the address examples in my unit tests as well: https://github.com/jstedfast/MimeKit/blob/master/UnitTests/InternetAddressListTests.cs

For example, the patch here will likely break for addresses like this:

To: "Worthington, William" <[email protected]>

You can't just assume that all commas split addresses.

jstedfast avatar Jan 04 '18 10:01 jstedfast

With the newest commit (forget the first one) from the PR, but sadly tested only using Mozilla Thunderbird client, I tested out both encoded (=?iso-8859-2? [...] ?= <[email protected]>) and unencoded (" [...] " <[email protected]>) addresses.

I gotta agree that the Parse method is a bit messy, but after that fix it seems to work fine for me. e.g.

  1. To: "Worthington, William" <[email protected]> Was and is working fine, as it's not encoded, so symbols between quotes are not threated as special symbols
  2. To: =?utf-8?Q?Worthington=2C_William?= <[email protected]> (decoded: Worthington, William <[email protected]>) When it doesn't have quotes around it (which is usually the case in encoded names) - earlier it would be split into Worthington and William <[email protected] (thanks to that comma). If it would be From instead of To you would get invalid address on MailMessage.From With this fix it's now decoded to "Worthington, William" <[email protected]> and after that we basically go back to point 1.

I'm saw some emails encoded as =?iso-8859-2?B? and they seemed to be working fine as well.

Kermer avatar Jan 04 '18 10:01 Kermer

Okay, cool.

jstedfast avatar Jan 04 '18 12:01 jstedfast

Ok. While this fixed the address issue, it introduced a new bug as apparently Functions.DecodeMailHeader wasn't the best place to put it in 😢. Namely now other headers get quoted and another interesting thing is that the subject headers (so probably other headers as well) are split into parts of max. 32 chars. So instead of getting subject like Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean et sagittis nisl. it becomes "Lorem ipsum dolor sit amet, cons""ectetur adipiscing elit. Aenean ""et sagittis nisl."

Kermer avatar Jan 10 '18 09:01 Kermer

Above gets patched with e6cd1fcd0d30f1be837f255efbe757ed89075fb4 , hopefully without any further surprises.

Kermer avatar Jan 10 '18 09:01 Kermer