incubator-ponymail
Bug: import/archive don't unfold headers before storage in ES
The mail parser returns the headers as-is, including line-wraps.
This is what is wanted for the raw email source, but is not really suitable for fields such as In-Reply-To, References etc.
It seems there is no unfold method in the Python email or mailbox modules, so it looks like it is necessary to write one.
Header values that have not been folded cannot contain CRLF, so unfolding should just be a matter of stripping these out. There should be no need to check if the CRLF is followed by whitespace (assuming the folding has been done correctly).
It might make sense to compress runs of WS to a single space in case there is some variation. This should make it easier to match things like In-Reply-To.
Another header that should be unfolded is Subject:
When a line is folded, the MTA inserts CRLF followed by a single WSP; unfolding should be the reverse. Multiple WSP should be treated as a single WSP when they separate tokens.
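As a rough illustration of the unfolding described so far, here is a minimal sketch. The name `unfold_header` is hypothetical and not part of Pony Mail or the Python standard library. It assumes that stripping line breaks and then collapsing runs of space/tab to a single space is acceptable for headers such as Subject, In-Reply-To and References, and it also accepts bare LF or CR because parsed mail often no longer carries CRLF.

```python
import re

# Line breaks left behind by folding: CRLF, or a bare CR / LF.
_LINE_BREAKS = re.compile(r"\r\n|\r|\n")
# Runs of WSP (space or tab) that separate tokens.
_WSP_RUN = re.compile(r"[ \t]+")


def unfold_header(value):
    """Return a header value with folds removed (hypothetical helper).

    Step 1: drop every line break, without checking what follows it.
    Step 2: compress each run of WSP to one space and trim the ends.
    """
    unfolded = _LINE_BREAKS.sub("", value)
    return _WSP_RUN.sub(" ", unfolded).strip()


if __name__ == "__main__":
    print(repr(unfold_header("Re: a rather long\r\n subject line")))
    print(repr(unfold_header("<a@example.com>\r\n\t<b@example.com>")))
```

Because the whitespace that followed each line break is kept and then collapsed, two copies of the same header folded at different columns should reduce to the same string.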
The issue also affects Message-ID which can be wrapped within the value.
Note that some MTAs may fold at 78 chars; some may fold at a longer line length. So the path that an email takes to the archiver may affect the raw layout. Potentially an email that is sent to multiple lists could travel by different routes to the lists and from the lists to the archiver. The headers (and body) may end up with different folds.
Failure to unfold headers can also result in invalid Message-IDs being stored, for example:
$ curl -s 'https://lists.apache.org/api/thread.lua?id=4b36027f9230b84f388d719e684536a910b53c304b2c2b05f5efe0fe@1135257469@%3Cusers.continuum.apache.org%3E' | python3 -m json.tool | fgrep message
"message-id": "\n 0775DD7F2F88084AA05BCC79EF6F32532EA3A12B@iblonce105.gb.ad.drkw.net",
$ curl -s 'https://lists.apache.org/api/thread.lua?id=01e80b71b223b176e0c8af1efeab5be229719e398669419d46d18f53@1078215889@<dev.cocoon.apache.org>' | python3 -m json.tool | fgrep message
"message-id": "<[email protected]\n\tz>",
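The second example shows a fold inside the identifier itself, so collapsing whitespace to a single space would still leave an invalid value. One possible approach, sketched below, is to remove all whitespace from a Message-ID. The function name `clean_message_id` and the sample values are made up for illustration, and the sketch assumes that a valid Message-ID never needs embedded whitespace; that assumption would not hold for headers that carry several identifiers, such as References.

```python
def clean_message_id(value):
    """Remove every whitespace character from a single Message-ID value.

    Hypothetical helper: str.split() with no arguments splits on any run
    of whitespace (space, tab, CR, LF), so joining the pieces drops both
    the line break and the WSP that a fold left inside the identifier.
    """
    return "".join(value.split())


if __name__ == "__main__":
    # Fold before the identifier (made-up address).
    print(repr(clean_message_id("\n <1234@mail.example.com>")))
    # Fold inside the identifier (made-up address).
    print(repr(clean_message_id("<5678@mail.example.or\n\tg>")))
```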