Haraka
bodytext should use `\r\n` for line endings.
This is more of a nit-pick based on the RFCs, but MIME requires `\r\n` as the line ending for text. Haraka decodes lines and then uses `\n` instead. When content is transferred as 7-bit, it is therefore ambiguous whether `\n` was intended by the sender.

Since `\r\n` is what the RFCs specify, and very likely what Haraka actually receives, should that line ending be used instead when presenting bodytext?
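For illustration, the ambiguity can be shown in a few lines of Node (a hypothetical sketch, not Haraka's actual code path): a part that arrives with the RFC-mandated CRLF endings is exposed with bare LF, so a consumer can no longer tell which ending the sender used.

```javascript
// Hypothetical illustration of the normalization being discussed.
const wire = 'Hello\r\nWorld\r\n';            // what the RFCs require on the wire
const bodytext = wire.replace(/\r\n/g, '\n'); // what gets exposed to plugins
console.log(JSON.stringify(bodytext));        // "Hello\nWorld\n"
```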
bodytext is considered to be decoded, not what was sent in the email. It could have been base64 or qp encoded, and in a different character set, so there's no expectation of maintaining line endings either.
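A short sketch of why that is (illustrative values, not Haraka internals): once a base64 part in another charset is decoded, the bytes behind bodytext already differ from the wire bytes, line endings aside.

```javascript
// "café" sent as base64 over latin1: decoding plus charset conversion
// means bodytext's bytes cannot match what the socket received.
const wireB64 = 'Y2Fm6Q==';                      // base64 of "café" in latin1
const rawBytes = Buffer.from(wireB64, 'base64'); // 4 bytes: 63 61 66 e9
const bodytext = rawBytes.toString('latin1');    // decoded to a JS string
const reencoded = Buffer.from(bodytext, 'utf8'); // 5 bytes: 63 61 66 c3 a9
console.log(rawBytes.length, reencoded.length);  // 4 5
```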
Considering the context though, it sure seems logical that bodies always have CRLF endings. I'm actually kind of surprised that we don't have CRLF defined as a constant in haraka-constants, and use it widely.
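A hypothetical version of such a constant (this is not in haraka-constants today) might look like:

```javascript
// Sketch: define CRLF once and reuse it wherever line endings are emitted.
const constants = { CRLF: '\r\n' };

const reply = ['250 OK', ''].join(constants.CRLF);
console.log(JSON.stringify(reply)); // "250 OK\r\n"
```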
My point here was exactly what @msimerson pointed out, which is that unless you have a broken SMTP client, you'll always get CRLF for 7/8-bit text parts, not a bare LF.
Ideally, Haraka gives plugins an opportunity to see exactly what was submitted, optionally "normalizing" line endings. If normalizing line endings is desired, that seems like a job for filters, not something that is enabled by default.
QP and Base64 exist explicitly to preserve the bytes of the original part. I think we can agree that in that case, Haraka should emit exactly the bytes that it received.
But it's also decoded from whatever encoding was used to utf-8, so it's explicitly NOT exactly the bytes received.
Charset is used to interpret what the bytes mean, not which bytes to keep or discard. Re-encoding that string to the same charset should not be lossy. That is fundamentally different from Haraka choosing to modify the text after it has been decoded.
There's no guarantee that round-tripping encodings produces the same bytes. There are different ways of representing the same characters in unicode.
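A concrete example of this: "é" can be encoded precomposed (U+00E9) or as "e" plus a combining accent (U+0301). The strings are canonically equivalent, but their UTF-8 bytes differ, so a decode/re-encode round trip need not be byte-stable.

```javascript
// Two Unicode spellings of the same character.
const precomposed = '\u00e9';  // "é" as a single code point
const combining = 'e\u0301';   // "e" followed by combining acute accent

console.log(precomposed.normalize('NFC') === combining.normalize('NFC')); // true
console.log(Buffer.from(precomposed, 'utf8').length,  // 2 bytes: c3 a9
            Buffer.from(combining, 'utf8').length);   // 3 bytes: 65 cc 81
```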
All I'm saying is that to expect bodytext to contain the pristine bytes is an invalid expectation, it's fundamentally not the pristine bytes from decoding from QP or Base64. It's a decoded form that is designed to be used for things unrelated to the mail stream itself (such as saving the text to a database, looking for URLs, etc). To that end, I don't see why we'd be expected to maintain \r\n line endings.
Yes, I understand that normalizing code points could cause the raw bytes to change when re-encoded, but unicode specifically allows this, and indicates that those code points "mean" the same thing. That is a potentially necessary modification of the data based on node or iconv's implementations, so that seems reasonable to me.
I don't think this is a very strong argument: "since some stuff can change, we should feel OK also modifying content that would otherwise not change."
Maybe we can come around to another way of thinking about this:
The raw bytes of a `text/*` part are important to me, but I don't want to read/parse the raw message stream directly, because Haraka has a better parsing implementation than I would write, and the decoding functions are not available to me in plugins (so I'd have to hack to get at them).

Can I easily gain access to the unmodified text bytes so that I can examine them directly (as I might for attachment content)?
I'd support an option that says: bodytext should be the raw encoded data (encoding AND \r\n status).
So you'd do something like:
```js
transaction.parse_body = true;
transaction.parse_body_raw = true;
```
And then you'd get your desired behaviour, and it wouldn't even decode it to UTF-8.
I think that idea could work, but the internal conditionals and code for what to assign to `bodytext` could get complicated. If I were doing this, I would think about having a property called `body_part_handling`, which might be "none", "parse", or "buffer", and letting that drive the behavior, instead of defining the behavior across multiple properties. However, having the `bodytext` property change type based on these flags seems like it'll be confusing to users.
Another way that could fit better with current Haraka functionality might be:
Give text parts the same treatment as attachments
This would be done by adding another hook to mailbody/transaction for `text_part_start`. This hook would be like `attachment_start`, except it emits text part streams, exactly as they are provided off of the socket.

Then, update the docs to explicitly state that `bodytext` is a best effort by Haraka to decode the raw message text into something easily consumable (primarily for display).
It would have been nice to know that Haraka didn't preserve CRLF before I set up SMTP connections in another Node server (my IMAP) that depended on them. This should at least be made clearer in the documentation.
You're always free to make a PR if you feel it's needed.
True, I'll probably do that later when I have more time 😂