
Add functions to partition the email into parts

Open zaz opened this issue 7 months ago • 6 comments

It would be useful to be able to partition emails into mutually exclusive, collectively exhaustive (MECE) parts. We are already halfway there with the functions

Message::attachments()
Message::html_bodies()
Message::text_bodies()

but as far as I can see, there is nothing like

Message::other_parts()

that will list all parts that are not attachments or bodies.
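A minimal sketch of what such a complement could compute, written against plain index slices rather than the real mail-parser types. The free function `other_parts` and its signature are hypothetical; it assumes the three existing accessors can each be reduced to a list of indices into the message's flat list of parts.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper, not part of mail-parser: returns every index in
/// `0..part_count` that appears in none of the three index lists.
fn other_parts(
    part_count: usize,
    attachments: &[usize],
    html_bodies: &[usize],
    text_bodies: &[usize],
) -> Vec<usize> {
    // Union of all indices already claimed; duplicates collapse in the set.
    let claimed: BTreeSet<usize> = attachments
        .iter()
        .chain(html_bodies)
        .chain(text_bodies)
        .copied()
        .collect();

    // Whatever is left belongs to neither the attachments nor the bodies.
    (0..part_count).filter(|idx| !claimed.contains(idx)).collect()
}
```

Together with the three input lists, the result covers every part, although the input lists themselves may still overlap.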

zaz avatar May 04 '25 12:05 zaz

I may have misread this, but attachments, html_bodies and text_bodies are not mutually exclusive; see #67 for a case where all of them overlap.

It would be nice if these functions had invariants that one could document, but for the time being my experience has been that their semantics are not very intuitive.

sftse avatar May 13 '25 16:05 sftse

Perhaps I should have said Message::body_html(0)? I believe that does not include attachments.

In any case, there should be a clear partitioning available.

zaz avatar May 13 '25 20:05 zaz

The order of the parts is in some sense irrelevant; see this modification of #67:

Subject: Test message from Netscape Communicator 4.7
Content-Type: multipart/mixed;
 boundary="------------C78F594988075E36AE03C243"

This is a multi-part message in MIME format.
--------------C78F594988075E36AE03C243
Content-Type: image/png;
 name="greenball.png"
Content-Transfer-Encoding: base64
Content-Disposition: inline;
 filename="greenball.png"

iVBORw0KGgoAAAANSUhEUgAAABsAAAAbCAMAAAC6CgRnAAADAFBMVEX///8AAAAAEAAAGAAA

--------------C78F594988075E36AE03C243
Content-Type: multipart/alternative;
 boundary="------------D74AE2393FB01D1B284AE257"

--------------D74AE2393FB01D1B284AE257
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable

--------------D74AE2393FB01D1B284AE257
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 7bit
<html></html>

--------------D74AE2393FB01D1B284AE257--

--------------C78F594988075E36AE03C243--

Here Message::body_html(0).is_none(), because the first part is PartType::InlineBinary. But that same part also appears in Message::attachments().

> In any case, there should be a clear partitioning available.

By what criteria should this partition be done?

sftse avatar May 14 '25 13:05 sftse

Since you mentioned four functions, it is not clear to me whether you expected those four sets to form a partition. It's possible to define X = attachments U html_bodies U text_bodies and get the two-way partition all_parts = X U complement(X), but it is unclear how useful this is.
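Whether a given family of index sets actually forms a partition can be checked mechanically. The sketch below is an illustration only: `PartitionReport` and `check_partition` are hypothetical names, and the sets are plain vectors of part indices, not values returned by mail-parser.

```rust
use std::collections::BTreeSet;

/// Hypothetical report type: which part indices break the partition property.
#[derive(Debug, PartialEq)]
struct PartitionReport {
    /// Indices claimed by more than one set (not mutually exclusive).
    overlapping: Vec<usize>,
    /// Indices claimed by no set (not collectively exhaustive).
    missing: Vec<usize>,
}

/// Counts, for every index in `0..part_count`, how many sets contain it.
fn check_partition(part_count: usize, sets: &[Vec<usize>]) -> PartitionReport {
    let mut counts = vec![0usize; part_count];
    for set in sets {
        // Deduplicate within a set so a repeated index is not mistaken for overlap.
        let unique: BTreeSet<usize> = set.iter().copied().collect();
        for idx in unique {
            if idx < part_count {
                counts[idx] += 1;
            }
        }
    }

    let mut report = PartitionReport { overlapping: Vec::new(), missing: Vec::new() };
    for (idx, &n) in counts.iter().enumerate() {
        match n {
            0 => report.missing.push(idx),
            1 => {}
            _ => report.overlapping.push(idx),
        }
    }
    report
}
```

The sets form a partition exactly when both `overlapping` and `missing` come back empty. Adding complement(X) as a further set always empties `missing`, but it does nothing about overlaps inside X.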

sftse avatar May 14 '25 13:05 sftse

The way mail-parser handles the example email you gave seems like a bug to me; either in mail-parser itself or the documentation. I would expect from the usage example, and the name of the function, that body_html(0) would return the HTML body (the first HTML body in the unusual event that there are multiple). I'm confused why it would return parts whose types are not text/html.

IMO, it would be good to have a way to partition it into the following:

  1. All HTML body parts (type text/html) not converted from plaintext
  2. All text body parts (type text/plain) not converted from HTML
  3. All other body parts
  4. All attachments
  5. Anything else?

There can be multiple ways to partition an email, but having at least one or two sensible ways built in would be useful in general. Just one example: I'm working on an email deduplicator that, after checking whether the message-IDs match, double-checks the rest (because duplicates often differ in encoding, headers, whitespace, etc.) and tells the user exactly what differs. This check should be MECE: it should tell the user exactly what differs, and it should cover every part of the email. But it really is much more general than that. Any situation where you don't want to double-count or miss anything would benefit.
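One way to get a partition that is MECE by construction is to assign each part exactly one category. The sketch below is only an illustration under stated assumptions: `PartInfo`, `Category`, `classify` and `partition` are hypothetical and not part of mail-parser, the content type is assumed to be already lowercased, and the rule "attachment disposition first, then content type" is just one possible criterion. It looks only at the declared type, so nothing is converted between HTML and plain text.

```rust
use std::collections::BTreeMap;

/// Hypothetical categories, mirroring the list above plus a bucket for
/// multipart containers ("anything else").
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Category {
    HtmlBody,
    TextBody,
    OtherBody,
    Attachment,
    Container,
}

/// Hypothetical, simplified view of one MIME part.
struct PartInfo<'a> {
    /// Lowercased type/subtype, e.g. "text/html".
    content_type: &'a str,
    /// True if the part declares `Content-Disposition: attachment`.
    is_attachment: bool,
}

/// Every part maps to exactly one category, so the categories cannot overlap.
fn classify(part: &PartInfo<'_>) -> Category {
    if part.is_attachment {
        Category::Attachment
    } else if part.content_type.starts_with("multipart/") {
        Category::Container
    } else {
        match part.content_type {
            "text/html" => Category::HtmlBody,
            "text/plain" => Category::TextBody,
            _ => Category::OtherBody,
        }
    }
}

/// Groups part indices by category; every index lands in exactly one group.
fn partition(parts: &[PartInfo<'_>]) -> BTreeMap<Category, Vec<usize>> {
    let mut groups = BTreeMap::new();
    for (idx, part) in parts.iter().enumerate() {
        groups.entry(classify(part)).or_insert_with(Vec::new).push(idx);
    }
    groups
}
```

Under this particular rule an inline image ends up in `OtherBody`, not in `Attachment`. A different rule would move it, but the result would still be a partition.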

I'm not sure what to think about your example above because I'm not very familiar with the RFC, and your HTML and text parts don't show up in Thunderbird and body_html(1) returns an empty string. Do you know what's going on there?

zaz avatar May 18 '25 14:05 zaz

The content was truncated to better highlight the structure of the mail. Here's the full example

Message-ID: <[email protected]>
Date: Wed, 17 May 2000 23:08:29 -0400
From: Doug Sauder <[email protected]>
X-Mailer: Mozilla 4.7 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Joe Blow <[email protected]>
Subject: Test message from Netscape Communicator 4.7
Content-Type: multipart/mixed;
 boundary="------------C78F594988075E36AE03C243"

This is a multi-part message in MIME format.

--------------C78F594988075E36AE03C243
Content-Type: image/png;
 name="greenball.png"
Content-Transfer-Encoding: base64
Content-Disposition: inline;
 filename="greenball.png"

iVBORw0KGgoAAAANSUhEUgAAABsAAAAbCAMAAAC6CgRnAAADAFBMVEX///8AAAAAEAAAGAAA
IQAACAAAMQAAQgAAUgAAWgAASgAIYwAIcwAIewAQjAAIawAAOQAAYwAQlAAQnAAhpQAQpQAh
rQBCvRhjxjFjxjlSxiEpzgAYvQAQrQAYrQAhvQCU1mOt1nuE1lJK3hgh1gAYxgAYtQAAKQBC
zhDO55Te563G55SU52NS5yEh3gAYzgBS3iGc52vW75y974yE71JC7xCt73ul3nNa7ykh5wAY
1gAx5wBS7yFr7zlK7xgp5wAp7wAx7wAIhAAQtQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAp
1fnZAAAAAXRSTlMAQObYZgAAABZ0RVh0U29mdHdhcmUAZ2lmMnBuZyAyLjAuMT1evmgAAAFt
SURBVHicddJtV8IgFAdwD2zIgMEE1+NcqdsoK+m5tCyz7/+ZiLmHsyzvq53zO/cy+N9ery1b
Ve9PWQA9z4MQ+H8Yoj7GASZ95IHfaBGmLOSchyIgyOu22mgQSjUcDuNYcoGjLiLK1cHh0fHJ
aTKKOcMItgYxT89OzsfjyTTLC8UF0c2ZNmKquJhczq6ub+YmSVUYRF59GeDastu7+9nD41Nm
kiJ2jc2J3kAWZ9Pr55fH18XSmRuKUTXUaqHy7O19tfr4NFle/w3YDrWRUIlZrL/W86XJkyJV
G9EaEjIx2XyZmZJGioeUaL+2AY8TY8omR6nkLKhu70zjUKVJXsp3quS2DVSJWNh3zzJKCyex
I0ZxBP3afE0ElyqOlZJyw8r3BE2SFiJCyxA434SCkg65RhdeQBljQtCg39LWrA90RDDG1EWr
YUO23hMANUKRRl61E529cR++D2G5LK002dr/qrcfu9u0V3bxn/XdhR/NYeeN0ggsLAAAACV0
RVh0Q29tbWVudABjbGlwMmdpZiB2LjAuNiBieSBZdmVzIFBpZ3VldDZzO7wAAAAASUVORK5C
YII=
--------------C78F594988075E36AE03C243
Content-Type: multipart/alternative;
 boundary="------------D74AE2393FB01D1B284AE257"

--------------D74AE2393FB01D1B284AE257
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable

Die Hasen und die Fr=F6sche

Die Hasen klagten einst =FCber ihre mi=DFliche Lage; "wir leben", sprach =
ein
Redner, "in steter Furcht vor Menschen und Tieren, eine Beute der Hunde,
der Adler, ja fast aller Raubtiere! Unsere stete Angst ist =E4rger als de=
r
Tod selbst. Auf, la=DFt uns ein f=FCr allemal sterben."

--------------D74AE2393FB01D1B284AE257
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 7bit

<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<b>Die Hasen und die Fr&ouml;sche</b>
<p>Die Hasen klagten einst &uuml;ber ihre mi&szlig;liche Lage; "wir leben",
sprach ein Redner, "in steter Furcht vor Menschen und Tieren, eine Beute
der Hunde, der Adler, ja fast aller Raubtiere! Unsere stete Angst ist &auml;rger
als der Tod selbst. Auf, la&szlig;t uns ein f&uuml;r allemal sterben."
</html>

--------------D74AE2393FB01D1B284AE257--
--------------C78F594988075E36AE03C243--

> The way mail-parser handles the example email you gave seems like a bug to me; either in mail-parser itself or the documentation. I would expect from the usage example, and the name of the function, that body_html(0) would return the HTML body (the first HTML body in the unusual event that there are multiple). I'm confused why it would return parts whose types are not text/html.

There's a reason for this: the attachment is declared as inline. This most often means that it is referenced in the HTML and rendered together with it. It is, so to speak, equivalent to being pasted into the HTML. What is harder to explain, and imho not conformant with RFC 8621, is why those inline attachments also appear as standalone attachments in Message::attachments().

>   • All HTML body parts (type text/html) not converted from plaintext
>   • All text body parts (type text/plain) not converted from HTML
>   • All other body parts
>   • All attachments
>   • Anything else?

That partition might make sense, but afaik it is not covered by any of the current API functions. It would need an entirely different set of functions.

sftse avatar May 19 '25 09:05 sftse