E-mail messages sometimes detected as text/html or text/plain (instead of message/rfc822)

Open vsessink opened this issue 9 months ago • 5 comments

As reported in https://github.com/alephdata/ingest-file/issues/618, mail files are sometimes recognized as either text/html or text/plain. This happens, for example, when ingesting .pst files: the outgoing mail messages they contain have no Received: headers and instead seem to start with a Status: RO header.

vsessink avatar May 03 '24 11:05 vsessink

Analysis

Please note that the root cause of this problem is the use of libmagic, which guesses file types and MIME types heuristically from a file's content: a sort of we-don't-know-how-it-works-but-it-seems-to-work type of detection. It can do wonders, but it can also get things horribly wrong.

A proper fix would make use of the fact that readpst writes its e-mails with a clear .eml file name extension, so we already know that they're message/rfc822. The code that ingests the resulting files should be told the MIME type instead of trying to re-detect it (and getting it wrong). But that's beyond scope here.
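To illustrate the idea of trusting the extension that readpst already assigns, here is a minimal sketch. It is not ingest-file's actual code: `guess_mime_type` and its `detect` argument are hypothetical names, and `detect` merely stands in for whatever content-based (libmagic) detection is used otherwise.

```python
from pathlib import Path
from typing import Callable

RFC822 = "message/rfc822"


def guess_mime_type(path: str, detect: Callable[[str], str]) -> str:
    """Return the MIME type for a file extracted by readpst.

    Hypothetical helper: files ending in .eml are trusted to be e-mail
    messages; everything else is handed to `detect`, which stands in for
    content-based detection.
    """
    if Path(path).suffix.lower() == ".eml":
        return RFC822
    return detect(path)
```

With this shape, a message that starts with Status: RO would no longer depend on what the content-based detector makes of it. The open question about attachments that happen to be named *.eml applies to this approach as well.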

Workaround

Importing PST archives by hand works best as follows:

  • use readpst as used in ingestors/email/outlookpst.py, i.e. readpst -e -D -8 -cv
  • libmagic will detect message/rfc822 if a message begins with a Received header. This apparently doesn't need to be a proper RFC 2822-compliant header; just adding Received: from localhost (127.0.0.1) at the top of the message will do.
  • Thus, a simple script to only fix the problematic messages could be:
find . -type f -name '*.eml' -print0 | xargs -0 file --mime-type | grep -v message/rfc822 | cut -f1 -d: | while IFS= read -r f; do sed -i '1iReceived: from localhost (127.0.0.1)' "$f"; done
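For anyone who would rather do this from Python (for instance to avoid the cut on ":" breaking on file names that contain a colon), the same selective fix could be sketched as below. The function names are made up for illustration, and the `detect` parameter is an assumption that keeps the detector swappable; by default it shells out to the same `file --mime-type` call as the one-liner.

```python
import subprocess
from pathlib import Path
from typing import Callable, List

HEADER = b"Received: from localhost (127.0.0.1)\n"


def file_mime_type(path: Path) -> str:
    """Ask the `file` utility (libmagic) for the MIME type of `path`."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def fix_misdetected(
    root: str, detect: Callable[[Path], str] = file_mime_type
) -> List[Path]:
    """Prepend HEADER to each .eml under `root` not detected as message/rfc822.

    Returns the list of files that were modified.
    """
    fixed = []
    for path in sorted(Path(root).rglob("*.eml")):
        if not path.is_file() or detect(path) == "message/rfc822":
            continue
        # Same effect as sed '1i...': the header becomes the first line.
        path.write_bytes(HEADER + path.read_bytes())
        fixed.append(path)
    return fixed
```

Calling `fix_misdetected(".")` in the readpst output directory would mirror the shell pipeline. Whether libmagic then reports message/rfc822 for the patched files still depends on the installed magic database.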

Fix (Dirty)

  • But I'm actually thinking that a simpler fix would be to just add Received: from localhost (127.0.0.1) to every message, right after calling readpst: find . -type f -name '*.eml' -print0 | xargs -0 sed -i '1iReceived: from localhost (127.0.0.1)'

Please note that I do not know what happens if an Outlook / Exchange mailbox contains an actual attachment named 123.eml. Does readpst work around this? Does it overwrite the 123.eml mail message? The above script would surely "enhance" this e-mail attachment too, even if it weren't an actual .eml file. But that's for another time.
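If the dirty fix were to run from Python right after readpst, it might look roughly like this. This is a sketch under assumptions, not a patch for outlookpst.py: `prepend_received` is a hypothetical name, and the check that skips files already starting with the header is an addition for illustration that the sed one-liner does not have.

```python
from pathlib import Path

HEADER = b"Received: from localhost (127.0.0.1)\n"


def prepend_received(root: str) -> int:
    """Prepend HEADER to every .eml file under `root`; return how many changed.

    Files that already start with HEADER are skipped (an addition for this
    sketch), so running the function twice does not stack headers.
    """
    changed = 0
    for path in Path(root).rglob("*.eml"):
        if not path.is_file():
            continue
        data = path.read_bytes()
        if data.startswith(HEADER):
            continue
        path.write_bytes(HEADER + data)
        changed += 1
    return changed
```

Like the shell version, this touches every *.eml file it finds, so it does nothing to settle the attachment question raised above.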

vsessink avatar May 03 '24 11:05 vsessink