ingest-file
E-mail messages sometimes detected as text/html or text/plain (instead of message/rfc822)
As reported in https://github.com/alephdata/ingest-file/issues/618, mail files sometimes end up being recognized as either text/html or text/plain. This happens, for example, when ingesting .pst files: their outgoing mail messages don't have Received: headers but instead seem to start with a Status: RO header.
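To check whether a given libmagic version is affected, a small reproduction can help. The following is only a sketch: the function name show_detection and the sample message are made up for illustration, and the detected type may differ between libmagic versions.

```shell
# Hypothetical helper: write a minimal message that starts with "Status: RO"
# (and has no Received: header), then print what libmagic detects for it.
show_detection() {
    sample=$(mktemp) || return 1
    printf 'Status: RO\nFrom: alice@example.org\nTo: bob@example.org\nSubject: test\n\nHello\n' > "$sample"
    # -b prints only the result, --mime-type prints the MIME type instead of a description
    file -b --mime-type -- "$sample"
    rm -f -- "$sample"
}
```

Calling show_detection prints a single MIME type; anything other than message/rfc822 suggests the installation is affected by the problem described above.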
Analysis
Please note that the root cause of this problem is the use of libmagic, whose file type / MIME type detection is of the we-don't-know-how-it-works-but-it-seems-to-work kind. It can do wonders, but it can also get things horribly wrong.
A proper fix would be to make use of the fact that readpst writes out its e-mails with a clear .eml file name extension, so we already know that they're message/rfc822. The ingestion of the resulting files should be made aware of that MIME type instead of trying to re-evaluate it (and getting it wrong). But that's beyond scope here.
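The idea of trusting the extension first can be sketched in a few lines of shell. This is not how ingest-file is structured; guess_mime_type is a hypothetical helper that only illustrates the decision order: trust the .eml extension that readpst assigns, and ask libmagic (through the file command) only for everything else.

```shell
# Hypothetical helper, not part of ingest-file.
guess_mime_type() {
    case "$1" in
        # readpst output: the extension already tells us the type
        *.eml) printf '%s\n' 'message/rfc822' ;;
        # anything else: fall back to libmagic's content-based guess
        *)     file -b --mime-type -- "$1" ;;
    esac
}
```

With this ordering, a message that starts with Status: RO would still be treated as message/rfc822, because its content is never inspected.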
Workaround
Hand importing PST archives works best as follows:
- Use readpst as used in ingestors/email/outlookpst.py, i.e. readpst -e -D -8 -cv
- libmagic will detect message/rfc822 if a message begins with a Received header. This apparently doesn't need to be a proper RFC 2822 compliant header; just adding Received: from localhost (127.0.0.1) on top of the message will do.
- Thus, a simple script to fix only the problematic messages could be:
find . -type f -name '*.eml' -print0 | xargs -0 file --mime-type | grep -v message/rfc822 | cut -f1 -d: | while IFS= read -r f; do sed -i '1iReceived: from localhost (127.0.0.1)' "$f"; done
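The one-liner above splits the output of file on the first colon, so it would mishandle paths that themselves contain a colon. A slightly more defensive variant could check each file individually. This is a sketch under assumptions: fix_misdetected_eml is a made-up name, and it relies on GNU sed for the in-place 1i insert, just like the original command.

```shell
# Hypothetical helper: prepend a dummy Received: header to every .eml file
# below the given directory that libmagic does not detect as message/rfc822.
fix_misdetected_eml() {
    dir=${1:-.}
    # find passes the file names as arguments, so no output parsing is needed
    find "$dir" -type f -name '*.eml' -exec sh -c '
        for f in "$@"; do
            if [ "$(file -b --mime-type -- "$f")" != "message/rfc822" ]; then
                sed -i "1iReceived: from localhost (127.0.0.1)" "$f"
            fi
        done
    ' sh {} +
}
```

Running fix_misdetected_eml /path/to/readpst/output would leave correctly detected messages untouched and patch only the rest.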
Fix (Dirty)
- But I'm actually thinking that a simpler fix would be to just add Received: from localhost (127.0.0.1) to every message, right after calling readpst:
find . -type f -name '*.eml' -print0 | xargs -0 sed -i '1iReceived: from localhost (127.0.0.1)'
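One caveat of the blanket fix is that every run adds another copy of the header. If the step might be executed more than once on the same directory, a guard on the first line keeps it idempotent. The following is a sketch only: add_received_header is a hypothetical name, and GNU sed is assumed, as in the commands above.

```shell
# Hypothetical helper: prepend the dummy header to each given file,
# unless the file already starts with exactly that header.
add_received_header() {
    header='Received: from localhost (127.0.0.1)'
    for f in "$@"; do
        first=$(head -n 1 -- "$f")
        if [ "$first" != "$header" ]; then
            # GNU sed: insert the header before line 1, editing in place
            sed -i "1i$header" "$f"
        fi
    done
}
```

Calling add_received_header on the same file twice leaves exactly one dummy header at the top.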
Please note that I do not know what happens if an Outlook / Exchange mailbox contains an actual attachment with the name 123.eml. Does readpst work around this? Does it overwrite the 123.eml mail message? The above script would surely "enhance" this e-mail attachment, too, even if it weren't an actual .eml file. But that's for another time.