
Handling of Outlook MSG files and RTF bodies

Open pudo opened this issue 4 years ago • 1 comments

You have to give Microsoft credit for its consistency: instead of storing E-Mail messages in Outlook as RFC822 plain text, they came up with their own super funky file format based on OLE. We often see these in leaks (for example: the entire Panama Papers).

In Python, the most popular parser for MSG files is msg-extractor, but it's maintained by a developer who seems to prioritise implementing the spec over building a tool that can parse all the files found in the wild. I tried to fix up the encoding support in the library at some point, but the PR was rejected on the basis that I should ask the source of the files to fix their Exchange settings. This did not seem like a healthy option, vis-à-vis the Russian mafia.

So I started to maintain a fork and eventually ended up cleaning it up quite significantly. Still, two issues are pretty persistent:

a) Encodings in this file format are a mess and many files seem to be outright damaged. msglite 0.30 does much more work on this, but I'm still pretty sure we'll see issues in the future.

b) Outlook re-formats the body of many emails into RTF (Rich Text Format) upon receipt. In the best case, this means that each .msg file contains an HTML, a plain text and an RTF version of the body. But sometimes the plain text is encoding-fucked, while the RTF is not. Now, msglite does provide a msg.rtfBody property with that version, but processing it further in Python is a bit difficult.

One option, which is what we do currently, is to save the RTF as its own file and essentially declare it an attachment to the message. The attachment is then processed using convert-document and turned into a PDF. This is at best annoying (because people now need to understand that the attachment is really part of the email body), and also duplicative if the main message body was extracted correctly.
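To make the duplication point concrete, here is a minimal sketch of how one might only fall back to the RTF version when the plain text looks broken. Everything in it is an assumption for illustration: `looks_damaged`, `pick_body` and the 5% threshold are hypothetical and are not part of msglite or ingest-file.

```python
from typing import Optional, Tuple


def looks_damaged(text: str, threshold: float = 0.05) -> bool:
    """Crude heuristic (assumption): empty text, or too many replacement
    characters / stray control characters, counts as damaged."""
    if not text or not text.strip():
        return True
    suspicious = sum(
        1
        for ch in text
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\t\n\r")
    )
    return suspicious / len(text) > threshold


def pick_body(plain: Optional[str], rtf_body: Optional[str]) -> Tuple[str, str]:
    """Return (kind, body). Prefer plain text, use RTF only as a fallback."""
    if plain is not None and not looks_damaged(plain):
        return ("plain", plain)
    if rtf_body:
        return ("rtf", rtf_body)
    return ("plain", plain or "")
```

With something like this, the RTF would only be emitted when the plain text is unusable. The heuristic will not catch mojibake that decoded into valid but wrong characters, so it is a starting point at most.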

The other option would be to use striprtf to turn the RTF body into a plain text body. Unfortunately, the library currently provides an extremely naive implementation of RTF that does not handle encodings other than Unicode (cf. https://github.com/joshy/striprtf/issues/11, although the issue is larger than described there). We might want to consider PRing proper encoding support into striprtf and then adopting it.
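For context on what "proper encoding support" involves: RTF writes non-ASCII bytes as `\'xx` hex escapes, and the header can declare the ANSI code page with `\ansicpgN`. The sketch below is not striprtf code, and the function names are made up for illustration. It only decodes hex escapes using the declared code page (falling back to cp1252 as an assumption) and leaves all other control words in place. It also ignores `\uN` escapes and per-font `\fcharset` declarations, which a real fix would need to handle.

```python
import codecs
import re

HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")   # one or more \'xx escapes
HEX_BYTE = re.compile(r"\\'([0-9a-fA-F]{2})")     # a single \'xx escape


def rtf_codepage(rtf: str, default: str = "cp1252") -> str:
    """Map the \\ansicpgN declaration to a Python codec name, if usable."""
    match = re.search(r"\\ansicpg(\d+)", rtf)
    if not match:
        return default
    name = f"cp{match.group(1)}"
    try:
        codecs.lookup(name)
    except LookupError:
        return default  # unknown code page: fall back (assumption)
    return name


def decode_hex_escapes(rtf: str) -> str:
    """Replace runs of \\'xx escapes with text decoded via the code page.

    Consecutive escapes are decoded together so that multi-byte code
    pages are not split in the middle of a character.
    """
    encoding = rtf_codepage(rtf)

    def decode_run(run: re.Match) -> str:
        raw = bytes(int(h, 16) for h in HEX_BYTE.findall(run.group(0)))
        return raw.decode(encoding, errors="replace")

    return HEX_RUN.sub(decode_run, rtf)
```

For example, `decode_hex_escapes(r"{\rtf1\ansi\ansicpg1251 \'cf\'f0\'e8\'e2\'e5\'f2}")` decodes the escaped bytes as Windows-1251 Cyrillic rather than treating them as Latin-1.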

pudo • Apr 22 '21 09:04

Same goes for PST files (mailbox archives). See https://github.com/alephdata/ingest-file/issues/618#issuecomment-2092818980 for a workaround: after unpacking the PST archive, I'm going over all messages to see if the first MIME part is an application/rtf file, and if so, I'm converting the RTF part to HTML and replacing the content. It's kind of a hack. I didn't even bother to find a Python RTF-to-HTML library; I'm calling an external utility, which is rather expensive computationally. Maybe I should also check if filename=="rtf-body.rtf", but I just wanted to ingest 60 GB of data and I don't need a perfect script ;-)
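The workaround described above could look roughly like the following sketch, using only the standard library `email` package. This is not the actual script from the linked comment. `replace_rtf_body` and `first_leaf_part` are hypothetical names, and the RTF-to-HTML conversion is passed in as a callable (`rtf_to_html`) because the external utility is not named here.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import Callable, Optional


def first_leaf_part(msg: EmailMessage) -> Optional[EmailMessage]:
    """Return the first non-multipart MIME part of the message, if any."""
    for part in msg.walk():
        if not part.is_multipart():
            return part
    return None


def replace_rtf_body(raw: bytes, rtf_to_html: Callable[[bytes], str]) -> bytes:
    """If the first MIME part is application/rtf, swap it for text/html.

    Messages that do not match are returned unchanged.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    part = first_leaf_part(msg)
    if part is None or part.get_content_type() != "application/rtf":
        return raw
    html = rtf_to_html(part.get_payload(decode=True))
    # set_content() drops the old payload and Content-* headers
    # (including the attachment disposition) before writing the new body.
    part.set_content(html, subtype="html")
    return msg.as_bytes()
```

An extra check on the part's filename (for example `part.get_filename() == "rtf-body.rtf"`) could be added before converting, as suggested above, to avoid rewriting genuine RTF attachments.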

vsessink • May 03 '24 11:05