Cannot Identify Stream for HTML Body of MSG File

Open fligi7 opened this issue 6 years ago • 10 comments

When parsing an HTML .msg email file, olefile is not able to enumerate/identify the stream in which the HTML Body content resides. It is missing the standard Body stream (i.e. __substg1.0_1000001E and/or __substg1.0_1000001F) and enumerating the entire list of streams via ole.listdir(streams=True,storages=True) yields no stream(text) containing the HTML body content. Yet, viewing the .msg email in a .msg viewer shows the HTML Body content as expected.
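For anyone checking the same thing, the body-related properties map to predictable stream names, so the output of ole.listdir() can be screened mechanically. This is only a minimal sketch: the helper names find_body_streams and list_msg_body_streams are made up for illustration, and the only olefile calls used are OleFileIO, listdir and close.

```python
# Property IDs that can carry a message body in a .msg file.
BODY_PROPERTY_IDS = {
    "1000": "plain text body",
    "1013": "HTML body",
    "1009": "compressed RTF body",
}
PREFIX = "__substg1.0_"


def find_body_streams(stream_paths):
    """Map each top-level body stream name to a description of its content.

    `stream_paths` is a list of path-component lists, the shape returned by
    olefile's listdir(). Streams nested inside attachment or recipient
    storages are skipped because they belong to those objects.
    """
    found = {}
    for parts in stream_paths:
        if len(parts) != 1:
            continue
        name = parts[0]
        if not name.lower().startswith(PREFIX):
            continue
        # The four hex digits after the prefix are the property ID.
        prop_id = name[len(PREFIX):len(PREFIX) + 4].upper()
        if prop_id in BODY_PROPERTY_IDS:
            found[name] = BODY_PROPERTY_IDS[prop_id]
    return found


def list_msg_body_streams(path):
    """Open a .msg file with olefile and report which body streams exist."""
    import olefile  # third-party: pip install olefile

    ole = olefile.OleFileIO(path)
    try:
        return find_body_streams(ole.listdir(streams=True, storages=False))
    finally:
        ole.close()


# Usage ("test.msg" is a placeholder path):
#     for name, meaning in list_msg_body_streams("test.msg").items():
#         print(name, "->", meaning)
```

If only the 1009 entry shows up, the HTML is most likely wrapped inside the compressed RTF body instead of having a stream of its own.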

Any idea how to identify/extract the HTML body content from a .msg file?

fligi7 avatar Mar 09 '18 17:03 fligi7

Hi @fligi7, are you using olefile alone, or with https://github.com/mattgwwalker/msg-extractor ?

If you haven't used msg-extractor, I would suggest giving it a try.

Otherwise, I would need a sample msg file to check if this is a bug in olefile.

decalage2 avatar Mar 09 '18 20:03 decalage2

I'm using the olefile library in a standalone Python script, using msg-extractor code as a guideline when I've gotten stuck or wanted to understand the library methods a bit better.

I am not using msg-extractor itself, though it also fails on the given file(s) and doesn't find/extract any Body stream/text in either the resulting message.txt or raw outputs. The emails I am using to test are a mix of both Forwarded messages (where the user simply clicked Forward and sent it) and standard HTML messages where there is clearly content in the Body.

As you can see in the ExtractMsg.py error below, it also fails to find/write a "Body" tag/stream in the email.

Error with file 'test.msg': Traceback (most recent call last):
  File "ExtractMsg.py", line 543, in <module>
    msg.save(toJson, useFileName)
  File "ExtractMsg.py", line 435, in save
    f.write(self.body)
TypeError: expected a string or other character buffer object

I tried reproducing this behavior by authoring/sending an HTML email, as well as forwarding that email, using both as samples. It appears that ExtractMsg.py only identifies/parses a Body stream from the first original HTML message but not from the Forwarded one, while my script properly extracts the Body from both.

However, these other messages I have fail on both scripts. I will have to check to see if I am able to forward these to you for analysis. If so, can you please provide an email to send to?

fligi7 avatar Mar 09 '18 21:03 fligi7

Any ideas/updates by chance?

fligi7 avatar Mar 14 '18 22:03 fligi7

Greetings, I can provide you with an update. You have participated in issues on the msg-extractor threads, so you may already know this, but I am currently the manager of the package. As such, I have become more familiar than I would ever have wanted to with the .msg format, and I can tell you why you might be able to see html body while not seeing a stream that contains it.

First, check if the stream "__substg1.0_10130102" exists, as this is where the HTML body may be stored in its direct state. Should this not exist, the HTML body might still be in the msg file, just in a slightly different place. Unfortunately, it is a true pain in the backside to untangle it from the format it would be saved in.

The stream I am talking about is "__substg1.0_10090102", the compressed RTF stream. Personally, I use the compressed_rtf package (github, pypi) to decompress it. Unfortunately, that is the easy part. The hard part is de-encapsulating the HTML from the RTF stream. The only script that I know of that does this is rtf-stream-parser, but unfortunately it is written in JavaScript. I am currently working on a Python conversion of it: first a basic version that acts exactly like that script, then eventually a MUCH more advanced version that will be extremely familiar with the RTF format and so will be able to convert it to many more formats as accurately as possible.
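To make the "easy part" concrete, here is a rough sketch of reading that stream. It assumes the 16-byte compressed-RTF header layout (compressed size, raw size, compression type, CRC, all little-endian) and relies on compressed_rtf's decompress() for the compressed case. The function names read_rtf_stream and is_encapsulated_html are invented for this example, and looking for \fromhtml in the first 512 bytes is a heuristic, not a full header parse.

```python
import struct

COMPRESSED = 0x75465A4C    # bytes "LZFu": payload is LZ-compressed
UNCOMPRESSED = 0x414C454D  # bytes "MELA": payload is stored as plain RTF


def read_rtf_stream(data: bytes) -> bytes:
    """Return the raw RTF held in a "__substg1.0_10090102" stream."""
    if len(data) < 16:
        raise ValueError("stream is shorter than the 16-byte header")
    _comp_size, raw_size, comp_type, _crc = struct.unpack_from("<IIII", data, 0)
    if comp_type == UNCOMPRESSED:
        # Plain RTF follows the header directly.
        return data[16:16 + raw_size]
    if comp_type == COMPRESSED:
        # third-party: pip install compressed_rtf
        from compressed_rtf import decompress
        return decompress(data)
    raise ValueError(f"unknown compression type 0x{comp_type:08X}")


def is_encapsulated_html(rtf: bytes) -> bool:
    """Heuristic: the RTF header declares that it wraps an HTML body."""
    return b"\\fromhtml" in rtf[:512]
```

If is_encapsulated_html() returns True, the original HTML is in there, but getting it back out still means walking the RTF token by token.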

For the sake of your sanity, I pray that you find it in the first stream.

Not sure it would help in your case, but I developed a generic RTF parser as part of my tool rtfobj in the oletools package. It's the RtfParser class here: https://github.com/decalage2/oletools/blob/master/oletools/rtfobj.py#L380

I plan to release it as a standalone package one day, I just did not have time to write proper documentation yet... If you're interested in an RTF parser, I can give some hints on how to use it.

decalage2 avatar Dec 10 '18 20:12 decalage2

I'll take a look, but I have yet to find an RTF parser in Python that 1. can de-encapsulate HTML and 2. can actually parse RTF files correctly. Sure, most of them work as long as your RTF file isn't stupid (meaning it is RTF compliant, but does a lot of it poorly), but the end user doesn't always have the luxury of a perfect RTF file.

But I will take a good proper look.

@decalage2 Okay, so I have no idea at all how I would even properly use your parser. I put data in, told it to parse, but then it doesn't return anything, so where do I go from here?

Oh wait. I think you may have misunderstood my needs. Getting the html out of rtf is not as simple as just pulling out an object. No, you actually have to go through every single rtf tag, text, etc. to do it.
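To show what "going through every single tag" means at the lowest level, here is a minimal tokenizer sketch. It is not rtf-stream-parser and not the RtfParser class from oletools, just an illustration with made-up names (TOKEN_RE, tokenize_rtf). It deliberately ignores the special cases a real parser needs, such as \binN payloads and the fallback characters that follow \uN.

```python
import re

TOKEN_RE = re.compile(
    r"\\([A-Za-z]+)(-?\d+)? ?"  # control word, optional number, optional delimiter space
    r"|\\'([0-9A-Fa-f]{2})"     # hex-escaped byte such as \'e9
    r"|\\(.)"                   # control symbol such as \{ or \*
    r"|([{}])"                  # group open / close
    r"|([^\\{}]+)",             # literal text
    re.DOTALL,
)


def tokenize_rtf(rtf: str):
    """Yield RTF tokens as tuples whose first item names the token kind."""
    for match in TOKEN_RE.finditer(rtf):
        word, param, hex_byte, symbol, brace, text = match.groups()
        if word is not None:
            yield ("control", word, int(param) if param is not None else None)
        elif hex_byte is not None:
            yield ("hex", int(hex_byte, 16))
        elif symbol is not None:
            yield ("symbol", symbol)
        elif brace is not None:
            yield ("open",) if brace == "{" else ("close",)
        else:
            # Bare CR/LF characters carry no meaning in RTF text.
            text = text.replace("\r", "").replace("\n", "")
            if text:
                yield ("text", text)
```

De-encapsulation then has to track group nesting and destinations like \*\htmltag on top of this token stream, which is where most of the work is.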

Also, @decalage2 The RTF parser I am working on is going to be quite intelligent, able to improve the format of RTF files so that it is much easier for programs to understand them with less code. So it can take an RTF file like this:

{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Calibri;}}
{\*\generator Riched20 10.0.16299}\viewkind4\uc1
\pard\sa200\sl276\slmult1\f0\fs22\lang9


\b Hello {\b0world} I am here \b0 to stay\par
\b Hello {\b0 world} I am here \b0 to stay\par
\b \b1 Hello {\b0world} I am here \b0 to stay\par
\b \b Hello {\b0 world} I am here \b0 to stay\par
\b1 \b Hello \b0 world \b I am here \b0 to stay\par


{}}

And convert it to something like this:

{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Calibri;}}
{\*\generator Riched20 10.0.16299}\viewkind4\uc1
\pard\sa200\sl276\slmult1\f0\fs22\lang9


\b Hello \b0 world \b I am here \b0 to stay\par
\b Hello \b0 world \b I am here \b0 to stay\par
\b Hello \b0 world \b I am here \b0 to stay\par
\b Hello \b0 world \b I am here \b0 to stay\par
\b Hello \b0 world \b I am here \b0 to stay\par
}

I have the following function:

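import chardet
from extract_msg import Message

# `logger` is assumed to be the poster's own logger (its setup is not shown);
# the keyword-argument calls below are kept as posted.


def decode_html_body(
        msg: Message
) -> str | None:
    """
    Decodes the html body of the email file.
    Args:
        msg: the parsed MSG file

    Returns:
        str: decoded html body, or None if no encoding worked
    """
    # chardet may report None as the encoding, so empty candidates are skipped.
    encodings = [chardet.detect(msg.htmlBody)['encoding'], 'utf-8']

    for encoding in filter(None, encodings):
        try:
            return msg.htmlBodyPrepared.decode(encoding)
        except UnicodeDecodeError:
            logger.error(message=f"Failed to decode using {encoding}.")

    logger.error(message="Failed to decode using available encodings.", email=msg.filename)
    return None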

When I run it locally I get no issues, but when I run it in Docker I get this stream error:

Stream "__substg1.0_10130102" was requested but could not be found. Returning `None`.

But right after getting the error I am able to see the HTML result from the function since I am printing it after calling.

Any ideas on how to overcome this issue?

jorgesisco avatar Dec 08 '23 15:12 jorgesisco

@jorgesisco That's more relevant to the extract-msg project (which is what your code uses). The log message is an information log and not a warning or an error, and means that the MSG file didn't contain a direct stream for the HTML body. If the code is returning something, it was generated from the RTF body or the plain text body.

As such, there isn't really an issue to overcome as far as I can see, except possibly making sure your setup isn't incorrectly showing information logs as errors (though it's possible an older version of the module incorrectly had that information log as an error or warning).
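If the goal is just to keep that message out of the output, one option is to raise the threshold on the library's logger with the standard logging module. This is a sketch under an assumption: it supposes the module's loggers live under the name "extract_msg" (the package name), which should be checked against the installed version. The names quiet_library_info and ListHandler are made up for the example.

```python
import logging


def quiet_library_info(logger_name: str = "extract_msg") -> None:
    """Let only warnings and errors through for a library's logger tree.

    The default name "extract_msg" is an assumption based on the package name.
    """
    logging.getLogger(logger_name).setLevel(logging.WARNING)


class ListHandler(logging.Handler):
    """Collects (level, message) pairs so the effect can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append((record.levelname, record.getMessage()))
```

Child loggers without their own level inherit the threshold from the nearest configured parent, so setting it once on the package-level logger covers the submodules too.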

If you have further questions, I'd recommend taking them to the extract-msg GitHub page or the Discord linked in the GitHub.

Not sure if this thread has any need to remain open at this point.

Edit: just checked, unless you are using a version older than extract-msg 0.29.0, that log is just an information one.