convert-outlook-msg-file UnicodeDecodeError

I’ve been encountering messages such as the following when trying to convert several MSG files: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 933: invalid start byte. Unfortunately, I can’t share the MSGs, out of privacy concerns.

I noticed that if I change line #321 in outlookmsgfile.py from: return value.decode("utf8") to: return value.decode("latin-1")

… then those MSGs seem to be processed fine. I don’t understand the comment on lines 318-320 very well, and I’m not familiar enough with Python to understand the whole code, so I thought I’d just mention this here rather than doing a pull request.
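For anyone trying to reproduce this without the original files, here is a minimal sketch of what that change does. The sample bytes are made up; the only thing taken from the report is the 0x96 byte. Note that latin-1 never raises because it maps every byte to a code point, but it turns 0x96 into an invisible control character, whereas Windows-1252 maps it to an en dash:

```python
# Made-up sample: ASCII text with a single 0x96 byte, as in the reported error.
raw = b"10 \x96 20"

try:
    raw.decode("utf8")
    utf8_error = None
except UnicodeDecodeError as exc:
    # 0x96 is a UTF-8 continuation byte, so it cannot start a character.
    utf8_error = str(exc)

# latin-1 accepts any byte, but 0x96 becomes the control character U+0096.
as_latin1 = raw.decode("latin-1")

# cp1252 (Windows-1252) maps 0x96 to U+2013, an en dash.
as_cp1252 = raw.decode("cp1252")

print(utf8_error)
print(repr(as_latin1))
print(repr(as_cp1252))
```

So latin-1 makes the exception go away, but if the text really is Windows-1252, punctuation such as dashes and curly quotes may come out as control characters.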

If any more information or testing on those files is needed, please ask.

Mar 08 '23 12:03 palpalpalpal

Based on my code comment above that line, that change might be entirely right.

Mar 10 '23 12:03 JoshData

I'm having a similar problem with some .msg files from our users. They have 0x93 and 0x94 bytes all over the place ("smart quotes" from Windows-1252) that make the UTF-8 decoder sad.

The problem seems to be that outlookmsgfile.py tries to decode all String8 fields using the UTF-8 codec, which is not correct. It should use the code page that is defined in a separate property (if I'm reading the Microsoft documentation correctly: property 0x3FDE for the body, property 0x3FFD for the rest of the String8 fields).
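A rough sketch of the codec-selection part, assuming the code page has already been read as an integer. The helper name codec_for_code_page and the cp1252 default are hypothetical and not part of outlookmsgfile.py; the mapping only covers a few code pages whose Python codec name is not simply "cp" plus the number:

```python
import codecs

# Windows code page numbers whose Python codec name is not "cp<number>".
_SPECIAL_CODE_PAGES = {
    65001: "utf-8",
    28591: "iso-8859-1",
    20127: "ascii",
}


def codec_for_code_page(code_page, default="cp1252"):
    """Return a Python codec name for a Windows code page number.

    Hypothetical helper: falls back to `default` (an assumption, chosen
    because Windows-1252 is what the affected files appear to use) when
    Python has no codec for the given number.
    """
    name = _SPECIAL_CODE_PAGES.get(code_page, f"cp{code_page}")
    try:
        return codecs.lookup(name).name
    except LookupError:
        return default


# The "smart quote" bytes mentioned above decode cleanly once cp1252 is used.
print(b"\x93hello\x94".decode(codec_for_code_page(1252)))
```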

I'm currently working on a patch that adds some codec-selection logic, but there doesn't seem to be a way to do this without partially rewriting the property parser.

The problems I have run into so far:

- The "code page" properties are seldom defined right at the start of the __properties_version1.0 stream, but there needs to be a way to find them before decoding any String8 type properties.
- Where should the "currently active" encoding(s) be stored? The decoder classes all only have one static method. I'm assuming this is per message (not sure if it can be changed in attachments (message-in-message)). I don't want to add globals for this.
- Some .msg files specify the body to be UTF-8 (code page 65001) when they're clearly code page 1252. I don't know if this is because I'm reading Microsoft's documentation of the properties wrong (look for "PidTagBody", "PidTagInternetCodepage" and "PidTagMessageCodepage"), or if the Outlook version my users use has a bug. Either way, some kind of fallback mechanism is probably needed (first try body code page, if that fails, use the "message" code page).
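One possible shape for this, sketched under heavy assumptions: the properties are taken to be already split into (property_id, value) pairs, with integers for numeric properties and raw bytes for String8 ones. That input format, the function names, and the cp1252 default are all hypothetical and do not reflect how outlookmsgfile.py is currently structured. The idea is two passes (find the code pages first, decode afterwards), with the encodings kept in local variables per message rather than in globals:

```python
PID_TAG_BODY = 0x1000
PID_TAG_INTERNET_CODEPAGE = 0x3FDE
PID_TAG_MESSAGE_CODEPAGE = 0x3FFD

# Small illustrative mapping; a real patch would need a fuller table.
CODECS = {65001: "utf-8", 28591: "iso-8859-1", 1252: "cp1252"}


def decode_with_fallback(raw, codec_names):
    """Try each codec in order; latin-1 is the last resort and never fails."""
    for name in codec_names:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode("latin-1")


def decode_properties(entries):
    """Decode String8 values in `entries`, a list of (property_id, value)."""
    # Pass 1: the code page properties may appear anywhere, so find them first.
    code_pages = {
        pid: value
        for pid, value in entries
        if pid in (PID_TAG_INTERNET_CODEPAGE, PID_TAG_MESSAGE_CODEPAGE)
        and isinstance(value, int)
    }
    # Assumed default when the message declares no (known) code page.
    message_codec = CODECS.get(code_pages.get(PID_TAG_MESSAGE_CODEPAGE), "cp1252")
    body_codec = CODECS.get(code_pages.get(PID_TAG_INTERNET_CODEPAGE), message_codec)

    # Pass 2: decode, trying the body code page first for the body only.
    decoded = {}
    for pid, value in entries:
        if isinstance(value, bytes):
            order = [body_codec, message_codec] if pid == PID_TAG_BODY else [message_codec]
            decoded[pid] = decode_with_fallback(value, order)
        else:
            decoded[pid] = value
    return decoded


# A body that claims UTF-8 (65001) but actually contains cp1252 smart quotes.
sample = [
    (PID_TAG_BODY, b"\x93quoted\x94"),
    (0x0037, b"caf\xe9"),  # subject
    (PID_TAG_INTERNET_CODEPAGE, 65001),
    (PID_TAG_MESSAGE_CODEPAGE, 1252),
]
result = decode_properties(sample)
print(result[PID_TAG_BODY], result[0x0037])
```

Because the chosen codecs live inside a single call, an embedded message could in principle be decoded with its own call and its own code pages, though whether attachments can really differ is still an open question.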

Conversion works when I use the msgconvert script -- I think this works accidentally though, because of the way Perl works "magically" with "character strings" and "byte strings".

Feb 22 '24 20:02 MartijnVdS

After a bit more reading and poking around with different msg files I think I've figured out how it is supposed to work:

0x3fde is the encoding of the original message (which could be used to re-create a message in its original encoding)

0x3ffd is the encoding of all String8 type fields.

So if you have a .msg for a received email that was sent in iso-8859-1 but windows/outlook is set to use code page 1252, property 0x3fde will be 28591 (the "code page number" for iso-8859-1) and 0x3ffd will be 1252

Feb 22 '24 21:02 MartijnVdS

Where should the "currently active" encoding(s) be stored? The decoder classes all only have one static method. I'm assuming this is per message (not sure if it can be changed in attachments (message-in-message)). I don't want to add globals for this.

If you can take a shot at a rough pull request, I can probably find some time to square away things like this. I'm not sure ahead of time what the right approach would be.

Thanks for the deep dive.

Feb 22 '24 22:02 JoshData