simple-java-mail
Emails created from plain text Outlook messages lose formatting in emailToEML()
I create an email object from a .msg file that was in Outlook's plaintext format using outlookMsgToEmail(). When I then try to use emailToEML() to save the email as a .eml, all formatting (e.g. line breaks) is lost.
On investigation it appears that outlookMsgToEmail() is creating an object with plain text content available using both getPlainText() and getHTMLText(). I'm not sure whether this is a problem in outlookMsgToEmail() or in the actual Outlook plain text file format. It then seems that emailToEML() is using the content which is plain text, but treating it as HTML, therefore losing the line breaks.
This problem doesn't occur when creating an email object with emlToEmail(): there getPlainText() returns the plain text but getHTMLText() returns null, and emailToEML() seems to treat the plain text as plain text and keep the line breaks intact.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
// Email and EmailConverter come from simple-java-mail; their packages differ between library versions.

// Load a plain text Outlook .msg, print both bodies, then write it back out as .eml.
public static void viewMsg() throws IOException {
    Email em1 = EmailConverter.outlookMsgToEmail(new File("inputPlainText.msg"));
    String htmlText = em1.getHTMLText();
    String plainText = em1.getPlainText();
    System.out.println("plainText: " + plainText);
    System.out.println("htmlText: " + htmlText);
    String emlStr = EmailConverter.emailToEML(em1);
    // try-with-resources closes the stream even if the write fails
    try (FileOutputStream outputStream = new FileOutputStream("output1.eml")) {
        outputStream.write(emlStr.getBytes(StandardCharsets.UTF_8));
    }
}

// Same round trip, but starting from a plain text .eml file.
public static void viewEml() throws IOException {
    Email em1 = EmailConverter.emlToEmail(new File("inputPlainText.eml"));
    String htmlText = em1.getHTMLText();
    String plainText = em1.getPlainText();
    System.out.println("plainText: " + plainText);
    System.out.println("htmlText: " + htmlText);
    String emlStr = EmailConverter.emailToEML(em1);
    // try-with-resources closes the stream even if the write fails
    try (FileOutputStream outputStream = new FileOutputStream("output2.eml")) {
        outputStream.write(emlStr.getBytes(StandardCharsets.UTF_8));
    }
}
I'm looking into it, but the Outlook msg actually isn't just plain text; it contains an RTF body, which is converted to HTML by the Outlook message parser library. The RTF text has line endings containing \r\n, which are not converted to <br> elements. And since email clients prioritize the HTML body over text bodies (unless you specifically configured them to ignore HTML), you see the content with the missing line breaks.
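To illustrate the difference, here is a minimal sketch of a conversion that maps every line ending to a <br> element, so the breaks survive HTML rendering. This is not the rtf-to-html library's code; the class and method names are hypothetical and only meant to show the idea.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of simple-java-mail or rtf-to-html.
final class LineBreakSketch {

    // Matches Windows (\r\n), classic Mac (\r) and Unix (\n) line endings.
    // "\r\n" is listed first so a Windows line ending counts as one break.
    private static final Pattern LINE_BREAK = Pattern.compile("\r\n|\r|\n");

    /** Escapes HTML special characters, then turns each line ending into a br element. */
    static String plainTextToHtml(String plainText) {
        String escaped = plainText
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
        return LINE_BREAK.matcher(escaped).replaceAll("<br>");
    }

    public static void main(String[] args) {
        String body = "Hello,\r\nsecond line\r\nthird line";
        // prints: Hello,<br>second line<br>third line
        System.out.println(plainTextToHtml(body));
    }
}
```

Without the replacement step, an HTML renderer collapses the raw \r\n sequences into ordinary whitespace, which matches the symptom described in this issue.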
I'm currently looking into whether the RTF conversion is correct. Stay tuned.
It's caused by this bug: https://github.com/bbottema/rtf-to-html/issues/6
Given the discussion at bbottema/rtf-to-html#6, is it possible for me to choose the RTFToHTML converter that emailToEML() uses, so if I am processing an MSG plain text file I can use the RTF2HTMLConverterJEditorPane which produces formatted HTML?
Really, the only reason the Swing converter is still there is for experimental / research purposes; it should not be considered for real-world applications. It really doesn't handle RTF properly.
That said, the others don't either in this particular (exotic?) case.
I'm just not keen on making this part of the builder api. Perhaps making it a property is enough? Like a sneaky backdoor config for Outlook RTF conversion.
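A rough sketch of what such a property-based switch could look like, purely for discussion. The property name, the class, and the stand-in converters are all hypothetical; nothing here is existing simple-java-mail or rtf-to-html API.

```java
import java.util.function.UnaryOperator;

// Hypothetical illustration of choosing a converter through a system property.
final class ConverterSelectionSketch {

    // Invented property name, used only in this sketch.
    static final String PROPERTY = "example.outlook.rtf2html.converter";

    // Stand-ins for real RTF-to-HTML converters; they only tag the input
    // so the selected one can be told apart.
    static final UnaryOperator<String> DEFAULT_CONVERTER = rtf -> "[default] " + rtf;
    static final UnaryOperator<String> SWING_CONVERTER = rtf -> "[swing] " + rtf;

    /** Returns the Swing stand-in only when the property is set to "swing". */
    static UnaryOperator<String> selectConverter() {
        String choice = System.getProperty(PROPERTY, "default");
        return "swing".equalsIgnoreCase(choice) ? SWING_CONVERTER : DEFAULT_CONVERTER;
    }

    public static void main(String[] args) {
        System.setProperty(PROPERTY, "swing");
        // prints: [swing] {\rtf1 ...}
        System.out.println(selectConverter().apply("{\\rtf1 ...}"));
    }
}
```

With this shape the builder API stays untouched, and anyone who needs the alternative converter opts in explicitly, for example with a -D flag on the JVM command line.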