Emails created from plain text Outlook messages lose formatting in emailToEML()

Open atmcq opened this issue 4 years ago • 4 comments

I create an email object from a .msg file that was in Outlook's plaintext format using outlookMsgToEmail(). When I then try to use emailToEML() to save the email as a .eml, all formatting (e.g. line breaks) is lost.

On investigation it appears that outlookMsgToEmail() is creating an object with plain text content available using both getPlainText() and getHTMLText(). I'm not sure whether this is a problem in outlookMsgToEmail() or in the actual Outlook plain text file format. It then seems that emailToEML() is using the content which is plain text, but treating it as HTML, therefore losing the line breaks.

This problem doesn't occur when creating an email object with emlToEmail(): there, getPlainText() returns the plain text but getHTMLText() returns null, and emailToEML() seems to treat the plain text as plain text and keep the line breaks intact.

	public static void viewMsg() throws IOException {

		Email em1 = EmailConverter.outlookMsgToEmail(new File("inputPlainText.msg"));
		
		String htmlText = em1.getHTMLText();
		String plainText = em1.getPlainText();
		
		System.out.println("plainText: " + plainText);
		System.out.println("htmlText: " + htmlText);

		String emlStr = EmailConverter.emailToEML(em1);
		
		FileOutputStream outputStream = new FileOutputStream("output1.eml");
		byte[] strToBytes = emlStr.getBytes();		
		outputStream.write(strToBytes);
		outputStream.close();

	}

	public static void viewEml() throws IOException {

		Email em1 = EmailConverter.emlToEmail(new File("inputPlainText.eml"));
		
		String htmlText = em1.getHTMLText();
		String plainText = em1.getPlainText();
		
		System.out.println("plainText: " + plainText);
		System.out.println("htmlText: " + htmlText);

		String emlStr = EmailConverter.emailToEML(em1);
		
		FileOutputStream outputStream = new FileOutputStream("output2.eml");
		byte[] strToBytes = emlStr.getBytes();		
		outputStream.write(strToBytes);
		outputStream.close();

	}

samples.zip

atmcq avatar May 23 '21 22:05 atmcq

I'm looking into it, but the Outlook msg actually isn't just plain text; it contains an RTF body, which is converted to HTML by the Outlook message parser library. The RTF text has line endings containing \r\n, which are not converted to <br> elements. And since the HTML body is prioritized over text bodies by email clients (unless you specifically configured them to ignore HTML), you see the content with the missing line breaks.
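To illustrate what is missing, here is a minimal sketch of the kind of conversion that would preserve the line breaks: escape the text for HTML, then turn each line ending into a <br> element. The class and method names are hypothetical and this is not the code used by the Outlook message parser or by rtf-to-html; it only shows the effect described above.

```java
final class PlainTextToHtml {

    /**
     * Hypothetical helper: makes plain text safe to embed in HTML and keeps
     * its line breaks visible by emitting a <br> for every line ending.
     */
    static String toHtml(String plainText) {
        // Escape '&' first so the entities produced below are not escaped twice.
        String escaped = plainText
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
        // \R matches any line break sequence and treats \r\n as a single break (Java 8+).
        return escaped.replaceAll("\\R", "<br>\n");
    }
}
```

Without a step like this, an HTML renderer collapses \r\n into ordinary whitespace, which matches the lost formatting reported in this issue.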

I'm currently looking into whether the RTF conversion is correct. Stay tuned.

bbottema avatar May 29 '21 14:05 bbottema

It's caused by this bug: https://github.com/bbottema/rtf-to-html/issues/6

bbottema avatar May 29 '21 15:05 bbottema

Given the discussion at bbottema/rtf-to-html#6, is it possible for me to choose the RTFToHTML converter that emailToEML() uses, so if I am processing an MSG plain text file I can use the RTF2HTMLConverterJEditorPane which produces formatted HTML?

atmcq avatar Jun 01 '21 22:06 atmcq

Really, the only reason the Swing converter is still there is for experimental / research purposes; it should not be considered for real-world applications. It really doesn't handle RTF properly.

That said, the others don't either in this particular (exotic?) case.

I'm just not keen on making this part of the builder API. Perhaps making it a property is enough? Like a sneaky backdoor config for Outlook RTF conversion.
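A rough sketch of what such a property-based switch could look like, purely as an illustration: the property key, the enum, and the class below are all made up for this example and are not part of Simple Java Mail's actual configuration.

```java
import java.util.Properties;

final class RtfConverterChoice {

    // Hypothetical property key, invented for this sketch.
    static final String PROPERTY_KEY = "example.outlook.rtf2html.converter";

    // Hypothetical set of selectable converters.
    enum Kind { DEFAULT, SWING }

    /** Picks the converter named in the properties, falling back to DEFAULT. */
    static Kind fromProperties(Properties props) {
        String value = props.getProperty(PROPERTY_KEY, "default").trim();
        // Any unknown value falls back to the default converter.
        return "swing".equalsIgnoreCase(value) ? Kind.SWING : Kind.DEFAULT;
    }
}
```

The builder API stays untouched this way; only users who set the property would get the alternative converter.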

bbottema avatar Jun 02 '21 05:06 bbottema