Special characters not displayed correctly

Open varac opened this issue 5 years ago • 22 comments

In some mails special characters (in this case German umlauts) are not displayed correctly in the message view. E.g. können displays as k�nnen. In the same terminal, notmuch show --format=raw shows the special characters correctly.

In a mail that displays badly I noticed these Content headers:

Content-Type           multipart/mixed; boundary="------------E72523F6326A74524F4B3BE4"
Content-Language       de-DE

Another mail that displays fine had these headers:

Content-Type              text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding 8bit
Content-Language          en-US

Software Versions

  • Python version: 3.6.5
  • Notmuch version: 0.26
  • Alot version: master

varac avatar Sep 15 '18 07:09 varac

Just as a reference, msgs with garbled Umlauts in alot display fine with k9mail on android.

varac avatar Sep 15 '18 07:09 varac

I also receive Japanese messages (using alot master), with the titles displayed perfectly:

MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-2022-JP"
Content-Transfer-Encoding: 7bit
Subject: [OFFICE-TYO 15779]
 =?iso-2022-jp?b?W09GRklDRS1BTEwgMTA4NzRdIBskQiFaOm42SDMrO08bKEI=?=
 =?iso-2022-jp?b?GyRCTyJNbSFbGyhCIFsgGyRCJWElcyVGJUolcyU5PnBKcxsoQiAtIA==?=
 =?iso-2022-jp?b?GyRCOERKTBsoQiBdIBskQiVtJTAlJCVzJTIhPCVIJSYlJyUkJTUbKEI=?=
 =?iso-2022-jp?b?GyRCITwlUBsoQih0emstbGdzLWVndzAxKRskQiVhJXMlRiVKJXMbKEI=?=
 =?iso-2022-jp?b?GyRCJTkkTiQqQ04kaSQ7GyhC?=

while the actual content appears as:


?$B%9%?%C%U3F0L?(B

?$B%0%k!<%W<RFbJs!V?(Bi.X.?$B!W$K?7$7$$5-;v$r7G:\$7$^$7$?!#?(B

I don't mind sharing such emails with maintainers (I lurk on irc with the same nickname).

Likewise, mails appear just fine on astroid or in notmuch.
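
For reference, the garbled body looks like ISO-2022-JP text whose escape bytes were lost. That encoding switches character sets with ESC $ B and ESC ( B, and the output above shows ?$B and ?(B in exactly those positions. Below is a minimal Python sketch, assuming each leading "?" stands in for the ESC byte (0x1b). That is an inference from the displayed text, not something confirmed by a log, and `restore_escapes` is a hypothetical helper, not part of alot.

```python
from email.header import decode_header, make_header

# One encoded word copied from the Subject header above.
subject_word = "=?iso-2022-jp?b?GyRCJTkkTiQqQ04kaSQ7GyhC?="
subject = str(make_header(decode_header(subject_word)))

# First body line as displayed, with "?" where ESC is assumed to have been.
displayed = "?$B%9%?%C%U3F0L?(B"


def restore_escapes(text: str) -> bytes:
    """Hypothetical helper: put ESC back in front of the ISO-2022-JP shift sequences."""
    raw = text.encode("ascii")
    return raw.replace(b"?$B", b"\x1b$B").replace(b"?(B", b"\x1b(B")


body = restore_escapes(displayed).decode("iso-2022-jp")
print(subject)
print(body)
```

If the assumption holds, both strings come out as readable Japanese, which would suggest the header path decodes ISO-2022-JP correctly while the body path loses the escape bytes somewhere before display.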

teto avatar Sep 15 '18 09:09 teto

@teto could you test this mail on this branch and report back with a logfile?

pazz avatar Dec 06 '18 09:12 pazz

@varac regarding your original messages: I'm not sure what Content-Language does, but what matters for decoding the message's payload is the Content-Transfer-Encoding header. Your second message has this one set to indicate a utf8 encoding; the first one (unless there is a CTE header that you omitted) indicates ascii. If there are indeed non-ascii characters in the payload, then there is little we can do. Just FYI, what happens in this case is that we use libmagic to guess the correct encoding, fall back to utf8, and drop/ignore non-utf8 characters.
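
To illustrate the fallback idea, here is a rough Python sketch. It is not alot's actual code: it leaves out the libmagic guess entirely and only shows a strict decode with the declared charset followed by a lossy UTF-8 decode. `decode_with_fallback` is a hypothetical name.

```python
def decode_with_fallback(raw: bytes, declared: str = "ascii") -> str:
    """Decode strictly with the declared charset, else fall back to lossy UTF-8."""
    try:
        return raw.decode(declared)
    except (UnicodeDecodeError, LookupError):
        # errors="replace" substitutes U+FFFD for bytes that are not valid UTF-8
        return raw.decode("utf-8", errors="replace")


# "können" stored as ISO-8859-1 but treated as ascii/utf-8
latin1_bytes = "können".encode("latin-1")
print(decode_with_fallback(latin1_bytes))

# The same word stored as UTF-8 and declared as such decodes cleanly
print(decode_with_fallback("können".encode("utf-8"), declared="utf-8"))
```

The first call reproduces the k�nnen seen in the original report, which is consistent with a Latin-1 body being decoded as UTF-8, though other causes are possible.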

pazz avatar Dec 06 '18 09:12 pazz

@pazz I will definitely do it but maybe sometime next week

teto avatar Dec 17 '18 03:12 teto

Since the mentioned patch got merged into master, I installed master, but the bug persists:

DEBUG:utils:Content-Transfer-Encoding: "7bit"
DEBUG:utils:assuming Content-Transfer-Encoding: 7bit
DEBUG:utils:unquoted header: |[email protected]|
DEBUG:utils:unquoted header: |[OFFICE-TYO 16752] [OFFICE-ALL 11642] 【リド】対策製品定メンテナンスのお知らせ|
DEBUG:utils:Content-Transfer-Encoding: "7bit"
DEBUG:utils:assuming Content-Transfer-Encoding: 7bit
DEBUG:thread:Tbuffer: auto remove unread tag from msg?
DEBUG:thread:Tbuffer: No, cursor on summary
DEBUG:thread:Tbuffer: auto remove unread tag from msg?
DEBUG:thread:Tbuffer: No, cursor on summary
DEBUG:ui:Got key (['G'], [71])
DEBUG:ui:cmdline: 'move last'
DEBUG:ui:thread command string: "move last"
DEBUG:__init__:mode:thread got commandline "move last"
DEBUG:__init__:ARGS: ['move', 'last']
DEBUG:__init__:cmd parms {'movement': ['last']}
DEBUG:thread:last
DEBUG:thread:Tbuffer: auto remove unread tag from msg?
DEBUG:thread:Tbuffer: removing unread
DEBUG:manager:write-out item: ('untag', <function Message.remove_tags.<locals>.myafterwards at 0x7fa99a812378>, 'id:[email protected]', ['unread'])
DEBUG:manager:cmd created
DEBUG:manager:got write lock
DEBUG:manager:got atomic
DEBUG:manager:ended atomic
DEBUG:manager:ended atomic
DEBUG:manager:closed db
DEBUG:manager:<function Message.remove_tags.<locals>.myafterwards at 0x7fa99a812378>
DEBUG:manager:called callback
DEBUG:manager:flush finished
DEBUG:globals:flush complete
DEBUG:thread:Tbuffer: auto remove unread tag from msg?
DEBUG:thread:Tbuffer: No, mid locked for autorm-unread
DEBUG:ui:Got key (['Q'], [81])
DEBUG:ui:cmdline: 'exit'
DEBUG:ui:thread command string: "exit"
DEBUG:__init__:mode:thread got commandline "exit"
DEBUG:__init__:ARGS: ['exit']
DEBUG:__init__:cmd parms {}
DEBUG:globals:flush complete
DEBUG:manager:Worker process 7086 returned error code 1
DEBUG:thread:Tbuffer: auto remove unread tag from msg?
DEBUG:thread:Tbuffer: No, mid locked for autorm-unread

and when I open the mail (I removed some fields)

MIME-Version              1.0
Content-Type              text/plain; charset="UTF-8"
X-Mailer                  Laocoon
Content-Transfer-Encoding 8bit
X-MIME-Autoconverted      from quoted-printable to 8bit by of-oml1504.hop.2iij.net id wBA7uGsj011259
X-BeenThere               [email protected]
X-Mailman-Version         2.1.x
Precedence                list
Subject                    12月 Windowsサーバ
X-BeenThere               [email protected]
X-TUID                    GrHUu1YjIOhJ

              12\u6708 Windows\u30b5\u30fc\u30d0\u30e1\u30f3\u30c6\u30ca\u30f3\u30b9\u60c5\u5831\u306e\u3054\u9023\u7d61

2018\u5e7412\u6708\u5206\u306eWindows\u30b5\u30fc\u30d0\u30e1\u30f3\u30c6\u30ca\u30f3\u30b9(Windows Update)\u30b9\u30b1\u30b8\u30e5\u30fc\u30eb\u3092
\u5468\u77e5\u3055\u305b\u3066\u9802\u304d\u307e\u3059\u3002

======================================================================

\u25a0\u30e1\u30f3\u30c6\u30ca\u30f3\u30b9\u60c5\u5831\u63b2\u8f09\u5148
  Confluence\u306e\u4ee5\u4e0b\u30da\u30fc\u30b8\u306b\u6bce\u6708\u306e\u30e1\u30f3\u30c6\u30ca\u30f3\u30b9\u60c5\u5831\u3092\u63b2\u8f09\u3057\u3066\u304a\u308a\u307e\u3059\u306e\u3067
  \u3054\u78ba\u8a8d\u4e0b\u3055\u3044\u3002

  \u30fbIIJ\u793e\u5185\u30b5\u30fc\u30d0Windows Update\u30b9\u30b1\u30b8\u30e5\u30fc\u30eb
    - https://cf.iij-group.jp/pages/viewpage.action?pageId=36534214

In the thread view, we see "assuming Content-Transfer-Encoding: 7bit" and Japanese titles are displayed correctly. When I open the message, the title is also displayed correctly, but then it seems to assume 8bit encoding. I wonder if that's the problem.

teto avatar Jan 07 '19 08:01 teto

@teto can you provide a (trimmed) mail file that we can add to the tests? compare #1359.

lucc avatar Jan 07 '19 09:01 lucc

looking at the log for another issue (some kind of plugin I develop), I've also noticed these:

DEBUG:utils:Content-Transfer-Encoding: "8bit"
DEBUG:utils:assuming Content-Transfer-Encoding: 8bit
DEBUG:utils:Decoding failure: 'utf-8' codec can't decode byte 0xe9 in position 14535: invalid continuation byte
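
A small Python sketch for narrowing down such a failure: it reports the offset of the first undecodable byte and a few bytes around it. Byte 0xe9 is "é" in ISO-8859-1, so one plausible reading is that a part handled as UTF-8 actually contains Latin-1 text, but that is a guess. `locate_decode_error` and the sample bytes are made up for illustration.

```python
def locate_decode_error(data: bytes, encoding: str = "utf-8", context: int = 8):
    """Return (offset, surrounding bytes) of the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        start = max(err.start - context, 0)
        return err.start, data[start:err.end + context]
    return None


# Made-up sample: 0xe9 followed by a space is an invalid UTF-8 sequence
sample = b"caf\xe9 au lait"
print(locate_decode_error(sample))
print(locate_decode_error("café".encode("utf-8")))
```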

teto avatar Jan 07 '19 09:01 teto

Aren't there any already available in the notmuch test suite (they would be in notmuch-0.28/test/corpora)? My message appears fine with notmuch. I can share the problematic mails via mail/irc, but due to my company's policies and other logistical issues, it's hard to share these mails with the test database.

teto avatar Jan 09 '19 07:01 teto

I did not look into the notmuch test suite. You should not share the original mails. As you can see in #1359 (which I referenced above) I just constructed a minimal mail without any valuable information that shows the problem at hand. Now my request for other mail files that trigger this bug was actually for minimal and fully anonymized mails so that we can put them into the test suite.

lucc avatar Jan 09 '19 16:01 lucc

@lucc I've tried to come up with a minimal test here https://github.com/pazz/alot/pull/1369. I could not really "test" the test, yet I hope it's enough for alot contributors to find the issue.

teto avatar Jan 16 '19 03:01 teto

Another msg with Umlauts, which displays fine with notmuch but shows garbled Umlauts in alot. I grepped for Content headers:

 ~ $ notmuch show --format=raw id:[email protected] | grep -i content
Content-Language: en-US
Content-Type: multipart/mixed; boundary="===============5042742156636891870=="
Content-Type: multipart/alternative;
Content-Language: en-US
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 8bit
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Content-Disposition: inline

varac avatar Mar 20 '19 08:03 varac

@varac: can you share an anonymized but full email to include as test case?

pazz avatar Mar 20 '19 11:03 pazz

Hello, I dug a bit into how this problem shows up for me, and I hope this is not misleading in the context of this issue.

In my case the problem is not with the text/plain part, but rather with the text/html part of a message. When I use the version of alot in Debian 9 (0.3.6), elinks renders the message just fine. When I use the version of alot in the current master branch (d0297605c0ec1c6b65f541d0fd5b69ac5a0f4ded), elinks renders the message with a � character.

This is an example message:

To: [email protected]
From: [email protected]
Date: Thu, 18 Apr 2019 10:33:56 -0300
Subject: Encoding example
Content-type: multipart/mixed; boundary="----------=_1555594448-4177-57454"

This is a multi-part message in MIME format...

------------=_1555594448-4177-57454
Content-Type: multipart/alternative;
 boundary="------------7E9A5FC51A937A20CA2D0044"
Content-Language: pt-BR

This is a multi-part message in MIME format.
--------------7E9A5FC51A937A20CA2D0044
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Açúcar


--------------7E9A5FC51A937A20CA2D0044
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 8bit

<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>Açúcar</p>
  </body>
</html>

--------------7E9A5FC51A937A20CA2D0044--

------------=_1555594448-4177-57454
Content-Type: text/plain; charset="UTF-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit


------------=_1555594448-4177-57454--

This is the ~/.mailcap line I use for text/html:

text/html; elinks -dump -dump-charset utf8 %s; nametemplate=%s.html; copiousoutput;

I'm using elinks from Debian 9 (0.12pre6)

This is how it shows in alot 0.3.6:

[screenshot: alot 0.3.6]

This is how it shows in alot from master branch (d0297605c0ec1c6b65f541d0fd5b69ac5a0f4ded):

[screenshot: alot master]

If there are more details I can provide, please let me know.

If this should be the subject for another issue, please let me know and I'll create a different issue and remove this comment from here.

drebs avatar Apr 18 '19 18:04 drebs

I have the same problem as @drebs but with German umlauts in html parts of a multipart message. It seems the problem occurs in db/utils.py:remove_cte(): https://github.com/pazz/alot/blob/f9689575e771aaa2e6cd4da13b71b88cf8e9e246/alot/db/utils.py#L419

Changing the encoding from raw-unicode-escape to utf-8 fixes the problem for me and the example mail posted in the comment above. https://github.com/pazz/alot/blob/f9689575e771aaa2e6cd4da13b71b88cf8e9e246/alot/db/utils.py#L425

There's a comment stating that

Python's mail library may decode 8bit as raw-unicode-escape, so we need to encode that back to bytes so we can decode it using the correct encoding, or it might not, in which case assume that the str representation we got is correct.

I have no idea whether this is still valid or whether my fix will produce other errors. All tests pass.
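
To make the difference between the two encodings concrete, here is a minimal Python sketch. It isolates only the encode step and is not the body of remove_cte; `reencode` is a hypothetical helper, and the Greek sample string is just an arbitrary example of characters above U+00FF.

```python
def reencode(text: str, encoding: str) -> str:
    """Encode a str to bytes with `encoding`, then decode those bytes as UTF-8."""
    return text.encode(encoding).decode("utf-8", errors="replace")


for sample in ("Açúcar", "Γεια σας"):
    via_raw = reencode(sample, "raw-unicode-escape")
    via_utf8 = reencode(sample, "utf-8")
    print(repr(via_raw), repr(via_utf8))
```

With raw-unicode-escape, characters up to U+00FF become single Latin-1 bytes (invalid as UTF-8, hence the replacement character), and characters above that become literal \uXXXX text. Encoding with utf-8 round-trips both samples unchanged. That only holds when the str already contains the correct characters, which may not be true for every input path.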

sgelb avatar Jul 17 '19 14:07 sgelb

I can confirm that @sgelb's fix also works for my case where Greek letters were rendered as unicode character sequences (the example below would render as \u0393\u03b5\u03b9\u03b1 \u03c3\u03b1\u03c2).

A minimal example of such an email is:

MIME-Version: 1.0
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 8bit

<!DOCTYPE html>
<html lang="en"><body>Γεια σας</body></html>
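
For comparison, this is a sketch of how the same minimal message decodes with Python's standard email package when parsed from bytes. It only shows the stdlib behaviour on this one input and says nothing about how alot reads or renders the part.

```python
import email

RAW = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "<!DOCTYPE html>\n"
    '<html lang="en"><body>Γεια σας</body></html>\n'
).encode("utf-8")

msg = email.message_from_bytes(RAW)

# decode=True returns the body as bytes after undoing the transfer encoding
payload = msg.get_payload(decode=True)

# The charset comes from the Content-Type header; utf-8 is only a fallback here
charset = msg.get_content_charset() or "utf-8"

html = payload.decode(charset)
print(html)
```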

Cybolic avatar Apr 14 '20 18:04 Cybolic

Could someone please check if

  1. this is fixed by, or even related to #1485
  2. @sgelb's proposed change affects (de)cryption for PGP mails

PGP is very fragile when it comes to mail encodings and related standards, and I want to make sure we don't break it here. Cheers!

pazz avatar May 06 '20 09:05 pazz

@pazz answering your questions:

  1. In alot's master branch I can still reproduce the problem. I haven't reviewed #1485, but it doesn't look like it has fixed the problem.
  2. @sgelb's proposal fixes the problem for me, and I'm able to decrypt PGP emails correctly.

meskio avatar Jun 04 '20 09:06 meskio

OK fine. Thanks for keeping this alive. Will someone please send a PR?

pazz avatar Jun 04 '20 15:06 pazz

@sgelb as you made the proposal, do you want to send the PR? If not, I can send it myself. I'll wait a couple of days for @sgelb's reply, and if they don't answer before then, I'll do it this weekend.

meskio avatar Jun 04 '20 15:06 meskio

@sgelb as you did the proposal, do you want to send the PR?

I don't use alot anymore, so I'd prefer if someone else would send the PR.

sgelb avatar Jun 07 '20 14:06 sgelb

I believe this was solved by: 37395809db473fb9a4157084a5b1ea3165914556

I think this issue can be closed.

meskio avatar Aug 18 '20 07:08 meskio