alot
Special characters not displayed correctly
In some mails, special characters (in this case German umlauts) are not displayed correctly in the message view.
E.g. können displays as k�nnen.
In the same terminal, notmuch show --format=raw shows the special characters correctly.
In a mail that displays badly I noticed these Content headers:
Content-Type multipart/mixed; boundary="------------E72523F6326A74524F4B3BE4"
Content-Language de-DE
Another mail that displays fine had these headers:
Content-Type text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding 8bit
Content-Language en-US
Software Versions
- Python version: 3.6.5
- Notmuch version: 0.26
- Alot version: master
Just as a reference, msgs with garbled Umlauts in alot display fine with k9mail on android.
I also receive Japanese messages (using alot master), with the titles displayed perfectly:
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-2022-JP"
Content-Transfer-Encoding: 7bit
Subject: [OFFICE-TYO 15779]
=?iso-2022-jp?b?W09GRklDRS1BTEwgMTA4NzRdIBskQiFaOm42SDMrO08bKEI=?=
=?iso-2022-jp?b?GyRCTyJNbSFbGyhCIFsgGyRCJWElcyVGJUolcyU5PnBKcxsoQiAtIA==?=
=?iso-2022-jp?b?GyRCOERKTBsoQiBdIBskQiVtJTAlJCVzJTIhPCVIJSYlJyUkJTUbKEI=?=
=?iso-2022-jp?b?GyRCITwlUBsoQih0emstbGdzLWVndzAxKRskQiVhJXMlRiVKJXMbKEI=?=
=?iso-2022-jp?b?GyRCJTkkTiQqQ04kaSQ7GyhC?=
while the actual content appears as:
?$B%9%?%C%U3F0L?(B
?$B%0%k!<%W<RFbJs!V?(Bi.X.?$B!W$K?7$7$$5-;v$r7G:\$7$^$7$?!#?(B
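The leading ?$B and trailing ?(B in the garbled lines look like ISO-2022-JP escape sequences whose ESC byte (0x1B) was replaced by a question mark. That reading is an assumption on my side, not something confirmed in this thread. Under that assumption, a minimal Python sketch that restores the ESC bytes of the first garbled line and decodes it with the standard iso2022_jp codec:

```python
# Assumption: each "?" directly before "$B" or "(B" stands in for a lost ESC byte.
garbled = "?$B%9%?%C%U3F0L?(B"

restored = (
    garbled.replace("?$B", "\x1b$B")   # ESC $ B switches to JIS X 0208
           .replace("?(B", "\x1b(B")   # ESC ( B switches back to ASCII
           .encode("ascii")
)

# Decode the restored byte string with Python's built-in ISO-2022-JP codec.
text = restored.decode("iso2022_jp")
print(text)
```

If this prints readable Japanese, the payload itself is intact and only the charset handling at display time went wrong.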
I don't mind sharing such emails with maintainers (I lurk on irc with the same nickname).
Likewise, mails appear just fine on astroid or in notmuch.
@teto could you test this mail on this branch and report back with a logfile?
@varac regarding your original messages: I'm not sure what Content-Language does, but what is relevant for decoding the message's payload is the Content-Transfer-Encoding header.
Your second message has this one set to indicate a UTF-8 encoding; the first one (unless there is a CTE header that you omitted) indicates ASCII. If there are indeed non-ASCII characters in the payload then there is little we can do. Just FYI, what happens in this case is that we use libmagic to guess the correct encoding, fall back to UTF-8, and drop/ignore non-UTF-8 characters.
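A rough sketch of that fallback in Python, to make the behaviour concrete. This is not alot's actual code: decode_with_fallback is a hypothetical helper, and the libmagic guessing step is left out so that only the standard library is needed.

```python
from typing import Optional


def decode_with_fallback(payload: bytes, declared_charset: Optional[str]) -> str:
    """Decode payload with the declared charset, else fall back to lossy UTF-8."""
    try:
        # No declared charset means ASCII is assumed.
        return payload.decode(declared_charset or "ascii")
    except (LookupError, UnicodeDecodeError):
        # Hypothetical fallback: the libmagic guess is skipped in this sketch,
        # so bytes that are not valid UTF-8 are silently dropped.
        return payload.decode("utf-8", errors="ignore")


print(decode_with_fallback("können".encode("utf-8"), "utf-8"))  # können
print(decode_with_fallback(b"k\xf6nnen", None))                 # knnen
```

The second call shows the lossy case: a Latin-1 byte in a part without a usable charset simply disappears.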
@pazz I will definitely do it but maybe sometime next week
Since the mentioned patch got merged on master, I installed master but the bug persists:
DEBUG:utils:Content-Transfer-Encoding: "7bit"
DEBUG:utils:assuming Content-Transfer-Encoding: 7bit
DEBUG:utils:unquoted header: |[email protected]|
DEBUG:utils:unquoted header: |[OFFICE-TYO 16752] [OFFICE-ALL 11642] 【リド】対策製品定メンテナンスのお知らせ|
DEBUG:utils:Content-Transfer-Encoding: "7bit"
DEBUG:utils:assuming Content-Transfer-Encoding: 7bit
DEBUG:thread:Tbuffer: auto remove unread tag from msg?
DEBUG:thread:Tbuffer: No, cursor on summary
DEBUG:thread:Tbuffer: auto remove unread tag from msg?
DEBUG:thread:Tbuffer: No, cursor on summary
DEBUG:ui:Got key (['G'], [71])
DEBUG:ui:cmdline: 'move last'
DEBUG:ui:thread command string: "move last"
DEBUG:__init__:mode:thread got commandline "move last"
DEBUG:__init__:ARGS: ['move', 'last']
DEBUG:__init__:cmd parms {'movement': ['last']}
DEBUG:thread:last
DEBUG:thread:Tbuffer: auto remove unread tag from msg?
DEBUG:thread:Tbuffer: removing unread
DEBUG:manager:write-out item: ('untag', <function Message.remove_tags.<locals>.myafterwards at 0x7fa99a812378>, 'id:[email protected]', ['unread'])
DEBUG:manager:cmd created
DEBUG:manager:got write lock
DEBUG:manager:got atomic
DEBUG:manager:ended atomic
DEBUG:manager:ended atomic
DEBUG:manager:closed db
DEBUG:manager:<function Message.remove_tags.<locals>.myafterwards at 0x7fa99a812378>
DEBUG:manager:called callback
DEBUG:manager:flush finished
DEBUG:globals:flush complete
DEBUG:thread:Tbuffer: auto remove unread tag from msg?
DEBUG:thread:Tbuffer: No, mid locked for autorm-unread
DEBUG:ui:Got key (['Q'], [81])
DEBUG:ui:cmdline: 'exit'
DEBUG:ui:thread command string: "exit"
DEBUG:__init__:mode:thread got commandline "exit"
DEBUG:__init__:ARGS: ['exit']
DEBUG:__init__:cmd parms {}
DEBUG:globals:flush complete
DEBUG:manager:Worker process 7086 returned error code 1
DEBUG:thread:Tbuffer: auto remove unread tag from msg?
DEBUG:thread:Tbuffer: No, mid locked for autorm-unread
And when I open the mail (I removed some fields):
MIME-Version 1.0
Content-Type text/plain; charset="UTF-8"
X-Mailer Laocoon
Content-Transfer-Encoding 8bit
X-MIME-Autoconverted from quoted-printable to 8bit by of-oml1504.hop.2iij.net id wBA7uGsj011259
X-BeenThere [email protected]
X-Mailman-Version 2.1.x
Precedence list
Subject 12月 Windowsサーバ
X-BeenThere [email protected]
X-TUID GrHUu1YjIOhJ
12\u6708 Windows\u30b5\u30fc\u30d0\u30e1\u30f3\u30c6\u30ca\u30f3\u30b9\u60c5\u5831\u306e\u3054\u9023\u7d61
2018\u5e7412\u6708\u5206\u306eWindows\u30b5\u30fc\u30d0\u30e1\u30f3\u30c6\u30ca\u30f3\u30b9(Windows Update)\u30b9\u30b1\u30b8\u30e5\u30fc\u30eb\u3092
\u5468\u77e5\u3055\u305b\u3066\u9802\u304d\u307e\u3059\u3002
======================================================================
\u25a0\u30e1\u30f3\u30c6\u30ca\u30f3\u30b9\u60c5\u5831\u63b2\u8f09\u5148
Confluence\u306e\u4ee5\u4e0b\u30da\u30fc\u30b8\u306b\u6bce\u6708\u306e\u30e1\u30f3\u30c6\u30ca\u30f3\u30b9\u60c5\u5831\u3092\u63b2\u8f09\u3057\u3066\u304a\u308a\u307e\u3059\u306e\u3067
\u3054\u78ba\u8a8d\u4e0b\u3055\u3044\u3002
\u30fbIIJ\u793e\u5185\u30b5\u30fc\u30d0Windows Update\u30b9\u30b1\u30b8\u30e5\u30fc\u30eb
- https://cf.iij-group.jp/pages/viewpage.action?pageId=36534214
In the thread view, we see "assuming Content-Transfer-Encoding: 7bit" and Japanese titles are displayed correctly. When I open the message, the title is also displayed correctly, but then it seems to assume 8bit encoding. I wonder if that's the problem.
@teto can you provide a (trimmed) mail file that we can add to the tests? compare #1359.
Looking at the log for another issue (a plugin I am developing), I also noticed these:
DEBUG:utils:Content-Transfer-Encoding: "8bit"
DEBUG:utils:assuming Content-Transfer-Encoding: 8bit
DEBUG:utils:Decoding failure: 'utf-8' codec can't decode byte 0xe9 in position 14535: invalid continuation byte
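For reference, this kind of error is easy to reproduce in plain Python. The sketch below assumes the offending byte comes from Latin-1 text (0xe9 is é in Latin-1); the log alone does not prove that.

```python
# Assumption: the payload is Latin-1 text that gets decoded as UTF-8.
latin1_bytes = "café au lait".encode("latin-1")  # é becomes the single byte 0xe9

failure = None
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    # 0xe9 starts a multi-byte UTF-8 sequence, but the next byte is a space.
    failure = err
    print(err)

# Decoding with the codec the bytes were written in works fine.
print(latin1_bytes.decode("latin-1"))
```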
Aren't there any already available in the notmuch test suite (they would be in notmuch-0.28/test/corpora)? My message appears fine with notmuch. I can share the problematic mails via mail/irc, but due to my company policies and other logistical issues it's hard to share these mails with the test database.
I did not look into the notmuch test suite. You should not share the original mails. As you can see in #1359 (which I referenced above) I just constructed a minimal mail without any valuable information that shows the problem at hand. Now my request for other mail files that trigger this bug was actually for minimal and fully anonymized mails so that we can put them into the test suite.
@lucc I've tried to come up with a minimal test here https://github.com/pazz/alot/pull/1369. I could not really "test" the test, yet I hope it's enough for alot contributors to find the issue.
Another msg with Umlauts, which displays fine with notmuch but shows garbled Umlauts in alot. I grepped for Content headers:
~ $ notmuch show --format=raw id:[email protected] | grep -i content
Content-Language: en-US
Content-Type: multipart/mixed; boundary="===============5042742156636891870=="
Content-Type: multipart/alternative;
Content-Language: en-US
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 8bit
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Content-Disposition: inline
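Since grep flattens the MIME structure, it can help to list the headers per part instead. A small sketch using only Python's standard email package; summarize_parts and the sample message are made up for illustration:

```python
import email
from email import policy


def summarize_parts(raw: bytes):
    """Return (content type, charset, CTE) for every MIME part of a raw mail."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    summary = []
    for part in msg.walk():
        cte = part.get("Content-Transfer-Encoding")
        summary.append((
            part.get_content_type(),
            part.get_content_charset(),
            str(cte) if cte is not None else None,
        ))
    return summary


# Hypothetical sample message, only here to exercise the helper.
sample = (
    'Content-Type: multipart/alternative; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "Açúcar\n"
    "--b\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "<p>Açúcar</p>\n"
    "--b--\n"
).encode("utf-8")

for row in summarize_parts(sample):
    print(row)
```

The multipart container itself carries no charset or CTE; those only appear on the leaf parts.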
@varac: can you share an anonymized but full email to include as test case?
Hello, I dug a bit into how this problem shows up for me, and I hope this is not misleading in the context of this issue.
In my case the problem is not with the text/plain part, but rather with the text/html part of a message. When I use the version of alot in Debian 9 (0.3.6), elinks renders the message just fine. When I use the version of alot in the current master branch (d0297605c0ec1c6b65f541d0fd5b69ac5a0f4ded), elinks renders the message with a � character.
This is an example message:
To: [email protected]
From: [email protected]
Date: Thu, 18 Apr 2019 10:33:56 -0300
Subject: Encoding example
Content-type: multipart/mixed; boundary="----------=_1555594448-4177-57454"
This is a multi-part message in MIME format...
------------=_1555594448-4177-57454
Content-Type: multipart/alternative;
boundary="------------7E9A5FC51A937A20CA2D0044"
Content-Language: pt-BR
This is a multi-part message in MIME format.
--------------7E9A5FC51A937A20CA2D0044
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Açúcar
--------------7E9A5FC51A937A20CA2D0044
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 8bit
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Açúcar</p>
</body>
</html>
--------------7E9A5FC51A937A20CA2D0044--
------------=_1555594448-4177-57454
Content-Type: text/plain; charset="UTF-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
------------=_1555594448-4177-57454--
This is the ~/.mailcap line I use for text/html:
text/html; elinks -dump -dump-charset utf8 %s; nametemplate=%s.html; copiousoutput;
I'm using elinks from Debian 9 (0.12pre6)
This is how it shows in alot 0.3.6:
This is how it shows in alot from master branch (d0297605c0ec1c6b65f541d0fd5b69ac5a0f4ded):
If there are more details I can provide, please let me know.
If this should be the subject for another issue, please let me know and I'll create a different issue and remove this comment from here.
I have the same problem as @drebs but with German umlauts in HTML parts of a multipart message. It seems the problem occurs in db/utils.py:remove_cte():
https://github.com/pazz/alot/blob/f9689575e771aaa2e6cd4da13b71b88cf8e9e246/alot/db/utils.py#L419
Changing the encoding from raw-unicode-escape to utf-8 fixes the problem for me and for the example mail posted in the comment above.
https://github.com/pazz/alot/blob/f9689575e771aaa2e6cd4da13b71b88cf8e9e246/alot/db/utils.py#L425
There's a comment stating that
Python's mail library may decode 8bit as raw-unicode-escape, so we need to encode that back to bytes so we can decode it using the correct encoding, or it might not, in which case assume that the str representation we got is correct.
I have no idea whether this is still valid or whether my fix will produce other errors. All tests pass.
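To illustrate why the codec choice matters, here is a standalone Python sketch. It is not the code from db/utils.py; reencode is a hypothetical helper, and it assumes the payload has already been decoded to a correct str before being encoded back to bytes.

```python
def reencode(text: str, codec: str) -> str:
    """Encode an already decoded str with codec, then decode the bytes as UTF-8."""
    return text.encode(codec).decode("utf-8", errors="replace")


latin = "Açúcar"     # characters inside the Latin-1 range
greek = "Γεια σας"   # characters outside the Latin-1 range

# raw-unicode-escape emits Latin-1 bytes for ç and ú, which are invalid UTF-8,
# and literal \uXXXX escapes for the Greek letters.
print(reencode(latin, "raw-unicode-escape"))
print(reencode(greek, "raw-unicode-escape"))

# utf-8 round-trips both strings unchanged.
print(reencode(latin, "utf-8"), reencode(greek, "utf-8"))
```

Both symptoms, replacement characters and literal \u escape sequences, come out of the same round trip through raw-unicode-escape, depending on whether the characters fit into Latin-1.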
I can confirm that @sgelb's fix also works for my case where Greek letters were rendered as unicode character sequences (the example below would render as \u0393\u03b5\u03b9\u03b1 \u03c3\u03b1\u03c2).
A minimal example of such an email is:
MIME-Version: 1.0
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 8bit
<!DOCTYPE html>
<html lang="en"><body>Γεια σας</body></html>
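For comparison, a sketch of how this minimal mail decodes with Python's standard email package alone, outside of alot. The raw bytes are rebuilt here by hand (assuming the file is stored as UTF-8), so this only shows what the expected text looks like, not what alot does internally.

```python
import email
from email import policy

# The minimal mail from above, assumed to be stored as UTF-8 bytes on disk.
raw = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "<!DOCTYPE html>\n"
    '<html lang="en"><body>Γεια σας</body></html>\n'
).encode("utf-8")

msg = email.message_from_bytes(raw, policy=policy.default)

# get_payload(decode=True) undoes the transfer encoding and returns bytes;
# the charset from the Content-Type header is then applied explicitly.
charset = msg.get_content_charset() or "ascii"
body = msg.get_payload(decode=True).decode(charset)
print(body)
```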
Could someone please check if
- this is fixed by, or even related to #1485
- @sgelb's proposed change affects (de)cryption for PGP mails
PGP is very fragile when it comes to mail encodings and related standards, and I want to make sure we don't break it here. Cheers!
@pazz answering your questions:
- In alot's master branch I can still reproduce the problem. I haven't reviewed #1485, but it doesn't look like it has fixed the problem.
- @sgelb's proposal fixes the problem for me and I'm able to decrypt PGP emails correctly.
OK fine. Thanks for keeping this alive. Will someone please send a PR?
@sgelb as you did the proposal, do you want to send the PR? If not, I can send it myself. I'll wait a couple of days for @sgelb's reply, and if they don't answer before then I'll do it this weekend.
@sgelb as you did the proposal, do you want to send the PR?
I don't use alot anymore, so I'd prefer if someone else would send the PR.
I believe this was solved by: 37395809db473fb9a4157084a5b1ea3165914556
I think this issue can be closed.