mail
mail copied to clipboard
Mail::Body#to_s uses wrong String encoding
Given an email encoded with a ISO-8859-1 charset, I'd expect the strings coming out of mail methods to be either:
- Encoded with ISO-8859-1
- Converted correctly to UTF-8
However, we see them returned with ASCII-8BIT encodings which doesn't seem right - and which causes problems on subsequent handling of those strings.
How to reproduce
The following script illustrates the issue:
require "mail"
mail = Mail.new('
Content-Type: multipart/alternative; boundary="_000_AM0PR09MB233743E6638C8665A901BF08CA220AM0PR09MB2337eurp_"
--_000_AM0PR09MB233743E6638C8665A901BF08CA220AM0PR09MB2337eurp_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
R=F8dgr=F8d med fl=F8de p=E5!
')
text_part = mail.text_part
charset = text_part.charset
puts "charset: #{charset.inspect}"
body_string = text_part.body.to_s
puts "encoding: #{body_string.encoding.inspect}"
# This triggers a conversion of the string to UTF-8, which then fails.
require "json"
JSON.generate({body: body_string})
Running it outputs
charset: "iso-8859-1"
encoding: #<Encoding:ASCII-8BIT>
Traceback (most recent call last):
2: from repro2.rb:22:in `<main>'
1: from ~/.rvm/gems/ruby-2.6.5/gems/json-2.3.0/lib/json/common.rb:224:in `generate'
~/.rvm/gems/ruby-2.6.5/gems/json-2.3.0/lib/json/common.rb:224:in `generate': "\xF8" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
While the exception comes from inside the JSON gem, I'd risk assessing that the root cause is because body_string.encoding
is #<Encoding:ASCII-8BIT>
and not #<Encoding:ISO-8859-1>
. To verify this we can add a
body_string.force_encoding(charset)
before performing the JSON encoding, which makes the script run without exceptions.
Versions
- mail 2.8.0.edge
- ruby 2.6.5 and 2.7
Possibly related issues
I did look through the existing issues, and while I wasn't able to find any that matches the issue exactly, there are quite a few that seem related. Many are older, though, and/or missing reproduction steps:
- https://github.com/mikel/mail/issues/809
- https://github.com/mikel/mail/issues/1216
- https://github.com/mikel/mail/pull/921
- https://github.com/mikel/mail/issues/902
This was originally reported in https://github.com/mikel/mail/issues/809 but closed based on a misunderstanding of later comments.
Nicely spotted, that is the exact issue we're seeing.
And misunderstandings so easily occur here, when we're talking about encodings, but not those encodings, the other encoding, which is really charset. 🙄 And it's especially hard because you can't visually tell the difference and in so many cases everything still works even though it is wrong.
It does seem, though, that using Mail::Message#decoded
and Mail::Part#decoded
, as @jeremy suggests in https://github.com/mikel/mail/issues/809#issuecomment-301986573, works. Perhaps not entirely as expected as it always returns a UTF-8 encoded String
, but that does satisfy my expectation number 2 above.
That said, it still seems problematic that Mail::Body#decoded
can return strings with a decidedly incorrect encoding, but I wonder if this might be more an issue of improving the docs around #decoded
?
It does seem, though, that using
Mail::Message#decoded
andMail::Part#decoded
, as @jeremy suggests in #809 (comment), works.
Only if the email is multi-part. Otherwise, Mail::Message.parts
returns []
.
Perhaps not entirely as expected as it always returns a UTF-8 encoded
String
, but that does satisfy my expectation number 2 above.That said, it still seems problematic that
Mail::Body#decoded
can return strings with a decidedly incorrect encoding, but I wonder if this might be more an issue of improving the docs around#decoded
?
I agree, it's broken.
>> s = "Delivered-To: [email protected]\r\nFrom: Ben Langfeld <[email protected]>\r\nContent-Type: text/plain;\r\n\tcharset=utf-8\r\nContent-Transfer-Encoding: quoted-printable\r\nMime-Version: 1.0 (Mac OS X Mail 13.4 \\(3608.120.23.2.1\\))\r\nSubject: Test\r\nMessage-Id: <[email protected]>\r\nTo: [email protected]\r\n\r\n=F0=9F=91=8D\r\n\r\nBen=\r\n"
=> "Delivered-To: [email protected]\r\nFrom: Ben Langfeld <[email protected]>\r\nContent-Type: text/plain;\r\n\tcharset=utf-8\r\nContent-Transfer-Encoding: quoted-printable\r\nMime-Version: 1.0 (Mac OS X Mail 13.4 \\(3608.120.23.2.1\\))\r\nSubject: Test\r\nMessage-Id: <[email protected]>\r\nTo: [email protected]\r\n\r\n=F0=9F=91=8D\r\n\r\nBen=\r\n"
>> mail = Mail.read_from_string(s)
=> #<Mail::Message:70158438295940, Multipart: false, Headers: <From: Ben Langfeld <[email protected]>>, <To: [email protected]>, <Message-ID: <[email protected]>>, <Subject: Test>, <Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.120.23.2.1\))>, <Content-Type: text/plain; charset=utf-8>, <Content-Transfer-Encoding: quoted-printable>, <Delivered-To: [email protected]>>
>> mail.body
=> #<Mail::Body:0x00007f9e118bc9d0 @boundary=nil, @preamble=nil, @epilogue=nil, @charset="US-ASCII", @part_sort_order=["text/plain", "text/enriched", "text/html", "multipart/alternative"], @parts=[], @raw_source="=F0=9F=91=8D\r\n\r\nBen=\r\n", @ascii_only=true, @encoding="quoted-printable">
>> mail.body.decoded
=> "\xF0\x9F\x91\x8D\r\n\r\nBen"
>> mail.decoded
=> "👍\r\n\r\nBen"
>> mail.parts
=> []
>> mail.part
=> [#<Mail::Part:70158574799800, Multipart: false, Headers: <Content-Type: text/plain;>>, #<Mail::Part:70158574804860, Multipart: false, Headers: >]
>> mail.part[0].decoded
=> "\xF0\x9F\x91\x8D\r\n\r\nBen"
I'm running this issue too, while processing a message sent into the app from Cloudmailin.
Encoding::UndefinedConversionError: "\xC2" from ASCII-8BIT to UTF-8
So what's the solution?
What do we expect decoded
to return in these cases? How can we fix it so that it does so?
(And why is this 5-year-old (#1216), very serious, app-crashing bug still unresolved?)
Not a proper solution to the underlying problem, but here's the quick workaround I came up with, in case it's helpful to anyone...
html = fix_encoding(@inbound.html_part)
private def fix_encoding(part)
body_str = part.body.decoded
# Why is the encoding sometimes ASCII-8BIT (and sometimes already converted to UTF-8) instead of
# respecting the claimed encoding? Why does trying to encode it as UTF-8 result in:
# Encoding::UndefinedConversionError ("\xE2" from ASCII-8BIT to UTF-8)? We may never know. (See
# https://github.com/mikel/mail/issues/1413, etc.) But here is a workaround.
logger.debug "body_str.encoding: #{body_str.encoding} (part.charset: #{part.charset})"
unless body_str.encoding == Encoding.find('utf-8')
new_encoding =
if (body_str.encode('utf-8') rescue false)
'utf-8'
else
part.charset
end
logger.warn "Using force_encoding to change from #{body_str.encoding} to #{new_encoding} (part.charset: #{part.charset})"
body_str = body_str.force_encoding(new_encoding)
end
body_str
end