mail Mail::Body#to_s uses wrong String encoding

Mail::Body#to_s uses wrong String encoding

Open koppen opened this issue 3 years ago • 6 comments

Given an email encoded with a ISO-8859-1 charset, I'd expect the strings coming out of mail methods to be either:

Encoded with ISO-8859-1
Converted correctly to UTF-8

However, we see them returned with ASCII-8BIT encodings which doesn't seem right - and which causes problems on subsequent handling of those strings.

How to reproduce

The following script illustrates the issue:

require "mail"

mail = Mail.new('
Content-Type: multipart/alternative; boundary="_000_AM0PR09MB233743E6638C8665A901BF08CA220AM0PR09MB2337eurp_"

--_000_AM0PR09MB233743E6638C8665A901BF08CA220AM0PR09MB2337eurp_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

R=F8dgr=F8d med fl=F8de p=E5!
')
text_part = mail.text_part

charset = text_part.charset
puts "charset: #{charset.inspect}"

body_string = text_part.body.to_s
puts "encoding: #{body_string.encoding.inspect}"

# This triggers a conversion of the string to UTF-8, which then fails.
require "json"
JSON.generate({body: body_string})

Running it outputs

charset: "iso-8859-1"
encoding: #<Encoding:ASCII-8BIT>
Traceback (most recent call last):
	2: from repro2.rb:22:in `<main>'
	1: from ~/.rvm/gems/ruby-2.6.5/gems/json-2.3.0/lib/json/common.rb:224:in `generate'
~/.rvm/gems/ruby-2.6.5/gems/json-2.3.0/lib/json/common.rb:224:in `generate': "\xF8" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)

While the exception comes from inside the JSON gem, I'd risk assessing that the root cause is because body_string.encoding is #<Encoding:ASCII-8BIT> and not #<Encoding:ISO-8859-1>. To verify this we can add a

body_string.force_encoding(charset)

before performing the JSON encoding, which makes the script run without exceptions.

Versions

mail 2.8.0.edge
ruby 2.6.5 and 2.7

Possibly related issues

I did look through the existing issues, and while I wasn't able to find any that matches the issue exactly, there are quite a few that seem related. Many are older, though, and/or missing reproduction steps:

https://github.com/mikel/mail/issues/809
https://github.com/mikel/mail/issues/1216
https://github.com/mikel/mail/pull/921
https://github.com/mikel/mail/issues/902

Sep 25 '20 11:09 koppen

This was originally reported in https://github.com/mikel/mail/issues/809 but closed based on a misunderstanding of later comments.

Nov 04 '20 01:11 benlangfeld

Nicely spotted, that is the exact issue we're seeing.

And misunderstandings so easily occur here, when we're talking about encodings, but not those encodings, the other encoding, which is really charset. 🙄 And it's especially hard because you can't visually tell the difference and in so many cases everything still works even though it is wrong.

Nov 04 '20 07:11 koppen

It does seem, though, that using Mail::Message#decoded and Mail::Part#decoded, as @jeremy suggests in https://github.com/mikel/mail/issues/809#issuecomment-301986573, works. Perhaps not entirely as expected as it always returns a UTF-8 encoded String, but that does satisfy my expectation number 2 above.

That said, it still seems problematic that Mail::Body#decoded can return strings with a decidedly incorrect encoding, but I wonder if this might be more an issue of improving the docs around #decoded?

Nov 04 '20 08:11 koppen

It does seem, though, that using Mail::Message#decoded and Mail::Part#decoded, as @jeremy suggests in #809 (comment), works.

Only if the email is multi-part. Otherwise, Mail::Message.parts returns [].

Perhaps not entirely as expected as it always returns a UTF-8 encoded String, but that does satisfy my expectation number 2 above.

That said, it still seems problematic that Mail::Body#decoded can return strings with a decidedly incorrect encoding, but I wonder if this might be more an issue of improving the docs around #decoded?

I agree, it's broken.

>> s = "Delivered-To: [email protected]\r\nFrom: Ben Langfeld <[email protected]>\r\nContent-Type: text/plain;\r\n\tcharset=utf-8\r\nContent-Transfer-Encoding: quoted-printable\r\nMime-Version: 1.0 (Mac OS X Mail 13.4 \\(3608.120.23.2.1\\))\r\nSubject: Test\r\nMessage-Id: <[email protected]>\r\nTo: [email protected]\r\n\r\n=F0=9F=91=8D\r\n\r\nBen=\r\n"
=> "Delivered-To: [email protected]\r\nFrom: Ben Langfeld <[email protected]>\r\nContent-Type: text/plain;\r\n\tcharset=utf-8\r\nContent-Transfer-Encoding: quoted-printable\r\nMime-Version: 1.0 (Mac OS X Mail 13.4 \\(3608.120.23.2.1\\))\r\nSubject: Test\r\nMessage-Id: <[email protected]>\r\nTo: [email protected]\r\n\r\n=F0=9F=91=8D\r\n\r\nBen=\r\n"
>> mail = Mail.read_from_string(s)
=> #<Mail::Message:70158438295940, Multipart: false, Headers: <From: Ben Langfeld <[email protected]>>, <To: [email protected]>, <Message-ID: <[email protected]>>, <Subject: Test>, <Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.120.23.2.1\))>, <Content-Type: text/plain;	charset=utf-8>, <Content-Transfer-Encoding: quoted-printable>, <Delivered-To: [email protected]>>
>> mail.body
=> #<Mail::Body:0x00007f9e118bc9d0 @boundary=nil, @preamble=nil, @epilogue=nil, @charset="US-ASCII", @part_sort_order=["text/plain", "text/enriched", "text/html", "multipart/alternative"], @parts=[], @raw_source="=F0=9F=91=8D\r\n\r\nBen=\r\n", @ascii_only=true, @encoding="quoted-printable">
>> mail.body.decoded
=> "\xF0\x9F\x91\x8D\r\n\r\nBen"
>> mail.decoded
=> "👍\r\n\r\nBen"
>> mail.parts
=> []
>> mail.part
=> [#<Mail::Part:70158574799800, Multipart: false, Headers: <Content-Type: text/plain;>>, #<Mail::Part:70158574804860, Multipart: false, Headers: >]
>> mail.part[0].decoded
=> "\xF0\x9F\x91\x8D\r\n\r\nBen"

Nov 04 '20 11:11 benlangfeld

I'm running this issue too, while processing a message sent into the app from Cloudmailin.

Encoding::UndefinedConversionError: "\xC2" from ASCII-8BIT to UTF-8

So what's the solution?

What do we expect decoded to return in these cases? How can we fix it so that it does so?

(And why is this 5-year-old (#1216), very serious, app-crashing bug still unresolved?)

Jan 13 '23 18:01 TylerRick

Not a proper solution to the underlying problem, but here's the quick workaround I came up with, in case it's helpful to anyone...

  html = fix_encoding(@inbound.html_part)

  private def fix_encoding(part)
    body_str = part.body.decoded
    # Why is the encoding sometimes ASCII-8BIT (and sometimes already converted to UTF-8) instead of
    # respecting the claimed encoding? Why does trying to encode it as UTF-8 result in:
    # Encoding::UndefinedConversionError ("\xE2" from ASCII-8BIT to UTF-8)? We may never know. (See
    # https://github.com/mikel/mail/issues/1413, etc.) But here is a workaround.
    logger.debug "body_str.encoding: #{body_str.encoding} (part.charset: #{part.charset})"
    unless body_str.encoding == Encoding.find('utf-8')
      new_encoding =
        if (body_str.encode('utf-8') rescue false)
          'utf-8'
        else
          part.charset
        end
      logger.warn "Using force_encoding to change from #{body_str.encoding} to #{new_encoding} (part.charset: #{part.charset})"
      body_str = body_str.force_encoding(new_encoding)
    end
    body_str
  end

Jan 24 '23 00:01 TylerRick

mail mail copied to clipboard

Mail::Body#to_s uses wrong String encoding

How to reproduce

Versions

Possibly related issues

mail
mail copied to clipboard