Kernel#format does not preserve encoding

Open kirs opened this issue 3 years ago • 8 comments

(discovered this while pairing with @chrisseaton on https://github.com/oracle/truffleruby/pull/2308)

puts RUBY_DESCRIPTION
source = format('%s', 'foobar'.encode('utf-16le'))
puts source.encoding

On MRI:

ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-darwin19]
UTF-16LE

On TruffleRuby:

truffleruby 21.1.0-dev-70fe61f3, like ruby 2.7.2, GraalVM CE JVM [x86_64-darwin]
UTF-8

kirs avatar Apr 07 '21 09:04 kirs

In general, pack and format seem not to understand anything beyond the most basic encodings.

chrisseaton avatar Apr 07 '21 11:04 chrisseaton

I looked into more corner cases on MRI here.

puts RUBY_DESCRIPTION

ascii = 'ascii'.encode('us-ascii')
utf16 = 'utf16'.encode('utf-16le')
win1251 = 'win1251'.encode('windows-1251')

puts "us-ascii + win-1251 = "
puts format('%s %s', ascii, win1251).encoding.inspect

puts "utf16 = "
puts format('%s', utf16).encoding.inspect

puts "ascii + utf-16="
puts format('%s %s', ascii, utf16).encoding.inspect

ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-darwin19]
us-ascii + win-1251 =
#<Encoding:UTF-8>
utf16 =
#<Encoding:UTF-16LE>
ascii + utf-16=
Traceback (most recent call last):
	1: from demo.rb:16:in `<main>'
demo.rb:16:in `format': incompatible character encodings: UTF-8 and UTF-16LE (Encoding::CompatibilityError)

MRI does some kind of encoding math here: ascii + win-1251 results in UTF-8, utf16 alone stays UTF-16LE, but ascii + utf-16 is not compatible.

That's defined in rb_enc_compatible (https://github.com/ruby/ruby/blob/8a4472fb6d2df0f6407cef24df6a038be90d1462/encoding.c#L1172-L1185), which either reports the pair as incompatible or returns whichever of the two encodings is a superset of the other.

I'm guessing that format should apply logic similar to rb_enc_compatible and calculate a result encoding that is compatible with all inputs.
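To make that guess concrete, here is a minimal sketch in plain Ruby. It assumes the result encoding can be modelled by negotiating with Encoding.compatible? each time a piece is appended to an output buffer that starts out empty in the format string's encoding. The helper name negotiate_result_encoding and the idea of passing the pieces as an array are hypothetical, for illustration only; this is not how MRI or TruffleRuby is actually structured.

```ruby
# Hypothetical helper (not part of MRI or TruffleRuby): walks the pieces that
# would be appended to the output buffer, in order, and negotiates an
# encoding with Encoding.compatible? at every step.
def negotiate_result_encoding(format_encoding, pieces)
  # Assumption: the buffer starts empty, in the format string's encoding.
  buffer = String.new(encoding: format_encoding)

  pieces.each do |piece|
    enc = Encoding.compatible?(buffer, piece)
    unless enc
      raise Encoding::CompatibilityError,
            "incompatible character encodings: #{buffer.encoding} and #{piece.encoding}"
    end
    # Re-tag both sides with the negotiated encoding before appending.
    buffer = buffer.force_encoding(enc) << piece.dup.force_encoding(enc)
  end

  buffer.encoding
end

ascii   = 'ascii'.encode('us-ascii')
utf16   = 'utf16'.encode('utf-16le')
win1251 = 'win1251'.encode('windows-1251')

# Pieces for format('%s %s', a, b) are [a, ' ', b]; for format('%s', a) just [a].
puts negotiate_result_encoding(Encoding::UTF_8, [ascii, ' ', win1251]).inspect
puts negotiate_result_encoding(Encoding::UTF_8, [utf16]).inspect

begin
  negotiate_result_encoding(Encoding::UTF_8, [ascii, ' ', utf16])
rescue Encoding::CompatibilityError => e
  puts e.message
end
```

If this model is right, the three calls should line up with the three MRI results above: the buffer being empty when the first piece arrives is what would let a lone UTF-16LE argument win, while a non-empty UTF-8 buffer makes the same argument incompatible.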

kirs avatar Apr 14 '21 10:04 kirs

Yes, that sounds likely, but figuring out when it should do that is often tricky.

Also compare with what JRuby does: if they're correct, their code can sometimes be easier to understand.

chrisseaton avatar Apr 14 '21 11:04 chrisseaton

Related JRuby commit: https://github.com/jruby/jruby/commit/bb90d3b7644316f8ae6b92e02defdf3838854fb5

kirs avatar Apr 16 '21 14:04 kirs

We have NegotiateCompatibleEncodingNode (and a couple of nodes using it) to find an Encoding compatible with two given encodings. That's used for Encoding.compatible? in Ruby, and should behave the same as rb_enc_compatible().
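Since Encoding.compatible? is the Ruby-level entry point, a quick way to compare the negotiation on both implementations is to run it directly on the strings from this issue. This is only an illustrative check, and the labels in the hash are made up for readability.

```ruby
ascii   = 'ascii'.encode('us-ascii')
utf16   = 'utf16'.encode('utf-16le')
win1251 = 'win1251'.encode('windows-1251')

# Encoding.compatible? returns the negotiated Encoding, or nil when the
# two objects cannot be combined.
checks = {
  'ascii vs win1251'       => Encoding.compatible?(ascii, win1251),
  'format string vs utf16' => Encoding.compatible?('%s', utf16),
  'empty string vs utf16'  => Encoding.compatible?('', utf16),
  'UTF-8 vs UTF-16LE'      => Encoding.compatible?(Encoding::UTF_8, Encoding::UTF_16LE)
}

puts RUBY_DESCRIPTION
checks.each { |label, enc| puts "#{label}: #{enc.inspect}" }
```

Any line that differs between MRI and TruffleRuby would point at the negotiation itself rather than at format.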

eregon avatar Apr 16 '21 15:04 eregon

Thanks!

Do you have an idea why PrintfCompiler is using FormatEncoding (which only supports ASCII and UTF)? Should I change that to use the common Encoding?

kirs avatar Apr 16 '21 15:04 kirs

I'm not sure. Yes, using the JCodings Encoding instead would be best.

eregon avatar Apr 16 '21 15:04 eregon

I think it was possibly done to make the format package independent of the rest of TruffleRuby.

chrisseaton avatar Apr 16 '21 16:04 chrisseaton