fluentd icon indicating copy to clipboard operation
fluentd copied to clipboard

Behaviour of `IO#set_encoding` has been changed as of Ruby 3.3

Open ashie opened this issue 2 years ago • 8 comments

Describe the bug

2 tests of child_process helper always fail in Ruby 3.3.0-dev after https://redmine.ruby-lang.org/issues/18899 is landed.

===============================================================================================================================================================
Failure: test: can scrub characters without exceptions(ChildProcessTest)
/home/aho/Projects/Fluentd/fluentd/test/plugin_helper/test_child_process.rb:531:in `block (2 levels) in <class:ChildProcessTest>'
     528:       end
     529:       sleep TEST_WAIT_INTERVAL_FOR_BLOCK_RUNNING until m.locked? || ran
     530:       m.lock
  => 531:       assert_equal Encoding.find('utf-8'), str.encoding
     532:       expected = "\xEF\xBF\xBD\xEF\xBF\xBD\x00\xEF\xBF\xBD\xEF\xBF\xBD".force_encoding("utf-8")
     533:       assert_equal expected, str
     534:       @d.stop; @d.shutdown; @d.close; @d.terminate
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:189:in `block in timeout'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:36:in `block in catch'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:36:in `catch'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:36:in `catch'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:198:in `timeout'
/home/aho/Projects/Fluentd/fluentd/test/plugin_helper/test_child_process.rb:520:in `block in <class:ChildProcessTest>'
<#<Encoding:UTF-8>> expected but was
<#<Encoding:ASCII-8BIT>>

diff:
? #<Encoding:UTF  -8   >
?            ASCII  BIT 
?            ???  +++ 
===============================================================================================================================================================
F
===============================================================================================================================================================
Failure: test: can scrub characters without exceptions and replace specified chars(ChildProcessTest)
/home/aho/Projects/Fluentd/fluentd/test/plugin_helper/test_child_process.rb:552:in `block (2 levels) in <class:ChildProcessTest>'
     549:       end
     550:       sleep TEST_WAIT_INTERVAL_FOR_BLOCK_RUNNING until m.locked? || ran
     551:       m.lock
  => 552:       assert_equal Encoding.find('utf-8'), str.encoding
     553:       expected = "??\x00??".force_encoding("utf-8")
     554:       assert_equal expected, str
     555:       @d.stop; @d.shutdown; @d.close; @d.terminate
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:189:in `block in timeout'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:36:in `block in catch'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:36:in `catch'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:36:in `catch'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:198:in `timeout'
/home/aho/Projects/Fluentd/fluentd/test/plugin_helper/test_child_process.rb:541:in `block in <class:ChildProcessTest>'
<#<Encoding:UTF-8>> expected but was
<#<Encoding:ASCII-8BIT>>

diff:
? #<Encoding:UTF  -8   >
?            ASCII  BIT 
?            ???  +++ 
===============================================================================================================================================================

To Reproduce

$ rbenv install 3.3.0-dev
$ rbenv local 3.3.0-dev
$ bundle exec rake test TEST=test/plugin_helper/test_child_process.rb

Expected behavior

These tests shouldn't fail.

Your Environment

- Fluentd version: 8bb38b5c5cb13d2d9446f94708940f83f56a4b6f
- TD Agent version: N/A
- Operating system: Ubuntu 22.04, Windows 10
- Kernel version: 5.15.0-60-generic

Your Configuration

N/A

Your Error Log

Described in `Describe the bug` section.

Additional context

No response

ashie avatar Feb 16 '23 01:02 ashie

Recently, it can cause a timeout.

  • https://github.com/fluent/fluentd/actions/runs/5108944850/jobs/9183231391?pr=4188

daipom avatar May 30 '23 03:05 daipom

After reading following articles, I've got the reason why these tests are failed on Ruby 3.3:

  • https://redmine.ruby-lang.org/issues/18899
  • https://github.com/ruby/ruby/pull/6280

IO#set_encoding ignores the second argument (int_enc) when ASCII-8bit is specified as ext_enc as of Ruby 3.3. I'm not sure but I'm now doubting reasonability of this Ruby's change because the documentation about Encoding Options implies possibility of transcoding ASCII-8BIT -> UTF-8 with invalid: :replace or undef: :replace. I'll open a new issue at https://redmine.ruby-lang.org/issues then ask about it. I'll describe the detail of this issue at there.

On the other hand, I'm now doubting the specification of Fluent::PluginHelper::ChildProcess#child_process_execute too. Although Fluentd's default internal & external encoding are ASCII-8BIT (workers are executed by ruby -Eascii-8bit:ascii-8bit), Fluent::PluginHelper::ChildProcess#child_process_execute uses UTF-8 as default internal encoding. They seem inconsistent. It would be better that Fluentd doesn't modify incoming data implicitly by default. How to modify data should be determined by each plugins & users.

So I'm not sure how to resolve this issue for now. I'm considering to mark these tests as pending until these problems are solved.

ashie avatar Jan 10 '24 02:01 ashie

Thanks!

I'm considering to mark these tests as pending until these problems are solved.

I agree.

daipom avatar Jan 10 '24 02:01 daipom

After further investigation, I found that probably it won't work as expected even on Ruby 3.2 or former.

test-set-conding.rb:

#!/usr/bin/env ruby
STDIN.set_encoding('ascii-8bit', 'utf-8', invalid: :replace, undef: :replace) 
str = STDIN.read
p str.encoding
p str
p str.force_encoding("ascii-8bit")

Results with the input string foo\xFFbar are following:

Ruby 3.2:

$ ruby -e 'STDOUT.write("foo\xFFbar")' | ruby test-set-conding.rb
#<Encoding:UTF-8>
"foo�bar"
"foo\xEF\xBF\xBDbar"
"foo�bar"

Ruby 3.3

$ ruby -e 'STDOUT.write("foo\xFFbar")' | ruby test-set-conding.rb
#<Encoding:ASCII-8BIT>
"foo\xFFbar"
"foo\xFFbar"
"foo\xFFbar"

Looking at this result alone, it appears to be working as expected on Ruby 3.2.

But when I test it with the input string あ\xFFい, the result seems questionable:

Ruby 3.2

$ ruby -e 'STDOUT.write("あ\xFFい")' | ruby test-set-conding.rb
#<Encoding:UTF-8>
"�������"
"\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD"
"�������"

Ruby 3.3

$ ruby -e 'STDOUT.write("あ\xFFい")' | ruby test-set-conding.rb
#<Encoding:ASCII-8BIT>
"\xE3\x81\x82\xFF\xE3\x81\x84"
"\xE3\x81\x82\xFF\xE3\x81\x84"
"あ\xFFい"

Probably the current Fluentd's test expects あ�い for this byte sequence、but it replaces all bytes to even if the input string includes valid UTF-8 sequence in actual.

ashie avatar Mar 25 '24 06:03 ashie

related issue: #4460

ashie avatar Apr 03 '24 07:04 ashie

Now I think Ruby 3.3's change seems better than before.

But when I test it with the input string あ\xFFい, the result seems questionable:

Probably the current Fluentd's test expects あ�い for this byte sequence、but it replaces all bytes to even if the input string includes valid UTF-8 sequence in actual.

This result on Ruby 3.2 isn't questionable, because :invalid and :undef are for internal_encoding, so that all non-ASCII characters should be replaced. But such conversion doesn't seem make sense in most cases for ASCII-8Bit (it means binary). It would be better not to modifier input data.

ashie avatar Apr 03 '24 08:04 ashie

Now I think Ruby 3.3's change seems better than before.

But when I test it with the input string あ\xFFい, the result seems questionable:

Probably the current Fluentd's test expects あ�い for this byte sequence、but it replaces all bytes to even if the input string includes valid UTF-8 sequence in actual.

This result on Ruby 3.2 isn't questionable, because :invalid and :undef are for internal_encoding, so that all non-ASCII characters should be replaced. But such conversion doesn't seem make sense in most cases for ASCII-8Bit (it means binary). It would be better not to modifier input data.

I see! Yes! I think so too!

daipom avatar Apr 03 '24 08:04 daipom

- internal_encoding
+ external_encoding

ashie avatar Apr 03 '24 08:04 ashie