
Test suite stuck on `GC.verify_compaction_references` on ppc64le

Open jackorp opened this issue 2 years ago • 8 comments

While trying to execute the test suite of mysql2 gem on Fedora's ppc64le builders with Ruby 3.1, it gets stuck on spec_helper.rb.

Commenting out this line works around it: https://github.com/brianmario/mysql2/blob/e9c662912dc3bd3707e6c7f0c75e591294cffe12/spec/spec_helper.rb#L10

This seems related to https://github.com/ged/ruby-pg/issues/423 and https://bugs.ruby-lang.org/issues/18560 .

jackorp avatar May 06 '22 17:05 jackorp

Example of such a build: https://koji.fedoraproject.org/koji/taskinfo?taskID=86641565 Full log: https://gist.github.com/jackorp/f1a0eb00ba616fe52bea594bc2d18be3

jackorp avatar May 06 '22 17:05 jackorp

While trying to execute the test suite of mysql2 gem on Fedora's ppc64le builders with Ruby 3.1, it gets stuck on spec_helper.rb.

That's weird. According to [1], GC.verify_compaction_references in Ruby 3.1 should raise NotImplementedError on platforms that cannot support GC compaction, such as ppc64le, rather than getting stuck. The patch [2] is applied to Ruby 3.1.

[1] https://bugs.ruby-lang.org/issues/18560#note-1 [2] https://github.com/ruby/ruby/commit/fc832ffbfaf581ff63ef40dc3f4ec5c8ff39aae6
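A minimal sketch of how one might probe which of these behaviours a given Ruby shows. The method name compaction_support and the returned symbols are made up for illustration; only GC.verify_compaction_references and NotImplementedError come from the discussion above. Note that this only distinguishes "works" from "raises"; it cannot detect a hang that shows up later.

```ruby
# Hypothetical helper: reports how this Ruby reacts to the compaction check.
def compaction_support
  # Some Rubies do not define the method at all.
  return :missing unless GC.respond_to?(:verify_compaction_references)

  # Called without arguments here; the test suite passes extra keywords.
  GC.verify_compaction_references
  :supported
rescue NotImplementedError
  # Expected on platforms without GC compaction support, according to [1].
  :not_implemented
end

puts compaction_support
```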

junaruga avatar May 09 '22 09:05 junaruga

Also it should definitely crash. MySQL2 isn't compaction friendly yet, we need to introduce this PR: https://github.com/brianmario/mysql2/pull/1192 and possibly other changes

tenderlove avatar May 10 '22 00:05 tenderlove

Actually I'm totally wrong. We just pin references in mysql2, so everything should work correctly. 🤔

tenderlove avatar May 10 '22 00:05 tenderlove

I was able to request a ppc64le machine and I think I reproduced the issue. TL;DR it seems like a Ruby 3.1 issue with how this method is implemented.

My only guess is that calling GC.verify_compaction_references(double_heap: true, toward: :empty), as this test suite does, sends the GC down code paths it should not take on this platform. It then gets stuck in an infinite loop in the while loop [0] in newobj_slowpath when attempting to allocate new space (so far I have seen only two allocation paths, for String and Array, in the C backtraces I investigated).

As for the issue itself: commenting out the call to GC "fixes" it, but if I rescue the exception instead, Ruby goes into an infinite loop in the GC code around ractors, AFAICT.

Rescuing the exception (as in the following patch) and just going on with the code makes the issue surface in this test suite:

diff --git a/spec/spec_helper.rb b/spec/spec_helper.rb
index 2e86e11..a9ccacf 100644
--- a/spec/spec_helper.rb
+++ b/spec/spec_helper.rb
@@ -7,7 +7,11 @@ DatabaseCredentials = YAML.load_file('spec/configuration.yml')
 if GC.respond_to?(:verify_compaction_references)
   # This method was added in Ruby 3.0.0. Calling it this way asks the GC to
   # move objects around, helping to find object movement bugs.
-  GC.verify_compaction_references(double_heap: true, toward: :empty)
+  begin
+    GC.verify_compaction_references(double_heap: true, toward: :empty)
+  rescue NotImplementedError
+    puts 'compaction not supported (caught exception)'
+  end
 end
 
 RSpec.configure do |config|

GDB Backtrace: https://gist.github.com/jackorp/cc5c4bae9b8f492f5b936cdbd86febdc

[0] The infinite loop occurs in this while loop: https://github.com/ruby/ruby/blob/v3_1_2/gc.c#L2483

jackorp avatar May 10 '22 10:05 jackorp

Thanks for your great investigation! Does the hang also happen with the following command on the ppc64le machine you tested?

$ ruby -e 'GC.verify_compaction_references(double_heap: true, toward: :empty)'

junaruga avatar May 11 '22 12:05 junaruga

Does the hang also happen with the following command on the ppc64le machine you tested?

Unfortunately, I had a lot of trouble reproducing the infinite loop outside of the test suite (maybe the suite loads something that makes it observable).

As a note, the code:

$ ruby -e 'GC.verify_compaction_references(double_heap: true, toward: :empty)'

would fail with a proper exception. Some begin/rescue block around that piece of code is needed to even attempt a reproducer.

I sometimes managed to get it stuck in IRB, but that was very unreliable and mostly random (and running the GC... method there was still required).
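For anyone who wants to keep trying, here is a rough sketch of a harness that runs a candidate reproducer in a child Ruby process and kills it if it does not finish in time, so a hang cannot block the shell or a CI job. The names REPRO_SCRIPT and hangs? are hypothetical, and the allocation loop is only a guess based on the String and Array allocations seen in the backtraces; it is not a confirmed reproducer.

```ruby
require "rbconfig"
require "timeout"

# Hypothetical candidate reproducer: rescue the exception, then allocate
# Strings and Arrays, since the hang was observed during allocation.
REPRO_SCRIPT = <<~'RUBY'
  begin
    GC.verify_compaction_references(double_heap: true, toward: :empty)
  rescue NotImplementedError
    puts 'compaction not supported (caught exception)'
  end
  100_000.times { |i| [i.to_s, [i]] }
RUBY

# Runs `script` in a separate Ruby process. Returns true if the child had to
# be killed after `seconds`, false if it exited on its own (for any reason).
def hangs?(script, seconds)
  pid = Process.spawn(RbConfig.ruby, "-e", script,
                      out: File::NULL, err: File::NULL)
  Timeout.timeout(seconds) { Process.wait(pid) }
  false
rescue Timeout::Error
  # The child is stuck (or just slow): kill it and reap it.
  Process.kill("KILL", pid)
  Process.wait(pid)
  true
end

if __FILE__ == $PROGRAM_NAME
  puts(hangs?(REPRO_SCRIPT, 60) ? "stuck" : "finished")
end
```

The parent process never calls the GC method itself, so it stays responsive even if the child loops forever inside the GC.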

jackorp avatar May 11 '22 12:05 jackorp

Okay. Thanks for the info. If we can find a minimal reproducer, it would be helpful to report it to the Ruby project, so that someone can fix the issue and add a unit test there.

junaruga avatar May 13 '22 13:05 junaruga

Let's close this. The problem was addressed in Ruby itself some time back in https://bugs.ruby-lang.org/issues/18829

AFAICT, this was fixed upstream in newer Rubies (at least 3.2 onward).

JFTR, in Fedora we have backported the patches onto 3.1 via the following commits: https://src.fedoraproject.org/rpms/ruby/c/b7b547379654b3a337010d15914139e158e59acb https://src.fedoraproject.org/rpms/ruby/c/ca94aff023c5779dec1e03094784bdf736beca83

jackorp avatar Oct 26 '23 09:10 jackorp