Segfault in parallel processing
I have a system that streams a very large newline oriented JSON into a multiple line chunks and then puts those multiline chunks into a SizedQueue which is then processed In separate processes with the parallel gem. Inside those sub processes those lines of json text are parsed with oj and we just noticed that we were getting what looks like segmentation faults inside oj.
This may have been going on a while, since this is deep inside a much larger system, and at the moment I'm not able to generate an exact test case since this is the middle of processing a few lines of a much larger file. I do have ruby segfault traces though. Here is one of them. This is on oj 3.13.11 with ruby 2.7.3. on Ubuntu 18.04.
I'm happy to attempt some experiments if there are any pointers that may be needed or answer any questions. Since this happens fairly infrequently with relation to the data volume, I'm shooting in the dar here and guessing that this probably has to do with a combination of possibly some malformed json + reuse of some C structure + parallel processing with copy on write.
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:456: [BUG] Segmentation fault at 0x0000000000000000
ruby 2.7.3p183 (2021-04-05 revision 6847ee089d) [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0010 p:---- s:0060 e:000059 CFUNC :fork
c:0009 p:0048 s:0056 e:000055 METHOD /ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:456
c:0008 p:0045 s:0044 e:000040 BLOCK /ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:440 [FINISH]
c:0007 p:---- s:0036 e:000035 CFUNC :synchronize
c:0006 p:0009 s:0032 e:000031 METHOD /ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:433
c:0005 p:0060 s:0023 e:000022 BLOCK /ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:404 [FINISH]
c:0004 p:---- s:0016 e:000015 CFUNC :loop
c:0003 p:0034 s:0012 e:000011 BLOCK /ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:398
c:0002 p:0005 s:0006 e:000005 BLOCK /ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:222 [FINISH]
c:0001 p:---- s:0003 e:000002 (none) [FINISH]
-- Ruby level backtrace information ----------------------------------------
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:222:in `block (4 levels) in in_threads'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:398:in `block (2 levels) in work_in_processes'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:398:in `loop'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:404:in `block (3 levels) in work_in_processes'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:433:in `replace_worker'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:433:in `synchronize'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:440:in `block in replace_worker'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:456:in `worker'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:456:in `fork'
-- Machine register context ------------------------------------------------
RIP: 0x00007f113a8e6910 RBP: 0x000055c81250c840 RSP: 0x00007f110a4f0da8
RAX: 0x0000000000000000 RBX: 0x00007f1128f12d90 RCX: 0x00007f1128f12d80
RDX: 0x00007f112bd6e530 RDI: 0x00007f1128f12d90 RSI: 0x0000000000000001
R8: 0x000055c81250ca50 R9: 0x000000549e773d2b R10: 0x0000000000000000
R11: 0x0000000000000000 R12: 0x000055c81253ad88 R13: 0x00007f110a4f0e10
R14: 0x000055c81250deb0 R15: 0x000055c81250c840 EFL: 0x0000000000010246
-- C level backtrace information -------------------------------------------
2022-01-16T04:18:49Z 10036 INFO : Worker 22 start processing 76D258 with 107 events
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_vm_bugreport+0x7ce) [0x7f113aee81de] vm_dump.c:755
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_bug_for_fatal_signal+0xe7) [0x7f113ad078f7] error.c:660
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(sigsegv+0x4b) [0x7f113ae4c06b] signal.c:946
/lib/x86_64-linux-gnu/libc.so.6(0x7f113a88b040) [0x7f113a88b040]
/lib/x86_64-linux-gnu/libc.so.6(malloc_usable_size+0x40) [0x7f113a8e6910]
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(objspace_xfree+0x5) [0x7f113ad2c472] gc.c:9694
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(ruby_sized_xfree) gc.c:10178
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(ruby_xfree) gc.c:10185
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/oj-3.13.11/lib/oj/oj.so(free_doc_cb+0x21) [0x7f112bd6e5b1] fast.c:673
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(run_final+0xb) [0x7f113ad25869] gc.c:3372
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(finalize_list) gc.c:3391
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_objspace_call_finalizer+0x2f0) [0x7f113ad32540] gc.c:3536
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_ec_finalize+0x2d) [0x7f113ad14f6e] eval.c:152
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_ec_cleanup) eval.c:256
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(ruby_stop+0x9) [0x7f113ad15129] eval.c:290
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_f_fork+0x74) [0x7f113adfbd74] process.c:4158
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0xc9) [0x7f113aec484a] vm_insnhelper.c:2514
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_sendish+0x58) [0x7f113aecff6d] vm_insnhelper.c:4023
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_exec_core) insns.def:782
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_vm_exec+0xab) [0x7f113aed629b] vm.c:1920
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_iseq_block_from_c+0x98) [0x7f113aed6f8e] vm.c:1116
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_block_from_c_bh) vm.c:1134
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_yield) vm.c:1179
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_yield_0) vm_eval.c:1227
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_yield_1) vm_eval.c:1233
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_yield) vm_eval.c:1243
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_ensure+0xd4) [0x7f113ad126e4] eval.c:1129
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0xc9) [0x7f113aec484a] vm_insnhelper.c:2514
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_sendish+0x58) [0x7f113aecff6d] vm_insnhelper.c:4023
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_exec_core) insns.def:782
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_vm_exec+0xab) [0x7f113aed629b] vm.c:1920
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_block+0x153) [0x7f113aee260f] vm.c:1044
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_iseq_block_from_c) vm.c:1116
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_block_from_c_bh) vm.c:1134
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_yield) vm.c:1179
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_yield_0) vm_eval.c:1227
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(loop_i) vm_eval.c:1330
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_vrescue2+0xd1) [0x7f113ad121b1] eval.c:990
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_rescue2+0x8a) [0x7f113ad123aa] eval.c:967
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0xc9) [0x7f113aec484a] vm_insnhelper.c:2514
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_sendish+0x58) [0x7f113aecff6d] vm_insnhelper.c:4023
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_exec_core) insns.def:782
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_vm_exec+0xab) [0x7f113aed629b] vm.c:1920
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_iseq_block_from_c+0x183) [0x7f113aeda263] vm.c:1116
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_block_from_c_proc) vm.c:1216
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_invoke_proc) vm.c:1238
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_vm_invoke_proc) vm.c:1259
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(thread_do_start+0x289) [0x7f113ae8fe09] thread.c:692
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(thread_start_func_2+0x257) [0x7f113ae92187] thread.c:740
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_native_cond_initialize+0x0) [0x7f113ae927bb] thread_pthread.c:969
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(register_cached_thread_and_wait) thread_pthread.c:1021
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(thread_start_func_1) thread_pthread.c:976
/lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xdb) [0x7f113a4176db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f113a96d71f]
I'd like to get more information on this. I would be concerned if memory is being shared across multiple forked processes. That could really be a problem if any caching is being used. Something to consider if you are calling Oj in multiple forked applications and want to use caching would be to only require Oj after forking. I suspect even globals such as classes might move or be different in the forked workers.
No response, closing.
Totally reasonable to close 😄 - I did not get back in a reasonable time. I cannot regularly replicate this and haven't had the dedicated time to provide you the feedback you need. If/when I do get back to this I'll file a new issue.
Sounds good, thanks.
@eregon from our conversation at rubyconf - for future reference.
We saw that oj disables and re-enables the GC around parse_json, that's quite a hack 😅 From a quick look RB_GC_GUARD(str) should be used instead, and there is probably no need to memcpy() the string to parse. Might need to pin the string to avoid the char* moving maybe, cc @tenderlove. Creating the data object might trigger GC which could explain the crash seen here (even if the GC is disabled, for instance if malloc() failed or so). Also the data->json field should be assigned in the struct before allocating the object, so it's not NULL if the finalizer or marking runs early
You probably noticed that the default to halt GC is to not stop GC and it is optional. It was added quite a while ago to support libraries that don't guard.
The string is copied because at the time strings were not immutable. Even now depending on the way Ruby is run the string can still be modified unless I have missed some changes that are in some minor version changes.
Good point on the data->json assignment though, I think. Not sure if it can occur but better safe than sorry.
Correction. I think you were referring to the fast.c file. Different story on that one. I'd have to look up the issue on that one but I think it was a user reading a json document and then reading a second one which either modified the string or caused a GC during the copy. I didn't think that could happen since the string passed in was on the call stack but that seemed to fix it. RB_GC_GUARD didn't see to fix it but that was quite a while ago so things may have changed.
I'll spend a little time with fast.c to see if I can cleanup without introducing a crash.
Yes we were looking at fast.c
Regarding the string changing, how about rb_str_dup() (or just calling dup)? That uses copy-on-write and so should be faster as it does avoid the copy unless the string is actually mutated.
rb_str_dup() sound like the right answer. I still need to remove the GC disable and get a failing test so I can try the guard. This weekend I hope.
I was not able to get a failure with the gc_disable commented out. I might just go with the suggested changes and compare the performance difference. If you have a test that fails that you can share that would be helpful.
@ohler55 @eregon I've made 2 branches in my fork that have a failing test case for me, and a solution for me. See #830 for reference. I hope this helps figure out a solution.