oj icon indicating copy to clipboard operation
oj copied to clipboard

Segfault in parallel processing

Open copiousfreetime opened this issue 3 years ago • 1 comments

I have a system that streams a very large newline oriented JSON into a multiple line chunks and then puts those multiline chunks into a SizedQueue which is then processed In separate processes with the parallel gem. Inside those sub processes those lines of json text are parsed with oj and we just noticed that we were getting what looks like segmentation faults inside oj.

This may have been going on a while, since this is deep inside a much larger system, and at the moment I'm not able to generate an exact test case since this is the middle of processing a few lines of a much larger file. I do have ruby segfault traces though. Here is one of them. This is on oj 3.13.11 with ruby 2.7.3. on Ubuntu 18.04.

I'm happy to attempt some experiments if there are any pointers that may be needed or answer any questions. Since this happens fairly infrequently with relation to the data volume, I'm shooting in the dar here and guessing that this probably has to do with a combination of possibly some malformed json + reuse of some C structure + parallel processing with copy on write.

/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:456: [BUG] Segmentation fault at 0x0000000000000000
ruby 2.7.3p183 (2021-04-05 revision 6847ee089d) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0010 p:---- s:0060 e:000059 CFUNC  :fork
c:0009 p:0048 s:0056 e:000055 METHOD /ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:456
c:0008 p:0045 s:0044 e:000040 BLOCK  /ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:440 [FINISH]
c:0007 p:---- s:0036 e:000035 CFUNC  :synchronize
c:0006 p:0009 s:0032 e:000031 METHOD /ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:433
c:0005 p:0060 s:0023 e:000022 BLOCK  /ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:404 [FINISH]
c:0004 p:---- s:0016 e:000015 CFUNC  :loop
c:0003 p:0034 s:0012 e:000011 BLOCK  /ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:398
c:0002 p:0005 s:0006 e:000005 BLOCK  /ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:222 [FINISH]
c:0001 p:---- s:0003 e:000002 (none) [FINISH]

-- Ruby level backtrace information ----------------------------------------
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:222:in `block (4 levels) in in_threads'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:398:in `block (2 levels) in work_in_processes'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:398:in `loop'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:404:in `block (3 levels) in work_in_processes'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:433:in `replace_worker'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:433:in `synchronize'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:440:in `block in replace_worker'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:456:in `worker'
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/parallel-1.21.0/lib/parallel.rb:456:in `fork'

-- Machine register context ------------------------------------------------
 RIP: 0x00007f113a8e6910 RBP: 0x000055c81250c840 RSP: 0x00007f110a4f0da8
 RAX: 0x0000000000000000 RBX: 0x00007f1128f12d90 RCX: 0x00007f1128f12d80
 RDX: 0x00007f112bd6e530 RDI: 0x00007f1128f12d90 RSI: 0x0000000000000001
  R8: 0x000055c81250ca50  R9: 0x000000549e773d2b R10: 0x0000000000000000
 R11: 0x0000000000000000 R12: 0x000055c81253ad88 R13: 0x00007f110a4f0e10
 R14: 0x000055c81250deb0 R15: 0x000055c81250c840 EFL: 0x0000000000010246

-- C level backtrace information -------------------------------------------
2022-01-16T04:18:49Z 10036  INFO : Worker 22 start processing 76D258 with 107 events
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_vm_bugreport+0x7ce) [0x7f113aee81de] vm_dump.c:755
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_bug_for_fatal_signal+0xe7) [0x7f113ad078f7] error.c:660
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(sigsegv+0x4b) [0x7f113ae4c06b] signal.c:946
/lib/x86_64-linux-gnu/libc.so.6(0x7f113a88b040) [0x7f113a88b040]
/lib/x86_64-linux-gnu/libc.so.6(malloc_usable_size+0x40) [0x7f113a8e6910]
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(objspace_xfree+0x5) [0x7f113ad2c472] gc.c:9694
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(ruby_sized_xfree) gc.c:10178
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(ruby_xfree) gc.c:10185
/ebs/apps/navpass/navpass-core/shared/bundle/ruby/2.7.0/gems/oj-3.13.11/lib/oj/oj.so(free_doc_cb+0x21) [0x7f112bd6e5b1] fast.c:673
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(run_final+0xb) [0x7f113ad25869] gc.c:3372
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(finalize_list) gc.c:3391
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_objspace_call_finalizer+0x2f0) [0x7f113ad32540] gc.c:3536
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_ec_finalize+0x2d) [0x7f113ad14f6e] eval.c:152
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_ec_cleanup) eval.c:256
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(ruby_stop+0x9) [0x7f113ad15129] eval.c:290
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_f_fork+0x74) [0x7f113adfbd74] process.c:4158
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0xc9) [0x7f113aec484a] vm_insnhelper.c:2514
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_sendish+0x58) [0x7f113aecff6d] vm_insnhelper.c:4023
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_exec_core) insns.def:782
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_vm_exec+0xab) [0x7f113aed629b] vm.c:1920
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_iseq_block_from_c+0x98) [0x7f113aed6f8e] vm.c:1116
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_block_from_c_bh) vm.c:1134
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_yield) vm.c:1179
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_yield_0) vm_eval.c:1227
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_yield_1) vm_eval.c:1233
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_yield) vm_eval.c:1243
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_ensure+0xd4) [0x7f113ad126e4] eval.c:1129
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0xc9) [0x7f113aec484a] vm_insnhelper.c:2514
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_sendish+0x58) [0x7f113aecff6d] vm_insnhelper.c:4023
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_exec_core) insns.def:782
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_vm_exec+0xab) [0x7f113aed629b] vm.c:1920
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_block+0x153) [0x7f113aee260f] vm.c:1044
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_iseq_block_from_c) vm.c:1116
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_block_from_c_bh) vm.c:1134
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_yield) vm.c:1179
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_yield_0) vm_eval.c:1227
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(loop_i) vm_eval.c:1330
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_vrescue2+0xd1) [0x7f113ad121b1] eval.c:990
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_rescue2+0x8a) [0x7f113ad123aa] eval.c:967
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_call_cfunc_with_frame+0xc9) [0x7f113aec484a] vm_insnhelper.c:2514
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_call_cfunc) vm_insnhelper.c:2539
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_sendish+0x58) [0x7f113aecff6d] vm_insnhelper.c:4023
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_exec_core) insns.def:782
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_vm_exec+0xab) [0x7f113aed629b] vm.c:1920
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_iseq_block_from_c+0x183) [0x7f113aeda263] vm.c:1116
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(invoke_block_from_c_proc) vm.c:1216
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(vm_invoke_proc) vm.c:1238
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_vm_invoke_proc) vm.c:1259
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(thread_do_start+0x289) [0x7f113ae8fe09] thread.c:692
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(thread_start_func_2+0x257) [0x7f113ae92187] thread.c:740
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(rb_native_cond_initialize+0x0) [0x7f113ae927bb] thread_pthread.c:969
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(register_cached_thread_and_wait) thread_pthread.c:1021
/opt/rbenv/versions/2.7.3/lib/libruby.so.2.7(thread_start_func_1) thread_pthread.c:976
/lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xdb) [0x7f113a4176db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f113a96d71f]

copiousfreetime avatar Jan 17 '22 18:01 copiousfreetime

I'd like to get more information on this. I would be concerned if memory is being shared across multiple forked processes. That could really be a problem if any caching is being used. Something to consider if you are calling Oj in multiple forked applications and want to use caching would be to only require Oj after forking. I suspect even globals such as classes might move or be different in the forked workers.

ohler55 avatar Feb 08 '22 00:02 ohler55

No response, closing.

ohler55 avatar Sep 15 '22 21:09 ohler55

Totally reasonable to close 😄 - I did not get back in a reasonable time. I cannot regularly replicate this and haven't had the dedicated time to provide you the feedback you need. If/when I do get back to this I'll file a new issue.

copiousfreetime avatar Sep 15 '22 21:09 copiousfreetime

Sounds good, thanks.

ohler55 avatar Sep 15 '22 21:09 ohler55

@eregon from our conversation at rubyconf - for future reference.

copiousfreetime avatar Nov 30 '22 20:11 copiousfreetime

We saw that oj disables and re-enables the GC around parse_json, that's quite a hack 😅 From a quick look RB_GC_GUARD(str) should be used instead, and there is probably no need to memcpy() the string to parse. Might need to pin the string to avoid the char* moving maybe, cc @tenderlove. Creating the data object might trigger GC which could explain the crash seen here (even if the GC is disabled, for instance if malloc() failed or so). Also the data->json field should be assigned in the struct before allocating the object, so it's not NULL if the finalizer or marking runs early

eregon avatar Nov 30 '22 21:11 eregon

You probably noticed that the default to halt GC is to not stop GC and it is optional. It was added quite a while ago to support libraries that don't guard.

The string is copied because at the time strings were not immutable. Even now depending on the way Ruby is run the string can still be modified unless I have missed some changes that are in some minor version changes.

Good point on the data->json assignment though, I think. Not sure if it can occur but better safe than sorry.

ohler55 avatar Nov 30 '22 22:11 ohler55

Correction. I think you were referring to the fast.c file. Different story on that one. I'd have to look up the issue on that one but I think it was a user reading a json document and then reading a second one which either modified the string or caused a GC during the copy. I didn't think that could happen since the string passed in was on the call stack but that seemed to fix it. RB_GC_GUARD didn't see to fix it but that was quite a while ago so things may have changed.

I'll spend a little time with fast.c to see if I can cleanup without introducing a crash.

ohler55 avatar Nov 30 '22 23:11 ohler55

Yes we were looking at fast.c

eregon avatar Dec 01 '22 21:12 eregon

Regarding the string changing, how about rb_str_dup() (or just calling dup)? That uses copy-on-write and so should be faster as it does avoid the copy unless the string is actually mutated.

eregon avatar Dec 01 '22 22:12 eregon

rb_str_dup() sound like the right answer. I still need to remove the GC disable and get a failing test so I can try the guard. This weekend I hope.

ohler55 avatar Dec 01 '22 23:12 ohler55

I was not able to get a failure with the gc_disable commented out. I might just go with the suggested changes and compare the performance difference. If you have a test that fails that you can share that would be helpful.

ohler55 avatar Dec 06 '22 00:12 ohler55

@ohler55 @eregon I've made 2 branches in my fork that have a failing test case for me, and a solution for me. See #830 for reference. I hope this helps figure out a solution.

copiousfreetime avatar Dec 11 '22 06:12 copiousfreetime