crystal icon indicating copy to clipboard operation
crystal copied to clipboard

Interpreter: Process#run broken with defunct child process

Open cyangle opened this issue 3 years ago • 20 comments

Start the repl, then run another shell command via Process, then it's frozen forever.

The child process is defunct.

$ cr i
Using compiled compiler at /home/chao/git/personal/crystal/.build/crystal
icr:1:0> status = Process.run("ls")
$ ps aux | grep ls
chao      185966  0.0  0.0      0     0 pts/4    Z+   01:29   0:00 [ls] <defunct>

cyangle avatar Jul 10 '22 01:07 cyangle

With the latest bug fixes, the interpreter could run all kemal tests successfully except the run_spec.cr file which creates new processes with Process.

@asterite @straight-shoota @sdogruyol

$ cr i spec/all_spec.cr 
Using compiled compiler at /home/chao/git/personal/crystal/.build/crystal
....................................................................................................

Finished in 1.58 seconds
100 examples, 0 failures, 0 errors, 0 pending

real    0m3.263s
user    0m3.642s
sys     0m0.303s

cyangle avatar Jul 10 '22 01:07 cyangle

Fork in interpreted mode is broken and I don't understand why yet. I debugged this a lot in the past. This will likely not be fixed any time soon.

asterite avatar Jul 10 '22 09:07 asterite

Hello,

I'm not a unix expert, but I was looking a bit into this and on Darwin 19.6.0 x86_64, status = Process.run("ls") stops at LibC.execvp(command, argv) in Process.try_replace that is executed in the child process started by LibC.fork in Process.fork.

Just as an experiment I tried to comment out the line that executes the command (LibC.execvp(command, argv)) and I get an EINVAL error from the parent and the Process.run() call terminates trivially.

The input of LibC.execvp(command, argv) is:

  • command: "ls"
  • argv: [Pointer(UInt8)@0x116920f6c, Pointer(UInt8).null]

Any idea why LibC.execvp(command, argv) doesn't return?

How would you continue debugging this issue?

potomak avatar Jul 10 '22 18:07 potomak

It's because there's something wrong in the way fork is executed in the interpreter. And it's really hard to debug.

To begin with, we should start finding out why this happens:

pid = Crystal::System::Process.fork(will_exec: true)
if pid
  puts "parent with child: #{pid}"
else
  debugger
  puts "child"
end

The output of running that through crystal i file.cr is:

[warn] kevent: Bad file descriptor
[warn] kevent: Bad file descriptor
[warn] kevent: Bad file descriptor
...

My guess is that we'll need to make the interpreter aware of fork, but I'm not sure.

asterite avatar Jul 10 '22 19:07 asterite

@straight-shoota This is a very complex bug and I think it is worth tracking it in #11555.

cyangle avatar Jul 11 '22 22:07 cyangle

@asterite I have a partial fix for this defunct process issue.

The signal handlers were skipped for interpreted mode.

But it only fixes part of the issue.

Try interpreting spec/std/process_spec.cr with and without this patch.

Somehow, the example code works with release build but not debug build.

diff --git a/src/kernel.cr b/src/kernel.cr
index 9f63d2ffc..0af75e9a7 100644
--- a/src/kernel.cr
+++ b/src/kernel.cr
@@ -523,7 +523,7 @@ end
   end
 {% end %}
 
-{% unless flag?(:interpreted) || flag?(:wasm32) %}
+{% unless flag?(:wasm32) %}
   {% unless flag?(:win32) %}
     # Background loop to cleanup unused fiber stacks.
     spawn(name: "Fiber Clean Loop") do
@@ -536,6 +536,7 @@ end
     Signal.setup_default_handlers
   {% end %}
 
+  {% unless flag?(:interpreted) %}
   # load debug info on start up of the program is executed with CRYSTAL_LOAD_DEBUG_INFO=1
   # this will make debug info available on print_frame that is used by Crystal's segfault handler
   #
@@ -544,6 +545,7 @@ end
   # - Other values will load debug info on demand: when the backtrace of the first exception is generated
   Exception::CallStack.load_debug_info if ENV["CRYSTAL_LOAD_DEBUG_INFO"]? == "1"
   Exception::CallStack.setup_crash_handler
+  {% end %}
 
   {% if flag?(:preview_mt) %}
     Crystal::Scheduler.init_workers

cyangle avatar Jul 12 '22 01:07 cyangle

With a release mode compiler built from my git branch, I could run all tests of kemal and lucky successfully with the interpreter.

@asterite @jwoertink @sdogruyol

$ cr -v
Using compiled compiler at .build/crystal
Crystal 1.5.0 [612e52b91] (2022-07-12)

LLVM: 14.0.6
Default target: x86_64-pc-linux-gnu

OS: Ubuntu 22.04

cyangle avatar Jul 12 '22 04:07 cyangle

Cool!

What happens if you run this code with the interpreter?

loop do
  puts `ls`
end

asterite avatar Jul 12 '22 12:07 asterite

I got different things after a few iterations, I got the enum error most of the times:

Invalid memory access (signal 11) at address 0x8
shift?|535|5|/home/chao/git/personal/crystal/src/deque.crunlock|122|18|/home/chao/git/personal/crystal/src/mutex.crsync|26|7|/home/chao/git/personal/crystal/src/crystal/spin_lock.crunlock|117|5|/home/chao/git/personal/crystal/src/mutex.crwait|334|7|/home/chao/git/personal/crystal/src/signal.crinitialize|11|16|/home/chao/git/personal/crystal/src/crystal/system/unix/process.crnew|10|3|/home/chao/git/personal/crystal/src/crystal/system/unix/process.crinitialize:shell:input:output:error|230|21|/home/chao/git/personal/crystal/src/process.crnew:shell:input:output:error|221|3|/home/chao/git/personal/crystal/src/process.cr`|436|13|/home/chao/git/personal/crystal/src/process.cr.crystal/bugs/loop.cr|2|8|.crystal/bugs/loop.crloop|103|5|/home/chao/git/personal/crystal/src/kernel.cr.crystal/bugs/loop.cr|1|1|.crystal/bugs/loop.cr
In: Hash(Int32, Int32)#fit_in_indices
Node: (1_u32 << @indices_size_pow2) - 1
Arithmetic overflow (OverflowError)
  from src/hash.cr:522:32 in 'find_entry'
  from src/hash.cr:1186:5 in 'solve'
  from src/compiler/crystal/types.cr:1990:13 in 'replace_type_parameters'
  from src/compiler/crystal/types.cr:1976:9 in 'parents'
  from src/compiler/crystal/types.cr:385:7 in 'collect_ancestors'
  from src/indexable.cr:558:11 in 'find_base_generic_instantiation'
  from src/compiler/crystal/semantic/abstract_def_checker.cr:156:22 in 'implements?'
  from src/compiler/crystal/semantic/abstract_def_checker.cr:96:20 in 'implements_with_ancestors?'
  from src/compiler/crystal/semantic/abstract_def_checker.cr:77:15 in 'check_implemented_in_subtypes'
  from src/compiler/crystal/semantic/abstract_def_checker.cr:53:13 in 'check_single'
  from src/int.cr:540:7 in 'run'
  from src/compiler/crystal/progress_tracker.cr:23:20 in 'semantic'
  from src/gc/boehm.cr:129:5 in 'crystal_raise_overflow_compiled_def'
  from src/compiler/crystal/interpreter/interpreter.cr:337:7 in 'interpret'
  from src/compiler/crystal/interpreter/interpreter.cr:235:13 in 'interpret'
  from src/compiler/crystal/interpreter/repl.cr:146:5 in 'interpret_and_exit_on_error'
  from src/gc/boehm.cr:129:5 in 'repl'
  from src/compiler/crystal/command.cr:100:7 in '__crystal_main'
  from src/crystal/main.cr:115:5 in 'main'
  from /lib/x86_64-linux-gnu/libc.so.6 in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
  from /home/chao/git/personal/crystal/.build/crystal in '_start'
  from ???
Unhandled case: enum value outside of defined enum members (Exception)
  from src/compiler/crystal/interpreter/interpreter.cr:337:7 in 'interpret'
  from src/compiler/crystal/interpreter/interpreter.cr:235:13 in 'interpret'
  from src/compiler/crystal/interpreter/repl.cr:146:5 in 'interpret_and_exit_on_error'
  from src/gc/boehm.cr:129:5 in 'repl'
  from src/compiler/crystal/command.cr:100:7 in '__crystal_main'
  from src/crystal/main.cr:115:5 in 'main'
  from /lib/x86_64-linux-gnu/libc.so.6 in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
  from /home/chao/git/personal/crystal/.build/crystal in '_start'
  from ???

Unable to load dwarf information: Unhandled case: enum value outside of defined enum members (Exception)
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /lib/x86_64-linux-gnu/libffi.so.8 in '??'
  from /lib/x86_64-linux-gnu/libffi.so.8 in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in 'llseek'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in 'dl_iterate_phdr'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '??'
  from /home/chao/git/personal/crystal/.build/crystal in '__crystal_main'
  from /home/chao/git/personal/crystal/.build/crystal in 'main'
  from /lib/x86_64-linux-gnu/libc.so.6 in '??'
  from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
  from /home/chao/git/personal/crystal/.build/crystal in '_start'
  from ???

Sometimes it's frozen.

cyangle avatar Jul 12 '22 14:07 cyangle

I spent some time digging into this issue, here's what I found:

  • Process#wait calls Crystal::SignalChildHandler.wait(@pid)
  • The interpreter itself has Crystal::SignalChildHandler setup and processing signals from child processes and I think Crystal::SignalChildHandler is in the compiled binary.
  • The interpreted code has access to a different Crystal::SignalChildHandler which I think is created by the interpreter and managed by the VM.
  • There are two Crystal::SignalChildHandler, the one on binary has default handlers setup while the one on VM does not. That's why child processes created by interpreted code are defunct. Try printing out the hash of Crystal::SignalChildHandler from compiled and interpreted code, they are different.
  • Calling ::Signal.setup_default_handlers at the beginning of the interpreted code to set up handlers on VM partially works but it conflicts with the Crystal::SignalChildHandler from the binary since we have both of them processing signals from child processes thus crashing the interpreter

So we'll need:

  • Ideally the interpreter should only have one Crystal::SignalChildHandler that is shared between the binary and the VM managed code
  • Or only have the one on VM to handle signals from child processes - I have tried skipping setting up default handlers from the compiled binary, not working currently because the binary runs ldd on start up which will be defunct and the process freezes due to waiting for signals. I guess it's coming from shell expansion in link annotations
  • Or maybe you can attach multiple signal handlers for the same process? I'm not an expert on this.

I'm stopping here because I have no idea how to fix or work around this issue.

My findings might be wrong but I hope it could help other people to fix this issue if my findings are proved to be correct.

@asterite @straight-shoota

cyangle avatar Jul 13 '22 23:07 cyangle

It seems like the GC is crashing the interpreter, below code works fine with compiler built from current master branch:

@asterite @straight-shoota , please try it out.

require "gc"
GC.disable
puts "loop.cr: #{Process.pid} Crystal::SignalChildHandler: #{Crystal::SignalChildHandler.hash}"
::Signal.setup_default_handlers
count = 0
loop do
  puts count
  puts `ls`
  count = count + 1
end

System info:

$ cr -v
Using compiled compiler at .build/crystal
Crystal 1.6.0-dev [aab3574bf] (2022-07-13)

LLVM: 14.0.6
Default target: x86_64-pc-linux-gnu

OS: Ubuntu 22.04

cyangle avatar Jul 14 '22 02:07 cyangle

The interpreter could run all tests from spec/std/process_spec.cr successfully after adding ::Signal.setup_default_handlers at the beginning of the file and disabling GC with GC_DONT_GC=1.

GC_DONT_GC=1 bin/crystal i spec/std/process_spec.cr

cyangle avatar Jul 14 '22 05:07 cyangle

@asterite I think the debugger works fine with forking, you can try below code out with the interpreter:

Example code:

require "gc"
GC.disable
puts "loop.cr: #{Process.pid} Crystal::SignalChildHandler: #{Crystal::SignalChildHandler.hash}"
::Signal.setup_default_handlers

pid = Crystal::System::Process.fork(will_exec: true)
if pid
  puts "#{Process.pid}: from parent start"
  puts "#{Process.pid}: from parent stdin #{STDIN}"
  puts "#{Process.pid}: from parent stdout #{STDOUT}"
  puts "#{Process.pid}: child #{Crystal::SignalChildHandler.wait(pid).receive}"
  debugger
  puts "#{Process.pid}: from parent end"
else
  puts "#{Process.pid}: from child start"
  puts "#{Process.pid}: from child stdin #{STDIN}"
  puts "#{Process.pid}: from child stdout #{STDOUT}"
  debugger
  puts "#{Process.pid}: from child end"
end

You need to block to wait in the parent, otherwise parent exits and the stdin and stdout would be closed. Interactive session log:

$ cr i .crystal/bugs/fork.cr 
Using compiled compiler at .build/crystal
loop.cr: 12316 Crystal::SignalChildHandler: 11479924585718654121
12316: from parent start
12376: from child start
12316: from parent stdin #<IO::FileDescriptor:0x7f71cbe90800>
12376: from child stdin #<IO::FileDescriptor:0x7f71cbe90800>
12316: from parent stdout #<IO::FileDescriptor:0x7f71cbe90700>
12376: from child stdout #<IO::FileDescriptor:0x7f71cbe90700>
From: .crystal/bugs/fork.cr:19:8 <Program>#.crystal/bugs/fork.cr:

    14: else
    15:   puts "#{Process.pid}: from child start"
    16:   puts "#{Process.pid}: from child stdin #{STDIN}"
    17:   puts "#{Process.pid}: from child stdout #{STDOUT}"
    18:   debugger
 => 19:   puts "#{Process.pid}: from child end"
    20: end

pry> Process.pid
12376
pry> continue
12376: from child end
12316: child 0
From: .crystal/bugs/fork.cr:13:8 <Program>#.crystal/bugs/fork.cr:

     8:   puts "#{Process.pid}: from parent start"
     9:   puts "#{Process.pid}: from parent stdin #{STDIN}"
    10:   puts "#{Process.pid}: from parent stdout #{STDOUT}"
    11:   puts "#{Process.pid}: child #{Crystal::SignalChildHandler.wait(pid).receive}"
    12:   debugger
 => 13:   puts "#{Process.pid}: from parent end"
    14: else
    15:   puts "#{Process.pid}: from child start"
    16:   puts "#{Process.pid}: from child stdin #{STDIN}"
    17:   puts "#{Process.pid}: from child stdout #{STDOUT}"
    18:   debugger

pry> Process.pid
12316
pry> continue
12316: from parent end

cyangle avatar Jul 14 '22 22:07 cyangle

Thanks for the investigation!

I don't think I'll have time to look at this. But the gist is: if it works with the GC disabled, then it definitely doesn't work. It should work with the GC enabled. If it crashes with the GC on it means we aren't tracking memory properly. The problem with that is that it's really hard to know where we are doing something wrong! So it will require a lot of investigation and narrowing down things.

asterite avatar Jul 15 '22 11:07 asterite

For now, I'll keep using my workaround for this issue.

And here is my current workaround in case someone needs it.

{% if flag?(:interpreted) %}
  GC.disable
  Signal::CHLD.reset
  Crystal::Signal.start_loop
{% end %}

cyangle avatar Jul 17 '22 17:07 cyangle

The problem isn't with fork or maybe it was but this has long been fixed.

What happens is that calling an external process (e.g. Process.run) eventually waits for the process to terminate but we can't use waitpid or another syscall that would block the current thread, so the crystal runtime waits for a SIGCHLD signal to keep other fibers going, and resume the pending fiber when the signal is received.

The problem is that the interpreted code will never receive the signal, because the crystal interpreter will be the one receiving it, but since it doesn't know anything about that child process, it likely discards it... and the interpreted code hangs forever.

I'm not sure how this can be fixed. I'm not sure the interpreted code taking over the signals is a good idea. Maybe the interpreter should forward signals to the interpreted code?

ysbaddaden avatar Dec 04 '23 11:12 ysbaddaden