crystal
crystal copied to clipboard
Interpreter: Process#run broken with defunct child process
Start the repl, then run another shell command via Process, then it's frozen forever.
The child process is defunct.
$ cr i
Using compiled compiler at /home/chao/git/personal/crystal/.build/crystal
icr:1:0> status = Process.run("ls")
$ ps aux | grep ls
chao 185966 0.0 0.0 0 0 pts/4 Z+ 01:29 0:00 [ls] <defunct>
With the latest bug fixes, the interpreter could run all kemal tests successfully except the run_spec.cr file which creates new processes with Process.
@asterite @straight-shoota @sdogruyol
$ cr i spec/all_spec.cr
Using compiled compiler at /home/chao/git/personal/crystal/.build/crystal
....................................................................................................
Finished in 1.58 seconds
100 examples, 0 failures, 0 errors, 0 pending
real 0m3.263s
user 0m3.642s
sys 0m0.303s
Fork in interpreted mode is broken and I don't understand why yet. I debugged this a lot in the past. This will likely not be fixed any time soon.
Hello,
I'm not a unix expert, but I was looking a bit into this and on Darwin 19.6.0 x86_64, status = Process.run("ls") stops at LibC.execvp(command, argv) in Process.try_replace that is executed in the child process started by LibC.fork in Process.fork.
Just as an experiment I tried to comment out the line that executes the command (LibC.execvp(command, argv)) and I get an EINVAL error from the parent and the Process.run() call terminates trivially.
The input of LibC.execvp(command, argv) is:
command: "ls"argv: [Pointer(UInt8)@0x116920f6c, Pointer(UInt8).null]
Any idea why LibC.execvp(command, argv) doesn't return?
How would you continue debugging this issue?
It's because there's something wrong in the way fork is executed in the interpreter. And it's really hard to debug.
To begin with, we should start finding out why this happens:
pid = Crystal::System::Process.fork(will_exec: true)
if pid
puts "parent with child: #{pid}"
else
debugger
puts "child"
end
The output of running that through crystal i file.cr is:
[warn] kevent: Bad file descriptor
[warn] kevent: Bad file descriptor
[warn] kevent: Bad file descriptor
...
My guess is that we'll need to make the interpreter aware of fork, but I'm not sure.
@straight-shoota This is a very complex bug and I think it is worth tracking it in #11555.
@asterite I have a partial fix for this defunct process issue.
The signal handlers were skipped for interpreted mode.
But it only fixes part of the issue.
Try interpreting spec/std/process_spec.cr with and without this patch.
Somehow, the example code works with release build but not debug build.
diff --git a/src/kernel.cr b/src/kernel.cr
index 9f63d2ffc..0af75e9a7 100644
--- a/src/kernel.cr
+++ b/src/kernel.cr
@@ -523,7 +523,7 @@ end
end
{% end %}
-{% unless flag?(:interpreted) || flag?(:wasm32) %}
+{% unless flag?(:wasm32) %}
{% unless flag?(:win32) %}
# Background loop to cleanup unused fiber stacks.
spawn(name: "Fiber Clean Loop") do
@@ -536,6 +536,7 @@ end
Signal.setup_default_handlers
{% end %}
+ {% unless flag?(:interpreted) %}
# load debug info on start up of the program is executed with CRYSTAL_LOAD_DEBUG_INFO=1
# this will make debug info available on print_frame that is used by Crystal's segfault handler
#
@@ -544,6 +545,7 @@ end
# - Other values will load debug info on demand: when the backtrace of the first exception is generated
Exception::CallStack.load_debug_info if ENV["CRYSTAL_LOAD_DEBUG_INFO"]? == "1"
Exception::CallStack.setup_crash_handler
+ {% end %}
{% if flag?(:preview_mt) %}
Crystal::Scheduler.init_workers
With a release mode compiler built from my git branch, I could run all tests of kemal and lucky successfully with the interpreter.
@asterite @jwoertink @sdogruyol
$ cr -v
Using compiled compiler at .build/crystal
Crystal 1.5.0 [612e52b91] (2022-07-12)
LLVM: 14.0.6
Default target: x86_64-pc-linux-gnu
OS: Ubuntu 22.04
Cool!
What happens if you run this code with the interpreter?
loop do
puts `ls`
end
I got different things after a few iterations, I got the enum error most of the times:
Invalid memory access (signal 11) at address 0x8
shift?|535|5|/home/chao/git/personal/crystal/src/deque.crunlock|122|18|/home/chao/git/personal/crystal/src/mutex.crsync|26|7|/home/chao/git/personal/crystal/src/crystal/spin_lock.crunlock|117|5|/home/chao/git/personal/crystal/src/mutex.crwait|334|7|/home/chao/git/personal/crystal/src/signal.crinitialize|11|16|/home/chao/git/personal/crystal/src/crystal/system/unix/process.crnew|10|3|/home/chao/git/personal/crystal/src/crystal/system/unix/process.crinitialize:shell:input:output:error|230|21|/home/chao/git/personal/crystal/src/process.crnew:shell:input:output:error|221|3|/home/chao/git/personal/crystal/src/process.cr`|436|13|/home/chao/git/personal/crystal/src/process.cr.crystal/bugs/loop.cr|2|8|.crystal/bugs/loop.crloop|103|5|/home/chao/git/personal/crystal/src/kernel.cr.crystal/bugs/loop.cr|1|1|.crystal/bugs/loop.cr
In: Hash(Int32, Int32)#fit_in_indices
Node: (1_u32 << @indices_size_pow2) - 1
Arithmetic overflow (OverflowError)
from src/hash.cr:522:32 in 'find_entry'
from src/hash.cr:1186:5 in 'solve'
from src/compiler/crystal/types.cr:1990:13 in 'replace_type_parameters'
from src/compiler/crystal/types.cr:1976:9 in 'parents'
from src/compiler/crystal/types.cr:385:7 in 'collect_ancestors'
from src/indexable.cr:558:11 in 'find_base_generic_instantiation'
from src/compiler/crystal/semantic/abstract_def_checker.cr:156:22 in 'implements?'
from src/compiler/crystal/semantic/abstract_def_checker.cr:96:20 in 'implements_with_ancestors?'
from src/compiler/crystal/semantic/abstract_def_checker.cr:77:15 in 'check_implemented_in_subtypes'
from src/compiler/crystal/semantic/abstract_def_checker.cr:53:13 in 'check_single'
from src/int.cr:540:7 in 'run'
from src/compiler/crystal/progress_tracker.cr:23:20 in 'semantic'
from src/gc/boehm.cr:129:5 in 'crystal_raise_overflow_compiled_def'
from src/compiler/crystal/interpreter/interpreter.cr:337:7 in 'interpret'
from src/compiler/crystal/interpreter/interpreter.cr:235:13 in 'interpret'
from src/compiler/crystal/interpreter/repl.cr:146:5 in 'interpret_and_exit_on_error'
from src/gc/boehm.cr:129:5 in 'repl'
from src/compiler/crystal/command.cr:100:7 in '__crystal_main'
from src/crystal/main.cr:115:5 in 'main'
from /lib/x86_64-linux-gnu/libc.so.6 in '??'
from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
from /home/chao/git/personal/crystal/.build/crystal in '_start'
from ???
Unhandled case: enum value outside of defined enum members (Exception)
from src/compiler/crystal/interpreter/interpreter.cr:337:7 in 'interpret'
from src/compiler/crystal/interpreter/interpreter.cr:235:13 in 'interpret'
from src/compiler/crystal/interpreter/repl.cr:146:5 in 'interpret_and_exit_on_error'
from src/gc/boehm.cr:129:5 in 'repl'
from src/compiler/crystal/command.cr:100:7 in '__crystal_main'
from src/crystal/main.cr:115:5 in 'main'
from /lib/x86_64-linux-gnu/libc.so.6 in '??'
from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
from /home/chao/git/personal/crystal/.build/crystal in '_start'
from ???
Unable to load dwarf information: Unhandled case: enum value outside of defined enum members (Exception)
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /lib/x86_64-linux-gnu/libffi.so.8 in '??'
from /lib/x86_64-linux-gnu/libffi.so.8 in '??'
from /lib/x86_64-linux-gnu/libc.so.6 in '??'
from /lib/x86_64-linux-gnu/libc.so.6 in 'llseek'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /lib/x86_64-linux-gnu/libc.so.6 in 'dl_iterate_phdr'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '??'
from /home/chao/git/personal/crystal/.build/crystal in '__crystal_main'
from /home/chao/git/personal/crystal/.build/crystal in 'main'
from /lib/x86_64-linux-gnu/libc.so.6 in '??'
from /lib/x86_64-linux-gnu/libc.so.6 in '__libc_start_main'
from /home/chao/git/personal/crystal/.build/crystal in '_start'
from ???
Sometimes it's frozen.
I spent some time digging into this issue, here's what I found:
- Process#wait calls Crystal::SignalChildHandler.wait(@pid)
- The interpreter itself has
Crystal::SignalChildHandlersetup and processing signals from child processes and I thinkCrystal::SignalChildHandleris in the compiled binary. - The interpreted code has access to a different
Crystal::SignalChildHandlerwhich I think is created by the interpreter and managed by the VM. - There are two
Crystal::SignalChildHandler, the one on binary has default handlers setup while the one on VM does not. That's why child processes created by interpreted code are defunct. Try printing out the hash ofCrystal::SignalChildHandlerfrom compiled and interpreted code, they are different. - Calling ::Signal.setup_default_handlers at the beginning of the interpreted code to set up handlers on VM partially works but it conflicts with the
Crystal::SignalChildHandlerfrom the binary since we have both of them processing signals from child processes thus crashing the interpreter
So we'll need:
- Ideally the interpreter should only have one
Crystal::SignalChildHandlerthat is shared between the binary and the VM managed code - Or only have the one on VM to handle signals from child processes - I have tried skipping setting up default handlers from the compiled binary, not working currently because the binary runs
lddon start up which will be defunct and the process freezes due to waiting for signals. I guess it's coming from shell expansion in link annotations - Or maybe you can attach multiple signal handlers for the same process? I'm not an expert on this.
I'm stopping here because I have no idea how to fix or work around this issue.
My findings might be wrong but I hope it could help other people to fix this issue if my findings are proved to be correct.
@asterite @straight-shoota
It seems like the GC is crashing the interpreter, below code works fine with compiler built from current master branch:
@asterite @straight-shoota , please try it out.
require "gc"
GC.disable
puts "loop.cr: #{Process.pid} Crystal::SignalChildHandler: #{Crystal::SignalChildHandler.hash}"
::Signal.setup_default_handlers
count = 0
loop do
puts count
puts `ls`
count = count + 1
end
System info:
$ cr -v
Using compiled compiler at .build/crystal
Crystal 1.6.0-dev [aab3574bf] (2022-07-13)
LLVM: 14.0.6
Default target: x86_64-pc-linux-gnu
OS: Ubuntu 22.04
The interpreter could run all tests from spec/std/process_spec.cr successfully after adding ::Signal.setup_default_handlers at the beginning of the file and disabling GC with GC_DONT_GC=1.
GC_DONT_GC=1 bin/crystal i spec/std/process_spec.cr
@asterite I think the debugger works fine with forking, you can try below code out with the interpreter:
Example code:
require "gc"
GC.disable
puts "loop.cr: #{Process.pid} Crystal::SignalChildHandler: #{Crystal::SignalChildHandler.hash}"
::Signal.setup_default_handlers
pid = Crystal::System::Process.fork(will_exec: true)
if pid
puts "#{Process.pid}: from parent start"
puts "#{Process.pid}: from parent stdin #{STDIN}"
puts "#{Process.pid}: from parent stdout #{STDOUT}"
puts "#{Process.pid}: child #{Crystal::SignalChildHandler.wait(pid).receive}"
debugger
puts "#{Process.pid}: from parent end"
else
puts "#{Process.pid}: from child start"
puts "#{Process.pid}: from child stdin #{STDIN}"
puts "#{Process.pid}: from child stdout #{STDOUT}"
debugger
puts "#{Process.pid}: from child end"
end
You need to block to wait in the parent, otherwise parent exits and the stdin and stdout would be closed. Interactive session log:
$ cr i .crystal/bugs/fork.cr
Using compiled compiler at .build/crystal
loop.cr: 12316 Crystal::SignalChildHandler: 11479924585718654121
12316: from parent start
12376: from child start
12316: from parent stdin #<IO::FileDescriptor:0x7f71cbe90800>
12376: from child stdin #<IO::FileDescriptor:0x7f71cbe90800>
12316: from parent stdout #<IO::FileDescriptor:0x7f71cbe90700>
12376: from child stdout #<IO::FileDescriptor:0x7f71cbe90700>
From: .crystal/bugs/fork.cr:19:8 <Program>#.crystal/bugs/fork.cr:
14: else
15: puts "#{Process.pid}: from child start"
16: puts "#{Process.pid}: from child stdin #{STDIN}"
17: puts "#{Process.pid}: from child stdout #{STDOUT}"
18: debugger
=> 19: puts "#{Process.pid}: from child end"
20: end
pry> Process.pid
12376
pry> continue
12376: from child end
12316: child 0
From: .crystal/bugs/fork.cr:13:8 <Program>#.crystal/bugs/fork.cr:
8: puts "#{Process.pid}: from parent start"
9: puts "#{Process.pid}: from parent stdin #{STDIN}"
10: puts "#{Process.pid}: from parent stdout #{STDOUT}"
11: puts "#{Process.pid}: child #{Crystal::SignalChildHandler.wait(pid).receive}"
12: debugger
=> 13: puts "#{Process.pid}: from parent end"
14: else
15: puts "#{Process.pid}: from child start"
16: puts "#{Process.pid}: from child stdin #{STDIN}"
17: puts "#{Process.pid}: from child stdout #{STDOUT}"
18: debugger
pry> Process.pid
12316
pry> continue
12316: from parent end
Thanks for the investigation!
I don't think I'll have time to look at this. But the gist is: if it works with the GC disabled, then it definitely doesn't work. It should work with the GC enabled. If it crashes with the GC on it means we aren't tracking memory properly. The problem with that is that it's really hard to know where we are doing something wrong! So it will require a lot of investigation and narrowing down things.
For now, I'll keep using my workaround for this issue.
And here is my current workaround in case someone needs it.
{% if flag?(:interpreted) %}
GC.disable
Signal::CHLD.reset
Crystal::Signal.start_loop
{% end %}
The problem isn't with fork or maybe it was but this has long been fixed.
What happens is that calling an external process (e.g. Process.run) eventually waits for the process to terminate but we can't use waitpid or another syscall that would block the current thread, so the crystal runtime waits for a SIGCHLD signal to keep other fibers going, and resume the pending fiber when the signal is received.
The problem is that the interpreted code will never receive the signal, because the crystal interpreter will be the one receiving it, but since it doesn't know anything about that child process, it likely discards it... and the interpreted code hangs forever.
I'm not sure how this can be fixed. I'm not sure the interpreted code taking over the signals is a good idea. Maybe the interpreter should forward signals to the interpreted code?