flux-core
Starting flux job under a debugger fails to hold the job for the debugger to attach
There is a certain handshake sequence that needs to occur between a debugger and flux when trying to debug an MPI job. This is defined in the paper "The MPIR Process Acquisition Interface", which I will attach here. This process was working with an earlier version of flux (commands: 0.31.0, libflux-core: 0.31.0, build-options: +hwloc==2.5.0 +zmq==4.3.4) but fails to hold the job on the current version I'm testing (commands: 0.42.0, libflux-core: 0.42.0, libflux-security: 0.7.0, build-options: +hwloc==2.8.0 +zmq==4.3.4).
As a result, the debugger either fails to attach entirely, because the job is short and has already run, or it does attach to the spawned processes, but after they have been running for some time. Normally one expects the job to be in the MPIR_breakpoint routine which is upstream from the MPI_Init call. This has been tested with the TotalView debugger using the command
totalview -args flux mini run -N 2 -n 4 ./mpi_test_program
If you add the --debug option, things work as expected:
totalview -args flux mini run --debug -N 2 -n 4 ./mpi_test_program
But this should not be needed.
The steps that appear to be taking place as reported are
- Each task calls ptrace (PTRACE_TRACEME)?? before exec()
- parent (the job shell) calls waitpid(2) and waits for tasks to stop (WIFSTOPPED(status))
- parent sends SIGSTOP and then detaches from task (ptrace(PTRACE_DETACH))
I've passed this issue by the author of the attached MPIR Acquisition paper, and he thinks the last step is wrong and should instead be
- parent detaches from task with a SIGSTOP signal: (ptrace(PTRACE_DETACH, pid, 0, SIGSTOP))
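For anyone who wants to reproduce the trace/stop/detach sequence outside of flux, here is a minimal, Linux-only sketch in Python. It is not the job shell's code (that is C, in src/shell/mpir/ptrace.c); it calls ptrace(2) through ctypes, `launch_stopped` is a hypothetical helper name, and the request numbers 0 and 17 are the Linux values of PTRACE_TRACEME and PTRACE_DETACH.

```python
import ctypes
import os
import signal
import sys

# Linux request numbers from <sys/ptrace.h> (assumption: Linux only).
PTRACE_TRACEME = 0
PTRACE_DETACH = 17

libc = ctypes.CDLL(None, use_errno=True)
libc.ptrace.argtypes = [ctypes.c_long, ctypes.c_long,
                        ctypes.c_void_p, ctypes.c_void_p]
libc.ptrace.restype = ctypes.c_long


def launch_stopped(argv):
    """Hypothetical helper: start argv and leave it stopped by SIGSTOP."""
    pid = os.fork()
    if pid == 0:
        # Task side: ask to be traced, then exec. The kernel stops the
        # task with SIGTRAP once the exec succeeds.
        try:
            if libc.ptrace(PTRACE_TRACEME, 0, None, None) < 0:
                os._exit(126)
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)

    # Parent side: wait for the task to stop in exec.
    _, status = os.waitpid(pid, 0)
    if not os.WIFSTOPPED(status):
        raise RuntimeError("task did not stop in exec (status 0x%04x)" % status)

    # Detach and deliver SIGSTOP in the same call, so the task is never
    # runnable between the two steps.
    if libc.ptrace(PTRACE_DETACH, pid, None, int(signal.SIGSTOP)) < 0:
        raise OSError(ctypes.get_errno(), "ptrace(PTRACE_DETACH)")
    return pid
```

After `launch_stopped()` returns, a `waitpid(pid, WUNTRACED)` in the parent should report the task as stopped by SIGSTOP, which is the state a debugger expects to find when it attaches. The sketch assumes the environment permits ptrace; some sandboxes do not.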
Ok, thanks! This is something to try. I verified in my testing it works the same as first sending SIGSTOP then detaching. Do you want to try this patch with Totalview?
diff --git a/src/shell/mpir/ptrace.c b/src/shell/mpir/ptrace.c
index 4d53114ea..cdd35bb21 100644
--- a/src/shell/mpir/ptrace.c
+++ b/src/shell/mpir/ptrace.c
@@ -74,11 +74,9 @@ static int ptrace_stop_task (flux_plugin_t *p,
return shell_log_errno ("waitpid");
shell_trace ("stop_task: waitpid returned status 0x%04x", status);
if (WIFSTOPPED (status)) {
- /* Send SIGSTOP, then detach from process */
- if (kill (pid, SIGSTOP) < 0)
- return shell_log_errno ("debug_trace: kill");
+ /* detach from process with SIGSTOP */
shell_trace ("stop_task: detaching from pid %ld", (long) pid);
- if (ptrace (PTRACE_DETACH, pid, NULL, 0) < 0)
+ if (ptrace (PTRACE_DETACH, pid, NULL, SIGSTOP) < 0)
return shell_log_errno ("debug_trace: ptrace");
return 0;
}
Oh, since totalview works when the --debug option is used, I'm actually doubtful that the change above will make any difference. There must be a problem with the stop-tasks-in-exec job shell option being propagated.
I was able to try
$ totalview -args flux mini run -N 2 -n 4 ./mpi_test_program
on one of our systems and verified that the problem here is that the stop-tasks-in-exec shell option is not being set by flux-mini.py. This is why the remote tasks do not get stopped by PTRACE_TRACEME: the code that does this is never activated.
I added debugging to the python script to print debugged.get_mpir_being_debugged() before the check, and in my test it reports that this value is not set in the process address space:
diff --git a/src/cmd/flux-mini.py b/src/cmd/flux-mini.py
index 08acd85f7..fd2cace14 100755
--- a/src/cmd/flux-mini.py
+++ b/src/cmd/flux-mini.py
@@ -640,6 +640,8 @@ class MiniCmd:
if args.debug_emulate:
debugged.set_mpir_being_debugged(1)
+ print ("mpir_being_debugged = ", debugged.get_mpir_being_debugged(),
+ file=sys.stderr)
if debugged.get_mpir_being_debugged() == 1:
# if stop-tasks-in-exec is present, overwrite
jobspec.setattr_shell_option("stop-tasks-in-exec", json.loads("1"))
$ /usr/global/tools/totalview/release/bin/totalview -args flux mini run sleep 0
mpir_being_debugged = 0
flux-job: cannot debug job that has finished running
So this answers why the tasks are not being stopped. However, I don't know how TotalView attempts to set MPIR_being_debugged in the address space of the process, nor why it isn't working here. I'll try to see what other debugging I can do, but if you have any suggestions please let me know. From flux-mini.py's perspective, it seems as if the debugger has not set this symbol to 1.
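To make the dependency explicit, here is a small stand-alone sketch of the check shown in the diff above. `FakeDebugged` and `shell_options_for` are hypothetical stand-ins (the real code uses the flux.debugged module and jobspec.setattr_shell_option); only the control flow is taken from flux-mini.py.

```python
class FakeDebugged:
    """Hypothetical stand-in for the flux.debugged module."""

    def __init__(self):
        self._flag = 0  # a debugger is expected to poke this to 1

    def set_mpir_being_debugged(self, value):
        self._flag = value

    def get_mpir_being_debugged(self):
        return self._flag


def shell_options_for(debugged, debug_emulate=False):
    """Mirror the check in flux-mini.py using a plain dict of shell options."""
    options = {}
    if debug_emulate:
        # Same effect as the args.debug_emulate branch in the diff.
        debugged.set_mpir_being_debugged(1)
    if debugged.get_mpir_being_debugged() == 1:
        options["stop-tasks-in-exec"] = 1
    return options


# Flag never set by the debugger: the option is absent and tasks run freely.
print(shell_options_for(FakeDebugged()))
# Flag forced to 1: the option is present and tasks are held in exec.
print(shell_options_for(FakeDebugged(), debug_emulate=True))
```

With the flag left at 0, as in the TotalView run above, the option is never added to the jobspec, so nothing downstream knows to stop the tasks.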
Could the problem here be that the way flux mini run works is:
- The flux command driver is executed, which searches for flux-mini or flux-mini.py in FLUX_EXEC_PATH
- The flux executable finds flux-mini.py and invokes python flux-mini.py (this is where MPIR_being_debugged needs to be set to 1)
- flux-mini.py submits the job, then execs flux job attach (this is after the job is submitted, so stop-tasks-in-exec should already be set; step 2 should be the important bit)
Actually, I don't understand how this can work from Python. The MPIR_being_debugged symbol is in the flux.debugged module. The module itself is not loaded until well after python executable's main() during processing of the import statements in flux-mini.py. I wonder if when Totalview tries to set MPIR_being_debugged the symbol does not yet exist?
Presumably this worked at the time Dong implemented it, but I'm afraid I don't know how.
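The timing concern can be illustrated with a pure-Python analogy. This is only a sketch: the real symbol lives in a compiled extension that the dynamic loader maps in at import time, whereas here `fake_debugged` is a hypothetical throwaway module written to a temporary directory, and `symbol_visible` is a made-up helper.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def symbol_visible(module_name, symbol):
    """Return True only if the module is already loaded and has the symbol."""
    module = sys.modules.get(module_name)
    return module is not None and hasattr(module, symbol)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical module standing in for the extension behind flux.debugged.
    Path(tmp, "fake_debugged.py").write_text("MPIR_being_debugged = 0\n")
    sys.path.insert(0, tmp)
    try:
        # The interpreter is already running, but the symbol is not there yet.
        before = symbol_visible("fake_debugged", "MPIR_being_debugged")
        importlib.invalidate_caches()
        importlib.import_module("fake_debugged")
        # Only after the import statement runs does the symbol exist.
        after = symbol_visible("fake_debugged", "MPIR_being_debugged")
    finally:
        sys.path.remove(tmp)

print(before, after)
```

A debugger that looks for the symbol only once, at process start, would be in the "before" state; it has to look again after the library is loaded to find anything to write to.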
I believe TotalView is setting this early on, when it detects this is a parallel starter process. There is a debug flag for TotalView which dumps out info about the process going parallel, and it loops through the check for a parallel process a number of times. I see a difference in the output when run on rzvernal with the 0.31.0 flux vs the check on corona, but the output doesn't actually record when MPIR_being_debugged is set to 1. It does this after checking for symbols that would indicate this is an MPI starter, such as MPIR_breakpoint, and other MPIR symbols.
But I see John D. has already answered a bit more completely, as I would expect ;-) It wouldn't be hard to put in some output at that point to find out exactly when this happens.
PeterT
This turns out to be a TotalView issue after all, or rather an interaction between flux and TotalView that can be corrected on the TotalView side. One of the methods we can use to speed up startup time under TotalView is to set the option
-no_dlopen_always_recalculate. With this set, TotalView does not update everything when a dlopen occurs, which happens a lot in OpenMPI and its various offshoots. The recalculate setting was set to false to improve performance in large code projects. However, this also meant the needed flag was not updated when running under flux. Whether it worked for me but not for Mark depended on where I was running flux and where I was invoking TotalView from. If I invoked it from /usr/global/tools, which had recalculate-on-dlopen set to false, the MPIR_being_debugged flag was not updated to 1. It was working for me on rzvernal not because of an older version of flux, but because I was running TotalView from a spack install, which kept the default of always recalculating on dlopen (true). We can keep some of the speedup by keeping
TV::dlopen_always_recalculate false
and setting
TV::dlopen_recalculate_on_match {*/_flux/_hostlist.so}
That pattern matches one of the shared libraries that contains MPIR_being_debugged, so the flag gets updated and the program halts until the debugger attaches.
No extra work is required from flux unless the libraries where MPIR_being_debugged is being exposed should change.
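If the library location does change, a quick way to sanity-check a candidate pattern is to test it against the library paths involved. The sketch below assumes the pattern is interpreted as a shell-style glob (I have not verified TotalView's exact matching rules), and both paths are made-up examples rather than real install locations.

```python
from fnmatch import fnmatchcase

# Pattern from the TV::dlopen_recalculate_on_match setting above.
PATTERNS = ["*/_flux/_hostlist.so"]


def triggers_recalculate(library_path, patterns=PATTERNS):
    """Return True if a dlopen'ed library path matches any pattern.

    Assumption: shell-style glob semantics, as implemented by fnmatch.
    """
    return any(fnmatchcase(library_path, pattern) for pattern in patterns)


# Hypothetical paths, for illustration only.
flux_lib = "/opt/flux/lib/python3.9/site-packages/_flux/_hostlist.so"
mpi_lib = "/usr/lib64/openmpi/lib/openmpi/mca_btl_tcp.so"

print(triggers_recalculate(flux_lib))
print(triggers_recalculate(mpi_lib))
```

The point of the narrow pattern is that the many dlopen calls made by the MPI library do not match and so keep the startup speedup, while the one flux library that carries the flag still forces a recalculation.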
Thanks @petertea!