feature: on demand `sudo`-like behavior w/o `sudo` for hanging processes
Splitting off from https://github.com/benfred/py-spy/issues/395: some environments give their users no sudo access (e.g. HPCs), and there is a strong need to be able to dump the trace on demand.
The relevant discussion starts here: https://github.com/benfred/py-spy/issues/395#issuecomment-842582958
So we were discussing whether to set up a signal handler or a key handler (preferred), perhaps on top of py-spy's `top` with a customizable refresh rate - e.g. set it to 1h, so it's basically there but does nothing to interfere with or slow down the main program.
So the proposal is really to add a sort of daemon mode where py-spy does nothing other than poll a key handler, and which removes the sudo limitation.
For example, if I run a SLURM job and a process on some node starts hanging, I want to be able to get a traceback.
As I mentioned earlier, I tried using faulthandler, but I couldn't make it work with a signal handler, and half the time it spits out garbled output when I use its faulthandler.dump_traceback_later feature. So far py-spy outshines all the other tools in its utility - if only it weren't for the sudo requirement.
Thank you!
btw, this whole sudo requirement seems to be a 5.x Linux kernel thing. On one HPC I had no problem attaching w/o sudo, then discovered it was a 4.x kernel!
Attaching with no root privs depends on the ptrace settings (most likely: https://www.kernel.org/doc/Documentation/security/Yama.txt). You should check its value on the different nodes. Basically, a value of 0 will allow you to run py-spy on your own processes (same UID+GID); a value of 1 will only allow py-spy to trace programs that py-spy itself started; 2 will require root; and 3 will deny ptracing altogether.
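For what it's worth, a quick way to check which mode a node is in (assuming Yama is what's enforcing this) is to read `/proc/sys/kernel/yama/ptrace_scope`, e.g.:

```python
# Quick check of the Yama ptrace scope on a node. The /proc path comes from the
# Yama documentation linked above; the labels paraphrase its four modes.
from pathlib import Path

SCOPES = {
    0: "classic: you can trace your own processes (same UID)",
    1: "restricted: only a parent may trace its descendants",
    2: "admin-only: tracing requires CAP_SYS_PTRACE (root)",
    3: "disabled: no process may attach with ptrace",
}

scope = int(Path("/proc/sys/kernel/yama/ptrace_scope").read_text().strip())
print(f"kernel.yama.ptrace_scope = {scope}: {SCOPES.get(scope, 'unknown')}")
```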
I think that in your case, faulthandler is the correct approach. Alternatively, simply attaching a signal handler to e.g. SIGUSR2 and "manually" dumping the traceback of all threads could also work. If the signals are already used by the app, you could start a thread that calls faulthandler.dump_traceback() to a specific file every X minutes.
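Roughly like this (a sketch only - the signal, file name and interval are arbitrary choices to adapt to your setup):

```python
# Minimal sketch of both suggestions above.
import faulthandler
import signal
import sys
import threading
import time

# Option 1: dump all threads' tracebacks whenever the process receives SIGUSR2
# (trigger it from another shell with `kill -USR2 <pid>`).
faulthandler.register(signal.SIGUSR2, file=sys.stderr, all_threads=True)

# Option 2: if the app already uses the signals, dump periodically from a thread.
def periodic_dump(path="faulthandler.log", interval=3600):
    with open(path, "a") as f:
        while True:
            time.sleep(interval)
            faulthandler.dump_traceback(file=f, all_threads=True)
            f.flush()

threading.Thread(target=periodic_dump, daemon=True).start()
```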
The py-spy-based approach, IMO, would be to allow py-spy to be linked as a CPython extension so it can be imported into your Python app; it could then operate on the "same" memory space and avoid the need for ptracing. This could also be useful for self-profiling (but for dumping, I think faulthandler covers whatever you need, and if not, you can implement it in Python with far greater ease than in py-spy). See a related issue in rbspy where they discuss an in-process API for controlling the profiler; this API could then be exposed to your Ruby/Python program. It's also been implemented in py-spy - it just needs to be exposed to Python :)
Thank you for the ptrace setting insight, @Jongy! Now I understand why the sudo was needed and that it had nothing to do with the kernel version!
The problem with faulthandler is that the output I get is garbled half the time (some sort of rot13-like output).
> Thank you for the ptrace setting insight, @Jongy! Now I understand why the sudo was needed and that it had nothing to do with the kernel version!
Sure thing :)
> The problem with faulthandler is that the output I get is garbled half the time (some sort of rot13-like output).
Oh, wow. Well, I'm not too acquainted with faulthandler, but I assume it implements a simple iteration over the threads, printing their traces. It's very easy to implement in Python yourself; see this SO answer for example: https://stackoverflow.com/a/7317379/797390. In your case I'd try a thread that dumps all threads' stacks to a file every set interval, roughly as sketched below.
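A rough sketch of that "dumper thread" idea, modeled on the linked SO answer (the file name and interval are placeholders):

```python
import sys
import threading
import time
import traceback

def dump_all_stacks(path="stack_dumps.log", interval=600):
    """Append the current stack of every thread to `path` every `interval` seconds."""
    while True:
        time.sleep(interval)
        names = {t.ident: t.name for t in threading.enumerate()}
        with open(path, "a") as f:
            f.write(f"\n=== dump at {time.strftime('%Y-%m-%d %H:%M:%S')} ===\n")
            for ident, frame in sys._current_frames().items():
                f.write(f"--- thread {names.get(ident, '?')} ({ident}) ---\n")
                f.write("".join(traceback.format_stack(frame)))

# Run as a daemon thread so it never keeps the interpreter alive on exit.
threading.Thread(target=dump_all_stacks, daemon=True).start()
```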