djdt-flamegraph
djdt-flamegraph copied to clipboard
Signals + subprocesses may lead to livelock
When you're profiling a view that uses subprocess.Popen()
, it might livelock:
- the
clone()
syscall (used to implementos.fork()
under the hood) may take 2-3 milliseconds on my machine (as measured bystrace -T
while the flamegraph was off) - a SIGALRM arrives every 1 ms and interrupts the
clone()
-
clone()
returns -EINTR - the signal handler runs
-
clone()
is then restarted - go to step 1
This is very clearly visible in strace output. Externally this is visible as a Django process eating 100% CPU and sometimes not making any progress.
Sometimes the view would actually finish, after an extra 10-20 seconds of this processing, which shows in the flamegraph as a call stack terminating in subprocess.Popen._wait()
.
I don't know if this can be fixed.
Possible mitigations: don't use subprocess.Popen()
from the main thread, use a lower sampling frequency for the SIGALRM timer.
Does raising the sampling frequency to say, 5, remove the issue?
I haven't actually tried that.
Background: I was trying to figure out why precisely django-pipeline caused an 800ms delay in my template render, and I hacked the django-pipeline source to do the subprocess invocation in the main thread instead of using concurrent.futures.ThreadPoolExecutor. That's when I stumbled upon this problem. It's a classical race condition: sometimes clone() finishes before the next SIGALRM arrives, and so I was able to see the flame graph I was interested in after a few retries (that required me to kill and restart the livelocked django server).
This issue is not causing me pain, since I reverted the debugging change to django-pipeline once I was done investigating. I only filed a bug because I thought the problem analysis might be useful to anyone else who happens to encounter this issue.
I expect raising the sampling interval to 5 ms would make this problem harder to hit, but not remove the underlying issue.
Apparently you can ask that signals not interrupt system calls with signal.siginterrupt(sig, False)
.
Source: https://news.ycombinator.com/item?id=9243855
Ah, cool thank you. Does it work for you?
Sorry about the lack of updates. It has been working for us and I've been really busy lately. I still have this on my TODO though :smile:
I haven't had a chance to try it out yet.
(Also, oops, I see I promised some smaller PRs doing something about the sampling inaccuracies I mentioned in #1, and I haven't delivered yet. Been busy too :/ What's worse -- I completely forgot the ball was in my court!)