djdt-flamegraph icon indicating copy to clipboard operation
djdt-flamegraph copied to clipboard

Signals + subprocesses may lead to livelock

Open mgedmin opened this issue 9 years ago • 5 comments

When you're profiling a view that uses subprocess.Popen(), it might livelock:

  • the clone() syscall (used to implement os.fork() under the hood) may take 2-3 milliseconds on my machine (as measured by strace -T while the flamegraph was off)
  • a SIGALRM arrives every 1 ms and interrupts the clone()
  • clone() returns -EINTR
  • the signal handler runs
  • clone() is then restarted
  • go to step 1

This is very clearly visible in strace output. Externally this is visible as a Django process eating 100% CPU and sometimes not making any progress.

Sometimes the view would actually finish, after an extra 10-20 seconds of this processing, which shows in the flamegraph as a call stack terminating in subprocess.Popen._wait().

I don't know if this can be fixed.

Possible mitigations: don't use subprocess.Popen() from the main thread, use a lower sampling frequency for the SIGALRM timer.

mgedmin avatar Dec 02 '15 06:12 mgedmin

Does raising the sampling frequency to say, 5, remove the issue?

blopker avatar Dec 02 '15 21:12 blopker

I haven't actually tried that.

Background: I was trying to figure out why precisely django-pipeline caused an 800ms delay in my template render, and I hacked the django-pipeline source to do the subprocess invocation in the main thread instead of using concurrent.futures.ThreadPoolExecutor. That's when I stumbled upon this problem. It's a classical race condition: sometimes clone() finishes before the next SIGALRM arrives, and so I was able to see the flame graph I was interested in after a few retries (that required me to kill and restart the livelocked django server).

This issue is not causing me pain, since I reverted the debugging change to django-pipeline once I was done investigating. I only filed a bug because I thought the problem analysis might be useful to anyone else who happens to encounter this issue.

I expect raising the sampling interval to 5 ms would make this problem harder to hit, but not remove the underlying issue.

mgedmin avatar Dec 03 '15 07:12 mgedmin

Apparently you can ask that signals not interrupt system calls with signal.siginterrupt(sig, False).

Source: https://news.ycombinator.com/item?id=9243855

mgedmin avatar Jan 26 '16 13:01 mgedmin

Ah, cool thank you. Does it work for you?

Sorry about the lack of updates. It has been working for us and I've been really busy lately. I still have this on my TODO though :smile:

blopker avatar Jan 26 '16 18:01 blopker

I haven't had a chance to try it out yet.

(Also, oops, I see I promised some smaller PRs doing something about the sampling inaccuracies I mentioned in #1, and I haven't delivered yet. Been busy too :/ What's worse -- I completely forgot the ball was in my court!)

mgedmin avatar Jan 27 '16 06:01 mgedmin