povray
Unix mutex deadlocks due to our signal handler setup.
This issue was broken out from GitHub issue #118. It relates to that issue only in that, with high frame rate animations being hard to pause, I was rapidly hitting 'p' to pause, causing our Unix signal handling code to run more frequently.
After triggering more than half a dozen deadlocks, looking at stack traces and reading up, I believe our code is getting into an inter-thread mutex deadlock due to a code setup similar to that warned against in the following link:
http://www.linuxquestions.org/questions/programming-9/locking-mutex-in-a-signal-handler-function-789218/
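For reference, the usual way around the pattern that link warns about is to keep the handler free of locks and let an ordinary thread do the real work later. The following is only a minimal sketch of that idea, not the code in unixconsole.cpp. The names RecordSignal, TakePendingSignal, InstallHandler and gPendingSignal are hypothetical.

```cpp
#include <atomic>
#include <csignal>

// Hypothetical sketch; none of these names exist in the POV-Ray sources.
namespace sketch {

// Written by the handler, read by a normal thread. A lock-free atomic is
// one of the few things a signal handler may safely touch.
std::atomic<int> gPendingSignal{0};
static_assert(std::atomic<int>::is_always_lock_free,
              "the handler needs a lock-free flag");

extern "C" void RecordSignal(int sig)
{
    // No mutex, no allocation, no I/O: only remember which signal arrived.
    gPendingSignal.store(sig, std::memory_order_relaxed);
}

// Called from an ordinary thread, for example once per event-loop pass.
// Returns the pending signal number, or 0 if none, and clears the flag.
// Code acting on the result may lock mutexes, as it is not in signal context.
int TakePendingSignal()
{
    return gPendingSignal.exchange(0, std::memory_order_relaxed);
}

void InstallHandler(int sig)
{
    std::signal(sig, RecordSignal);
}

} // namespace sketch
```

With this shape, a thread that was interrupted while holding a mutex can never be asked by the handler to take the same mutex again, which is the deadlock described in the link. A single flag only remembers the most recent signal, so a real handler set up would need one flag per signal of interest.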
Screen grab of one such deadlocked event:
Once locked in this way(1), issuing a SIGKILL (perhaps SIGSTOP?) to the POV-Ray process is the only way to exit via outside control. Attaching gdb to the deadlocked process, I have been able to get out of the deadlock by forcing a return from the SignalHandler function in unixconsole.cpp.
(1) The deadlock signatures do not always look the same. Once in a while the SDL screen is black, for example, which makes it similar to an old closed FS issue (http://bugs.povray.org/task/23) where the behaviour was attributed primarily to SDL1.2.
Getting into the deadlock is not easy, which will make verifying any proposed fix difficult. My current plan is to try to dramatically slow down the mutex-locked code I think is tangled in the deadlock so that deadlocks happen more often. After that, I'll play with alternative signal handling setups.
Note. I am not experienced with threads, mutex locks and such. Anyone adept at this stuff and able to offer guidance, please feel free to speak up.
Just noticed this thread. It seems to be somehow related to the hang happening in #117, and the requirement of an animation render may be the reason I am still unable to reproduce the hang on my PC. Is that so?
I don't think there's any connection. Both issues have the buzzword "mutex" in it, but I guess that's where the similarities end. A deadlock is something entirely different than a failure to clean up a mutex class.
@wfpokorny Do you happen to know whether this is still an issue or happens to have been fixed at some point?
The right question is probably whether it's still a hang which can be triggered by banging on the pause / un-pause p key enough...
I believe the structural issue is still there with the SDL1.2 or SDL2.0 (povr, hgpovray38 only) previews.
The hang was always 'hard' to trigger and my attempt to hack code to get something that would hang more reliably, failed.
Aside: I've been using an X11 preview window as the default in the povr branch. To get the X11 preview to work reliably over what Jerome first coded up, I implemented what an X11 crash message suggested as a fix. That fix automatically adds some thread locking around the X11 calls IIRC. I've not seen pause / un-pause hangs with the povr branch - but I only somewhat routinely do this when playing with rtr. Long winded way to say an X11 preview with that fix might be immune to the hangs.
I guess I could quickly try rtr in povr with SDL2.0 while banging on the 'p' key....
Well in 5 minutes of rapid pause / un-pause with a couple different rtr renders at frame rates of 3-4/sec and 20/sec respectively - no hangs - maybe SDL2.0 is better for this hang too...?
3.8.0 still uses SDL1.2 though, where this and other issues are occasionally seen.
Ya know. I say we just close this issue!
We aren't going to try and fix anything like this for v3.8 (SDL1.2).
And v4.0 should move to SDL2.0, if not also adopt an X11 preview option. Until a hang can somewhat reliably be shown in v4.0 with SDL2.0 or X11 this issue isn't an issue - practically.
I agree re v3.8.
As for v4.0 however, I'm not entirely happy with closing this until we actually have fixed it (or at least have reason to believe that we have) by moving to SDL 2.0 and/or X11. So I'll add this to the v4.0 project, if only to remind us not to be happy until we've got SDL 1.2 ditched for good.
FWIW. Worked up a patch for this. In my povr fork it seems to fix (or avoid) the inter-thread deadlocks on pauses. The deadlocking issue was found to be much worse (making debug easier) when I started running real time raytracing (RTR) under each frame of a more traditional animation. (Creating an all POV-Ray simple modeling set up for lathes and such)
The basic deadlock issue is that pause and resume state change requests were both stacking up - and often bleeding into the time after the pure (non-RTR) frame ends, a time when there is a complete shutdown sequence ahead of the next animation frame or program exit.
Two basic changes were made. The first adds some tracking of events and actual state changes to make it less likely that stacked pause-resume-pause-resume state changes build up - new requests are generated only after the tracking state has changed. The second, more important change was the creation of a gPureFrameActive variable and associated mutex, which turns false once the first render thread finds itself with no more blocks to render. Once false, that variable prevents the Linux X11/SDL2 preview window event handling code from running and generating the pause / resume requests which cause the inter-thread deadlocks. I expect other patch approaches are possible.
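The two changes can be illustrated roughly as follows. This is a simplified sketch and not the povr patch itself: only the name gPureFrameActive comes from the description above, while PauseTracker, OnToggleKey, OnStateChanged, BeginPureFrame and EndPureFrame are hypothetical names, and a single mutex is assumed to guard everything.

```cpp
#include <mutex>

// Hypothetical sketch; only gPureFrameActive is named in the patch description.
namespace sketch {

std::mutex gPureFrameMutex;      // guards gPureFrameActive and the tracker fields
bool gPureFrameActive = false;   // true only while a pure frame still has blocks

enum class PauseState { Running, Paused };

// Called when a pure (non-RTR) frame starts rendering.
void BeginPureFrame()
{
    std::lock_guard<std::mutex> lock(gPureFrameMutex);
    gPureFrameActive = true;
}

// Called by the first render thread that finds no more blocks to render.
void EndPureFrame()
{
    std::lock_guard<std::mutex> lock(gPureFrameMutex);
    gPureFrameActive = false;
}

struct PauseTracker
{
    PauseState confirmed = PauseState::Running; // last state actually reached
    bool requestInFlight = false;               // request sent, not yet applied

    // Called from the preview window event handling when 'p' is pressed.
    // Returns true only if a new pause / resume request should be generated.
    bool OnToggleKey()
    {
        std::lock_guard<std::mutex> lock(gPureFrameMutex);
        if (!gPureFrameActive) return false;  // frame is winding down: drop it
        if (requestInFlight) return false;    // earlier request still pending
        requestInFlight = true;
        return true;
    }

    // Called once the renderer has really changed state.
    void OnStateChanged(PauseState s)
    {
        std::lock_guard<std::mutex> lock(gPureFrameMutex);
        confirmed = s;
        requestInFlight = false;
    }
};

} // namespace sketch
```

In this sketch a burst of key presses yields at most one outstanding request, and nothing is generated once the frame has begun its shutdown sequence.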
My povr branch for the involved code is quite a bit different than the official POV-Ray sources at this point, but I expect the basic problem is the same - or nearly so - in official releases. So a fix similar to the approach I took should fly there too.
For a v4.0 (master) branch, I think it would be worth looking anew at how the states, threads and locks are handled. Perhaps better can be done as I think odd thread issues are still possible. There are new C++17 mutex / lock features, for example.
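As one example of the C++17 features, std::scoped_lock takes several mutexes at once using a deadlock-avoidance algorithm, so two threads that name the same mutexes in opposite orders cannot lock each other out. The sketch below is only an illustration. The names gStateMutex, gDisplayMutex, gUpdates, Touch and RunBothOrders are hypothetical and do not correspond to anything in the POV-Ray sources.

```cpp
#include <mutex>
#include <thread>

// Hypothetical sketch of C++17 std::scoped_lock with two mutexes.
namespace sketch {

std::mutex gStateMutex;    // assumed: guards pause / resume state
std::mutex gDisplayMutex;  // assumed: guards the preview window
int gUpdates = 0;          // only modified while both mutexes are held

// Locks both mutexes n times, in whatever order the caller names them.
void Touch(std::mutex& a, std::mutex& b, int n)
{
    for (int i = 0; i < n; ++i) {
        std::scoped_lock lock(a, b);  // acquires both without lock-order deadlock
        ++gUpdates;
    }
}

// Runs two threads that name the mutexes in opposite orders and
// returns the total number of updates performed.
int RunBothOrders(int n)
{
    gUpdates = 0;
    std::thread t1([n] { Touch(gStateMutex, gDisplayMutex, n); });
    std::thread t2([n] { Touch(gDisplayMutex, gStateMutex, n); });
    t1.join();
    t2.join();
    return gUpdates;
}

} // namespace sketch
```

This only addresses lock-ordering deadlocks between threads. It does not make it safe to lock a mutex from inside a signal handler.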