gproc
gproc copied to clipboard
gproc_pool:claim doesn't return worker when calling process is killed
I've found gproc_pool very flexible vs other pool libraries, but found this behaviour
48> gproc_pool:claim(pool, fun(_, P) -> P end).
{true,<0.185.0>}
49> Pid = spawn_link(fun() -> gproc_pool:claim(pool, fun(_, P) -> timer:sleep(10000), P en
d) end).
<0.223.0>
50> exit(Pid, kill).
** exception exit: killed
51> gproc_pool:claim(pool, fun(_, P) -> P end).
false
69> gproc_pool:active_workers(pool).
[{a,<0.185.0>}]
After killing the process, worker stay busy forever. I look through the code and found that it is expected behaviour. Am I right? Are there any plans to improve it?
It's true that the code doesn't handle the case of a process inside a claim operation gets killed from the outside. Strictly speaking, this is a bug.
Fixing it will cause a general slowdown, but I think I found a way that is still cheap enough. See PR #110. Let me know what you think.
Looks good and simple, can't find disadvantages
Describing it to my son, I came up with some failure scenarios that seem very hard to address.
Basically, the approach is inherently vulnerable to the calling process being murdered while running inside the claim. Starting from the (correct) assumption that the process could be killed after any given call, at least the following scenarios, however unlikely, come to mind:
- Line 509, just after getting the lock, but before spawning the monitoring process. A solution could be to start monitoring before trying the lock, but this seems to forfeit the purpose of the approach.
- Line 551, if the calling process is killed just after killing the monitoring process, but before resetting the counter, the counter will never be reset.
- If lines 551 & 552 are switched, and the calling process resets the counter then dies before killing the monitoring process, the monitor will reset the counter again, possibly after another process has claimed it.
Considering the above, one could either conclude that the claim()
operation is broken and shouldn't be used, or decide to live with the probability that it sometimes breaks, and then decide which types of breakage are least desirable (I'd wager that the counter never getting reset is the most serious).
Alternatively, figure out a different approach that's safe and not so much slower that the function becomes unfit for its purpose.
(This would be a good time to play around with QuickCheck/Pulse ...)
Since it certainly is an improvement over the previous implementation, I decided to merge. We can try to eliminate the (very slight) holes in the algorithm later.
Line 551, if the calling process is killed just after killing the monitoring process, but before resetting the counter, the counter will never be reset.
May be it's better in place of kill + reset_counter send a message to spawned process.
execute_claim(F, K, Pid) ->
Parent = self(),
Mon = spawn(
fun() ->
Ref = erlang:monitor(process, Parent),
receive
{'DOWN', Ref, _, _, _} ->
gproc:reset_counter(K)
exit ->
gproc:reset_counter(K)
end
end),
try begin
Res = F(K, Pid),
{true, Res}
end
after
Pid ! exit
end.
I thought about that, but it complicates things, is not inherently safe, and also introduces significant latency (assuming the monitoring process always does the reset).
It would probably at least be a good idea to switch lines 551 and 552, though. I forgot to do that.