Node Failure Handling

Open noizu opened this issue 7 years ago • 1 comments

Pooler will halt OTP startup if one of a group's members is unavailable but the configuration specifies a non-zero number of init workers. (I am running into this in production, with Riak TS nodes periodically crashing due to GCE NVMe local disk instability.)
Depending on the number of active workers, a node failure can cascade and halt pooler and the OTP supervision tree. (I have a cluster doing about a million Riak writes per minute, and saw cascading failures with 2048 connections per node x 6 Riak nodes, duplicated across 5 Elixir servers.)
In general, are there any recommended strategies for handling group member failures gracefully? I could, for example, hook up process listeners and automate pool add/remove, but a mechanism that serves fewer connections from a group if it has a recent high failure rate would be nice, if possible.
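
To make the last idea concrete, here is a rough, language-neutral sketch in Python of what "serve fewer connections when the recent failure rate is high" could look like. It is not pooler's API or behavior. The names `MemberStats`, `WeightedGroup`, `window_seconds`, and `min_weight` are all hypothetical, and it assumes the weighting is applied per group member (one entry per Riak node), which is only one reading of the request.

```python
import random
import time
from collections import deque


class MemberStats:
    """Sliding window of recent outcomes for one group member (hypothetical helper)."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, succeeded))
        self._trim(now)

    def _trim(self, now):
        # Drop outcomes that are older than the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def failure_rate(self, now=None):
        now = time.monotonic() if now is None else now
        self._trim(now)
        if not self.events:
            return 0.0  # no recent data: treat the member as healthy
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)


class WeightedGroup:
    """Picks a member with probability proportional to its recent success rate."""

    def __init__(self, members, window_seconds=60.0, min_weight=0.05, rng=None):
        self.stats = {m: MemberStats(window_seconds) for m in members}
        # A small floor keeps probing a failing member so it can recover.
        self.min_weight = min_weight
        self.rng = rng or random.Random()

    def record(self, member, succeeded, now=None):
        self.stats[member].record(succeeded, now)

    def weights(self, now=None):
        return {
            m: max(self.min_weight, 1.0 - s.failure_rate(now))
            for m, s in self.stats.items()
        }

    def pick(self, now=None):
        w = self.weights(now)
        members = list(w)
        return self.rng.choices(members, weights=[w[m] for m in members], k=1)[0]
```

A caller would call `record(member, succeeded)` after each checkout or write attempt and `pick()` before choosing which member to take a connection from. The floor in `min_weight` is a design choice: without it, a member that failed every recent request would never be tried again until its window expired.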

May 29 '18 06:05 noizu

Not sure I fully understand the problem.

will halt OTP startup if one of a group's members is unavailable but the configuration specifies a non-zero number of init workers

What do you mean by "one of a group's members is unavailable"? Do you mean that start_mfa blocks and does not return for a long time?

Apr 09 '23 00:04 seriyps