Node Failure Handling
-
Pooler will halt OTP startup if one of a group's members is unavailable but the configuration specifies a non-zero number of init workers. (I'm running into this in production, where Riak TS nodes periodically crash due to GCE NVMe local disk instability.)
-
Depending on the number of active workers, a node failure can cascade and halt pooler and the OTP tree. I have a cluster doing about a million Riak writes per minute, and saw cascading failures with 2048 connections per node x 6 Riak nodes, duplicated across 5 Elixir servers.
-
In general, are there any recommended strategies for handling group member failures gracefully? I could, for example, hook up process listeners and automate pool add/remove, but it would be nice to have some mechanism that serves fewer connections from a group when it has a recent high failure rate, if that is possible.
- I'm using pooler with https://github.com/drewkerrigan/riak-elixir-client
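For illustration, a minimal sketch of the kind of group configuration being described, following the pool format shown in the riak-elixir-client README. The pool names, hosts, port, and counts are hypothetical placeholders, not the production values; the point is only that each group member is its own pool with a non-zero `init_count`.

```elixir
# config/config.exs (sketch; all names, hosts, and counts are hypothetical)
import Config

config :pooler,
  pools: [
    [
      name: :riak_node_1,
      group: :riak,
      max_count: 2048,
      # Non-zero: pooler tries to start this many workers when the pool starts.
      init_count: 16,
      start_mfa: {Riak.Connection, :start_link, [~c"10.0.0.1", 8087]}
    ],
    [
      name: :riak_node_2,
      group: :riak,
      max_count: 2048,
      # If this host is down at boot, these initial connections cannot be made.
      init_count: 16,
      start_mfa: {Riak.Connection, :start_link, [~c"10.0.0.2", 8087]}
    ]
  ]
```

Setting `init_count: 0` for every member should avoid connecting at startup, but I have not verified that it prevents the cascade once the pools are under load.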
Not sure I fully understand the problem.
"will halt OTP startup if one of a group's members is unavailable but the configuration specifies a non-zero number of init workers"
What do you mean by "one of a group's members is unavailable"? Is it when start_mfa blocks and does not return for a long time?