Node Failure Handling

Open noizu opened this issue 7 years ago • 1 comment

  • Pooler halts OTP startup if one member of a group is unavailable while the configuration specifies a non-zero number of init workers. (I am running into this in production, where Riak TS nodes periodically crash because of GCE NVMe local disk instability.)

  • Depending on the number of active workers, a node failure can cascade and halt pooler and the OTP tree. (I have a cluster doing about a million Riak writes per minute and saw cascading failures with 2048 connections per node x 6 Riak nodes, duplicated across 5 Elixir servers.)

  • In general, are there any recommended strategies for handling group member failures gracefully? I could, for example, hook up process listeners and automate pool add/remove, but a mechanism that serves fewer connections from a group with a recent high failure rate would be nice, if that is possible.

  • I am using pooler with https://github.com/drewkerrigan/riak-elixir-client
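
To make the third point concrete, here is a minimal sketch of the kind of mechanism being asked about: choosing a group member with a probability that shrinks as its recent failure rate grows. It is written in Python purely as an illustration and is not pooler's API. The class name `FailureAwareSelector`, its parameters (`window_seconds`, `min_weight`), and the member names are all hypothetical.

```python
import random
import time
from collections import deque


class FailureAwareSelector:
    """Hypothetical helper: weight group members by recent failure rate."""

    def __init__(self, members, window_seconds=60.0, min_weight=0.05):
        self.window_seconds = window_seconds
        # A floor keeps a failing member reachable so it can recover.
        self.min_weight = min_weight
        # Per member: (timestamp, succeeded) outcomes, oldest first.
        self.outcomes = {m: deque() for m in members}

    def record(self, member, succeeded, now=None):
        """Record the outcome of one checkout or request against a member."""
        now = time.monotonic() if now is None else now
        self.outcomes[member].append((now, succeeded))

    def _failure_rate(self, member, now):
        events = self.outcomes[member]
        # Drop outcomes that have fallen out of the sliding window.
        while events and now - events[0][0] > self.window_seconds:
            events.popleft()
        if not events:
            return 0.0
        failures = sum(1 for _, ok in events if not ok)
        return failures / len(events)

    def weights(self, now=None):
        """Return a weight per member: 1.0 when healthy, min_weight at worst."""
        now = time.monotonic() if now is None else now
        return {
            m: max(self.min_weight, 1.0 - self._failure_rate(m, now))
            for m in self.outcomes
        }

    def choose(self, now=None, rng=random):
        """Pick a member at random, in proportion to its current weight."""
        w = self.weights(now)
        members = list(w)
        return rng.choices(members, weights=[w[m] for m in members], k=1)[0]
```

A caller would invoke `record` after each attempt and `choose` before the next one. Because old outcomes expire after `window_seconds`, a member that crashed and came back regains its full share without manual pool add/remove.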

noizu • May 29 '18 06:05

I'm not sure I fully understand the problem.

"will halt OTP startup if one of a group members is unavailable but configuration specifies non zero init workers"

What do you mean by "one of a group members is unavailable"? Is it that start_mfa blocks and does not return for a long time?

seriyps • Apr 09 '23 00:04