nap Slave recovery

Slave recovery

Open peterbourgon opened this issue 11 years ago • 1 comment

It looks like if a slave dies, some percentage of the requests routed to slaves will simply fail. Feature request: support failure detection, and some kind of circuit-breaker recovery mechanism.

Dec 14 '13 16:12 peterbourgon

There are two kinds of failure in this scenario: transient and permanent. When a slave fails permanently, the infrastructure reshuffling that follows is tightly coupled to each organisation. One could put VIPs in front of each slave and keep the reshuffling below the application layer, but I don't think most setups are that advanced yet. If the master fails, things get even dirtier: with manual or automatic failover, what was previously a slave could now be a master, which breaks all the assumptions in this library. With so much fragility, this sort of dynamic reconfiguration is perhaps better handled by the user of the library, by re-instantiating the DB object.

With transient failures the story is different, as they can originate from a wide range of events, such as:

Network connectivity loss
Network connectivity degradation
Load spikes
Slow queries
Replication lag

Distinguishing between these is hard or impossible from the application layer. How do you envision a circuit-breaking mechanism with such a varied array of failure modes and such coarse detection capabilities? Detection and, more generally, infrastructure state introspection is a concern orthogonal to this library and specific to each organisation, and I don't see how I could account for it in a general way. I am very happy to discuss it though!

Dec 17 '13 11:12 tsenart
