nap
Slave recovery
It looks like if a slave dies, some percentage of the requests routed to slaves will simply fail. Feature request: support failure detection and some kind of circuit-breaker recovery mechanism.
There are two kinds of failure in this scenario: transient and permanent.
When a slave fails permanently, the infrastructure reshuffling that follows is tightly coupled to how each organisation runs its databases.
One could put VIPs in front of each slave and keep the reshuffling below the application layer, but I don't think most setups are that advanced yet. If the master fails, things get even dirtier, with manual or automatic failover: what was previously a slave could now be the master, which breaks all the assumptions in this library. With so many fragilities, this sort of dynamic reconfiguration is perhaps better handled by the user of the library, by re-instantiating the DB object.
With transient failures the story is different, as they can originate from a wide range of events, such as:
- Network connectivity loss
- Network connectivity degradation
- Load spikes
- Slow queries
- Replication lag
Distinguishing between these is hard or impossible from the application layer. How do you envision a circuit-breaking mechanism with such a varied array of failure modes and such coarse detection capabilities? Detection, and more generally infrastructure state introspection, is an orthogonal concern that differs for each organisation, and I don't see how I could account for it in a general way. I am very happy to discuss it though!