Continual rack-awareness rebalancing for multi-AZ deployments
In multi-AZ cloud deployments it is important that replicas are placed in different AZs so that the cluster is resilient to single-AZ faults.
Currently rack awareness isn't strictly a hard constraint. Should it be? Consider the case of an AZ going down. At some point we decide that we should make replacement copies of the data. Two options: a new AZ comes online to replace the lost nodes (all new node IDs), or it doesn't. In the former case we need to make sure rack awareness is taken into account; in the latter it may be worthwhile to make a third copy on one of the remaining two AZs (3 copies on 3 AZs is best, but 3 copies on 2 AZs is next best). Because of this latter scenario it may make sense not to treat rack awareness as strictly a hard constraint, but rather as one level in a hierarchy of constraints.
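To make the "hierarchy of constraints" idea concrete, here is a minimal sketch, assuming hypothetical types and names (`placement_constraint`, `place_replicas`, etc. are not existing Redpanda code): placement tries to satisfy every constraint first and relaxes them one by one, lowest priority first, rather than failing outright.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct node_candidate {
    int node_id;
    std::string rack; // AZ label
};

struct placement_constraint {
    std::string name;
    int priority; // higher value = more important
    // would adding `n` to the already-chosen replicas satisfy this constraint?
    std::function<bool(const node_candidate& n,
                       const std::vector<node_candidate>& chosen)> satisfied;
};

// Pick up to `replication_factor` nodes. For each replica slot, first try to
// satisfy every constraint; if no candidate qualifies, relax constraints one
// by one starting from the lowest priority.
std::vector<node_candidate> place_replicas(
  std::vector<node_candidate> nodes,
  std::vector<placement_constraint> constraints,
  size_t replication_factor) {
    // most important constraints first, so relaxation drops from the back
    std::sort(constraints.begin(), constraints.end(),
      [](const auto& a, const auto& b) { return a.priority > b.priority; });
    std::vector<node_candidate> chosen;
    while (chosen.size() < replication_factor && !nodes.empty()) {
        bool placed = false;
        for (size_t relaxed = 0; relaxed <= constraints.size() && !placed; ++relaxed) {
            const size_t active = constraints.size() - relaxed;
            for (const auto& n : nodes) {
                bool ok = true;
                for (size_t i = 0; i < active; ++i) {
                    if (!constraints[i].satisfied(n, chosen)) { ok = false; break; }
                }
                if (!ok) continue;
                const int id = n.node_id;
                chosen.push_back(n);
                // a node holds at most one replica, so drop it from the pool
                nodes.erase(std::remove_if(nodes.begin(), nodes.end(),
                  [id](const node_candidate& m) { return m.node_id == id; }),
                  nodes.end());
                placed = true;
                break;
            }
        }
        if (!placed) break; // ran out of candidate nodes
    }
    return chosen;
}
```

Rack awareness would then be a high-priority entry in `constraints`, relaxed only when no node on an unused rack exists, which gives exactly the "3 copies on 3 AZs, else 3 copies on 2 AZs" behaviour described above.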
What cluster operations might lead to violations of rack awareness? The system should be able to identify cases in which replica placement should be changed to meet goals related to resilience (i.e. multi-AZ rack awareness).
Possibly related: https://github.com/redpanda-data/redpanda/issues/6058
I think it needs to be 'eager and best effort' for the reasons you mention but we should really ensure this logic is wired into every workflow (e.g. #6058).
That is, we should always spread according to the configured placement/replication policy (racks per partition = replication factor seems to be our current policy), and when that can't be satisfied, honour the replication factor (lower in the hierarchy of constraints) without the implicit/explicit rack constraint and fire a low-level alert of some kind. Not an 'under-replicated partitions' state, but an 'under-redundant partitions' or some kind of minor 'fault tolerance constraint violation' metric/alert.
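A rough illustration of that distinction, with invented names rather than actual Redpanda internals: a partition can be fully replicated yet "under-redundant" because its replicas span fewer racks than they could.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct replica { int node_id; std::string rack; };
struct partition_assignment {
    std::vector<replica> replicas;
    size_t replication_factor;
};

struct redundancy_report {
    size_t under_replicated = 0; // fewer live replicas than the replication factor
    size_t under_redundant = 0;  // RF satisfied, but rack spread below the target
};

redundancy_report check_redundancy(
  const std::vector<partition_assignment>& partitions, size_t racks_available) {
    redundancy_report report;
    for (const auto& p : partitions) {
        if (p.replicas.size() < p.replication_factor) {
            report.under_replicated++;
            continue;
        }
        std::set<std::string> racks;
        for (const auto& r : p.replicas) racks.insert(r.rack);
        // target spread: one replica per rack, capped by how many racks exist
        const size_t target = std::min(p.replication_factor, racks_available);
        if (racks.size() < target) report.under_redundant++;
    }
    return report;
}
```

The `under_redundant` count is what the 'fault tolerance constraint violation' metric/alert would be driven by, separately from the existing under-replicated state.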
That is (and in the interest of hands-free administration in cloud), I lean towards self-healing as much as possible in all cases vs. waiting some arbitrary time until an AZ might come back (which would probably require some manual intervention).
This of course means that when the AZ does come back (or, alternatively, after an extended outage when the admin/infra spawns nodes in yet another AZ), the balancer will have to adjust again and move things around, which seems like unavoidable work to make this behave properly.
Other systems have a pluggable policy (4-5 examples here): https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsBlockPlacementPolicies.html. Not sure if we need to go that far, but at least the policy should be easy to evolve over time if we want to make it smarter.
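For reference, a pluggable policy could be as small as an interface like the following sketch (names such as `placement_policy` and `choose_targets` are invented here, loosely analogous to the HDFS BlockPlacementPolicy linked above, not an existing Redpanda API):

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct broker { int node_id; std::string rack; };

class placement_policy {
public:
    virtual ~placement_policy() = default;
    // choose `replication_factor` brokers for a new partition
    virtual std::vector<broker> choose_targets(
      const std::vector<broker>& live_brokers, size_t replication_factor) const = 0;
    // does an existing replica set still satisfy the policy? the continuous
    // balancer would use this to decide when to reshuffle
    virtual bool is_satisfied(const std::vector<broker>& replicas) const = 0;
};

// Default: spread across as many racks as possible, falling back to simply
// satisfying the replication factor when racks are scarce.
class rack_aware_policy final : public placement_policy {
public:
    std::vector<broker> choose_targets(
      const std::vector<broker>& live_brokers, size_t replication_factor) const override {
        std::vector<broker> chosen;
        std::set<std::string> used_racks;
        // first pass: at most one replica per rack
        for (const auto& b : live_brokers) {
            if (chosen.size() == replication_factor) return chosen;
            if (used_racks.insert(b.rack).second) chosen.push_back(b);
        }
        // second pass: top up to the replication factor even if racks repeat
        for (const auto& b : live_brokers) {
            if (chosen.size() == replication_factor) break;
            const bool already = std::any_of(chosen.begin(), chosen.end(),
              [&](const broker& c) { return c.node_id == b.node_id; });
            if (!already) chosen.push_back(b);
        }
        return chosen;
    }

    bool is_satisfied(const std::vector<broker>& replicas) const override {
        std::set<std::string> racks;
        for (const auto& r : replicas) racks.insert(r.rack);
        return racks.size() == replicas.size(); // every replica on a distinct rack
    }
};
```

Swapping in a smarter policy (e.g. one that also weighs load or multi-region topology) would then be a matter of providing another implementation rather than rewriting the balancer.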
@jcsp
btw, if you think about how we handled partition moves with the balancer in 22.2 (adding an extra replica while movement is in progress), this is along the same lines (prefer satisfying replication factor to rack constraints during the failure to be on the safe side). I would expect the continuous balancer already handles this part though, and it's only the 'when AZ comes back' behavior we need to add?
> prefer satisfying replication factor to rack constraints during the failure to be on the safe side

> I would expect the continuous balancer already handles this part though, and it's only the 'when AZ comes back' behavior we need to add?
@mattschumpert For the most part yes. But there is a subtle difference between how the rack awareness constraint is currently implemented and how we would like it to behave. Ideally, rack awareness should be a hard constraint when it can be satisfied, and otherwise a soft one. But right now it is always a soft constraint (meaning that it is just a hint, albeit a strong one). The difference is that it can interact somewhat surprisingly with other soft constraints, such as preferring a less loaded replica. In theory this interaction can lead to rack awareness being violated even when there is a possibility to satisfy it.
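To spell out that interaction with a purely illustrative sketch (the names and weights below are made up, not the real constraint code): with additive soft scoring a large enough load difference can outweigh even a strongly weighted rack hint, whereas a hard-when-satisfiable approach filters by rack first and only then applies the soft preferences.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

struct candidate { int node_id; std::string rack; double load; };

// Soft-only: the rack bonus is just a (large) term in the score, so a big
// enough load difference can still override it.
int pick_soft_only(const std::vector<candidate>& cands,
                   const std::set<std::string>& used_racks) {
    double best_score = -1e18;
    int best = -1;
    for (const auto& c : cands) {
        double score = 0.0;
        if (!used_racks.count(c.rack)) score += 1000.0; // strong hint, still soft
        score -= c.load;                                // load penalty can dominate
        if (score > best_score) { best_score = score; best = c.node_id; }
    }
    return best;
}

// Hard-when-satisfiable: restrict to unused racks whenever at least one such
// candidate exists, and only then pick the least loaded survivor.
int pick_hard_then_soft(const std::vector<candidate>& cands,
                        const std::set<std::string>& used_racks) {
    if (cands.empty()) return -1;
    std::vector<candidate> filtered;
    for (const auto& c : cands) {
        if (!used_racks.count(c.rack)) filtered.push_back(c);
    }
    const auto& pool = filtered.empty() ? cands : filtered; // relax only if forced to
    const auto it = std::min_element(pool.begin(), pool.end(),
      [](const candidate& a, const candidate& b) { return a.load < b.load; });
    return it->node_id;
}
```

In the first variant an unused rack can lose to a sufficiently idle node on an already-used rack; in the second it never does unless no unused rack is available at all.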
I think describing all problems with the current partition placement architecture and suggesting a way forward merits an RFC. I'll work on one.
@mattschumpert what kind of metrics do we want for this? Will the total number of partitions with a violated rack constraint be enough?
Yes, I think that's sufficient and helpful, but a bonus would be to know the number of racks (AZs) currently down (how many fewer racks are available than the maximum requested by any partition/topic's replication factor). Then it's easy to see not just 'X partitions are affected' but rather 'Y things are wrong and I need to go deal with Y things', e.g. 2 racks are down or just 1 AZ is missing, and I can correlate that with other information (e.g. an AWS AZ outage).
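A sketch of that second gauge (the name `racks_unavailable` and the types here are invented for illustration, not Redpanda's actual metrics): how many racks short the cluster currently is of the largest replication factor any topic asks for. Paired with the per-partition violation count, it separates 'Y racks are down' from 'X partitions are affected'.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct topic_info { std::string name; size_t replication_factor; };

// How many fewer racks (AZs) are available than the maximum replication
// factor requested by any topic; 0 when every topic can be fully spread.
size_t racks_unavailable(const std::vector<topic_info>& topics,
                         const std::set<std::string>& racks_up) {
    size_t max_rf = 0;
    for (const auto& t : topics) max_rf = std::max(max_rf, t.replication_factor);
    return max_rf > racks_up.size() ? max_rf - racks_up.size() : 0;
}
```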