
WIP: only apply flow control when majority of members need it

jchorl opened this issue on Jun 16, 2022 · 0 comments

Summary

This PR is a proposal - we should chat further.

This PR modifies the flow-control logic to only kick in when a majority of nodes require it. The use case is a single bad node: if a server becomes unhealthy, it starts lagging in its certification queue and applier queue, and once either of those hits a GR-tunable threshold, write performance tanks. In other words, if any one node is having problems, performance issues are inevitable, even if a majority of nodes are healthy. And they'll never auto-remediate.
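To make the proposal concrete, here is a minimal standalone sketch of the decision rule. It assumes per-member certifier/applier queue sizes and thresholds along the lines of what GR already tracks; the struct and function names are illustrative, not the actual plugin code.

```cpp
// Minimal sketch of the proposed rule (illustrative; not the actual
// MySQL/Percona flow-control code). Queue sizes and thresholds stand in
// for the per-member pipeline stats GR already collects.
#include <cstddef>
#include <vector>

struct MemberStats {
  std::size_t certifier_queue_size;  // transactions waiting for certification
  std::size_t applier_queue_size;    // transactions waiting to be applied
};

// A member "needs" flow control when either queue is over its threshold,
// mirroring the existing per-member trigger.
bool member_needs_flow_control(const MemberStats &m,
                               std::size_t certifier_threshold,
                               std::size_t applier_threshold) {
  return m.certifier_queue_size > certifier_threshold ||
         m.applier_queue_size > applier_threshold;
}

// Proposed rule: throttle writers only when a strict majority of members are
// over threshold, instead of when any single member is.
bool group_needs_flow_control(const std::vector<MemberStats> &members,
                              std::size_t certifier_threshold,
                              std::size_t applier_threshold) {
  std::size_t lagging = 0;
  for (const auto &m : members)
    if (member_needs_flow_control(m, certifier_threshold, applier_threshold))
      ++lagging;
  return lagging > members.size() / 2;
}
```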

There are caveats!

  1. This changes behaviour - GR no longer caters to the slowest node, but to the majority of nodes (which is not always what you want). This is particularly impactful on heterogeneous infra where it is critical that all secondaries remain in sync with the primary.
  2. Therefore stale reads can be very stale, and failovers can take a long time if they fail over to the furthest-behind replica.
  3. This relies on m_info only containing one entry per member. Members are transient, so members holding entries may no longer be alive (illustrated in the sketch below).
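To illustrate caveat 3, here is a hypothetical sketch of pruning stats entries for members that have left the current group view before counting the majority, assuming the stats are keyed by member UUID. All names are made up for illustration; this is not how m_info is actually maintained.

```cpp
// Illustration of caveat 3 only (hypothetical names, not the real m_info
// handling): drop stats for members that are no longer in the current group
// view, so departed members cannot tip the majority count either way.
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>

struct MemberStats {
  std::size_t certifier_queue_size;
  std::size_t applier_queue_size;
};

std::unordered_map<std::string, MemberStats> prune_departed_members(
    const std::unordered_map<std::string, MemberStats> &stats_by_uuid,
    const std::set<std::string> &current_view_uuids) {
  std::unordered_map<std::string, MemberStats> live;
  for (const auto &entry : stats_by_uuid)
    if (current_view_uuids.count(entry.first) != 0) live.insert(entry);
  return live;
}
```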

While it worked in testing in a very specific use-case, it likely will not fit all use-cases.

Why flow control

MySQL has good docs on flow control: https://dev.mysql.com/doc/refman/8.0/en/group-replication-flow-control.html

What's the problem

Flow control decimates write performance. We've seen it take p99 latency from ~30ms to 2s. It can kick in even when just a single secondary is unhealthy and can't keep up with writes. On common cloud providers, single-instance failures are quite common, so GR must be resilient to single-node failures.

Alternatives considered

  • Define an entirely new flow control policy - separate from QUOTA and OFF. Yeah, this is reasonable. If we decide we want some policy like FCM_QUOTA_MAJORITY, we can figure out the technical bits.
  • The control plane should detect and mitigate replicas that fall behind - sure, and maybe mysql should be tunable in this regard as well. Outside of mysql, this is toilsome to implement. GR should be resilient itself.
  • Catering to all but a single node falling behind, instead of catering to the majority of nodes - this is reasonable too. Ignoring one node vs. ignoring a minority, either is fine.
  • Fail over to the least-behind replica - this is an optimization. MySQL should do this, but it is orthogonal. Today, MySQL implements deterministic failover logic in GR: https://dev.mysql.com/blog-archive/group-replication-prioritise-member-for-the-primary-member-election/
  • Min quotas - MySQL has a GR min-quota variable: https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replication_flow_control_min_quota . But this is different too - it assigns a minimum quota to each node before flow control will impact primary write perf, which is quite similar to just disabling flow control. If the quota is lower than the write TPS, it only dampens the flow-control impact a bit; if it's higher, it means all nodes can fall behind by up to that amount. Most painful, it must be tuned to hover around the write TPS to mitigate single bad nodes, and then it allows every node to fall that far behind (see the sketch after this list).
  • Reduce impact of flow-control - yeah, this is reasonable too. Min quotas kinda do this, see ^.
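For reference on the min-quota point above, a tiny sketch of that interaction, assuming the configured minimum simply acts as a floor on the quota flow control would otherwise compute (0 meaning no minimum); this is illustrative and not the plugin's actual quota math.

```cpp
// Sketch of the min-quota interaction described above (illustrative only).
// If the floor is above the primary's natural write rate, it never binds the
// writer - effectively flow control off; if it is below, it only softens how
// hard flow control can clamp.
#include <algorithm>
#include <cstdint>

// computed_quota: the per-second write budget flow control would impose
//                 based on the slowest member.
// min_quota:      the configured minimum quota (0 = no minimum).
int64_t effective_quota(int64_t computed_quota, int64_t min_quota) {
  if (min_quota <= 0) return computed_quota;
  return std::max(computed_quota, min_quota);  // the minimum acts as a floor
}
```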

Possibly, we only want to expose this option for GR single-leader mode.

Results

In testing with sysbench, we saw a baseline p99 of 70ms. We then stressed a replica in the group. With the standard flow-control mechanism, p99 went to 590ms. This was with MySQL in single-primary mode and Paxos in single-leader mode. Clearly, flow control fired.

With the patch, p99 insert latency stayed around 70ms. Clearly, GR just wasn't considering the troubled node anymore.

Thoughts

For multi-primary mode, this can be problematic: not only because of stale reads, but also because that slow secondary will fall further behind and will need to work through its certifier queue when it takes writes.

In single-primary mode, either you force all members to keep up and respond quickly when they don't, or you allow single members to lag behind and live with the consequences. This should be an operator choice; it depends on RPO/RTO. But catering to the slowest node in the group is crazy ambitious.
