Unified, coherent, scalable cluster quorum metadata
This issue came up in discussion on PR #4803 a couple of times (here and here). This is a follow-up issue to discuss the future vision and reach consensus on how we want to evolve this part of redpanda.
The short-term question that prompted this is: does the set of healthy nodes and disks belong in metadata_cache? A couple of people indicated this is not really its intended purpose, though it is already doing some of that, and I do not see any other existing API that can return the set of healthy disks on healthy nodes. This leads to the longer-term questions:
What is the future of members_table, metadata_cache, health_monitor, and so on?
When I say "quorum" here, I mean the concept of cluster-level quorum, which currently maps to our controller. This is the top-level set of information that you need to start the request processing pipeline. This typically consists of a set of nodes and disks along with some basic health information (i.e. a unhealthy or misconfigured node may not be allowed to join the cluster quorum).
The primary issues I see with our cluster metadata system(s) today are:
- Allowing clients to submit requests when we are not in cluster quorum. Today, a client that needs to determine whether brokers are "in cluster quorum" looks like this, which does this. IMHO, it should just be a simple retry on the first Kafka request, with redpanda handling the details of when it can service the request.
- Lack of invariants around basic quorum membership and health information. We should be able to fetch the metadata needed to build an execution plan (performing the requested client operation) and have it guaranteed to be consistent and valid for the life of the request. If the cluster quorum changes during the request, we should have a standard error handler which retries the operation with updated quorum state. That is, you can write correct code which depends on quorum metadata, without custom (per-codepath) error handling and retries (e.g. as suggested here); see the sketch after this list.
- Lack of (IIUC) coherent cluster metadata. Today you have to query multiple eventually-consistent sources of truth (members_table, health_monitor).
- Scalability and efficiency. It is likely we can provide stronger semantics with less code and runtime overhead. For scalability, growing your cluster-level consensus group beyond 100 nodes typically requires keeping your top-level cluster quorum state as simple as possible, and using alternate scaling approaches to fan out from that data. Your cluster quorum can just state the set of resources (nodes, disks, etc.) you need to coordinate the next steps in the processing path. We also have the opportunity to eliminate per-NTP raft overhead by using cluster-level heartbeats as proxies for NTP-level ones.
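As a concrete illustration of the second bullet, here is a minimal sketch of a standard quorum-retry wrapper. Everything in it (quorum_snapshot, quorum_changed_error, with_quorum_retry) is hypothetical, not existing redpanda code; it only shows the shape of "fetch a coherent snapshot, run the operation, retry on quorum change".

```cpp
#include <cstdint>
#include <stdexcept>

struct quorum_snapshot {
    uint64_t epoch; // bumped whenever quorum membership/health changes
    // ... nodes and disks, as in the earlier sketch
};

// Thrown by an operation that detects its snapshot is out of date.
struct quorum_changed_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Run `op` against one coherent quorum snapshot. If the quorum changes
// underneath it, fetch an updated snapshot and retry, so each codepath
// does not need custom error handling. `take_snapshot` stands in for a
// coherent read of quorum state.
template <typename TakeSnapshot, typename Op>
auto with_quorum_retry(TakeSnapshot take_snapshot, Op op, int max_attempts = 3) {
    for (int attempt = 1;; ++attempt) {
        auto snap = take_snapshot();
        try {
            // `op` may assume `snap` is consistent and valid for its
            // entire execution.
            return op(snap);
        } catch (const quorum_changed_error&) {
            if (attempt >= max_attempts) {
                throw; // surface the quorum-change failure to the caller
            }
        }
    }
}
```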
We currently have two extremes:
- the controller raft group, which provides robust leadership election and replication of the persistent parts of the cluster state. This includes members_table, topics_table, etc.
- the health monitor and metadata dissemination service, which share out updates of non-persistent state within some time bounds.
There very much is already a cluster quorum: that's the controller raft group. The messy part is that we don't use that raft group for everything, mainly because we don't want to write everything to disk just to share it out (we don't do memory-only state machine replication tied to that group, although we could).
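As an aside on that parenthetical, a memory-only state machine tied to the controller group could look roughly like this: entries are replicated and applied in log order like any other state machine, but the applied state lives only in memory and is rebuilt by replaying the log (or a snapshot) after a restart. The interface below is invented for illustration and is not redpanda's state-machine API.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

struct replicated_entry {
    uint64_t offset; // position in the controller log
    std::string key;
    std::string value;
};

class in_memory_stm {
public:
    // Called for each committed entry, in log order, on every member of
    // the group; deterministic apply keeps replicas consistent.
    void apply(const replicated_entry& e) {
        _state[e.key] = e.value;
        _applied_offset = e.offset;
    }

    uint64_t applied_offset() const { return _applied_offset; }

private:
    // Kept only in memory; after a restart it is rebuilt by replay,
    // which is the cost of replicating for sharing, not persistence.
    std::unordered_map<std::string, std::string> _state;
    uint64_t _applied_offset{0};
};
```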
The health monitor (basically a central piece of state on the controller leader, with a time-based cache on other nodes) is an expedient way forward for modest system sizes, but clearly becomes a scale issue. Basic health information (especially the up-ness of a node) will be a good candidate for moving to a gossip protocol, both for scale and because gossip tends to lend itself better to detecting flaky/asymmetric failures.
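For a flavor of what that could look like, here is a generic SWIM-style merge rule (a sketch, not a concrete proposal, and none of these types exist in redpanda): each node periodically exchanges its liveness view with a few random peers, a higher incarnation number wins, and at equal incarnations the worse status dominates, so suspicion observed from any vantage point propagates.

```cpp
#include <cstdint>
#include <unordered_map>

// Ordered so that a "worse" status compares greater.
enum class liveness : uint8_t { alive, suspect, dead };

struct health_record {
    uint64_t incarnation; // bumped by the owning node to refute suspicion
    liveness status;
};

// node id -> last known health record
using health_view = std::unordered_map<int32_t, health_record>;

// Merge a peer's view into ours: higher incarnation wins; at equal
// incarnation the worse status wins, the usual SWIM-style conflict rule.
void merge_view(health_view& mine, const health_view& theirs) {
    for (const auto& [node, rec] : theirs) {
        auto it = mine.find(node);
        if (it == mine.end()
            || rec.incarnation > it->second.incarnation
            || (rec.incarnation == it->second.incarnation
                && rec.status > it->second.status)) {
            mine[node] = rec;
        }
    }
}
```

Because each node only talks to a few peers per round, the per-node cost stays roughly constant as the cluster grows, and asymmetric failures show up as disagreement between views rather than being masked by a single central observer.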
The contents of the metadata dissemination service (i.e. who is leader for which partition), I'm not so sure about. This metadata is only needed to satisfy Kafka client requests, and the Kafka protocol is designed to be sloppy: it has no expectation that metadata from a particular node will be up to date, and it expects to find out whether the alleged leader is really the leader when it ultimately tries to connect to it. The main property that's important here is prompt propagation of updates, rather than any kind of global consistency.
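To illustrate that sloppiness, this is roughly the shape of the client side of the contract: metadata is a hint, and the client refreshes it only when the alleged leader turns out not to be the leader. The names here are illustrative, not any real client library's API.

```cpp
#include <cstdint>
#include <functional>
#include <optional>

enum class errc { success, not_leader_for_partition };
struct broker { int32_t id; };

// lookup:  possibly-stale leader hint for a partition
// refresh: re-fetch metadata (from any broker; staleness is tolerated)
// send:    issue the produce request to the given broker
errc produce_with_hint(
  int32_t partition,
  const std::function<std::optional<broker>(int32_t)>& lookup,
  const std::function<void()>& refresh,
  const std::function<errc(broker, int32_t)>& send) {
    for (int attempt = 0; attempt < 3; ++attempt) {
        if (auto b = lookup(partition)) {
            auto ec = send(*b, partition);
            if (ec != errc::not_leader_for_partition) {
                return ec; // the hint was good enough
            }
        }
        // Hint was stale or missing: prompt propagation of updates is
        // what matters here, not global consistency.
        refresh();
    }
    return errc::not_leader_for_partition;
}
```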
> There very much is already a cluster quorum: that's the controller raft group.
Yep, this is clear.
I don't think topic and partition metadata has to be part of the cluster quorum. I think disk health (reachability) probably should be. Disks are a smaller set than topic metadata, and once we have consensus on the set of nodes/disks, we can use that to coordinate safe shared access to other, more scalable structures. For example, for topic metadata, you can imagine storing it outside of the controller raft group and using some sort of interlock with "child" raft groups (I don't know exactly what this would look like) for correctness, or something like transactional updates to a replicated tree or table using a lock manager and transactional buffer management (see the sketch below). This would allow arbitrary scaling of topic metadata and take load off the controller.
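As a very rough sketch of the interlock idea, assuming an invented epoch-fencing scheme: mutations to an external topic-metadata store carry the controller's quorum epoch, and a writer operating on a stale view of the quorum is rejected and must refresh before retrying. Everything here is hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Topic metadata held outside the controller raft group, fenced by the
// controller's quorum epoch so stale writers cannot corrupt it.
struct topic_metadata_store {
    uint64_t fenced_epoch{0}; // highest controller epoch seen so far
    std::map<std::string, int32_t> partitions_by_topic;

    // Returns false if the caller acted on a stale cluster quorum; the
    // caller must refresh its quorum state and retry.
    bool apply(
      uint64_t controller_epoch,
      const std::string& topic,
      int32_t partitions) {
        if (controller_epoch < fenced_epoch) {
            return false;
        }
        fenced_epoch = controller_epoch;
        partitions_by_topic[topic] = partitions;
        return true;
    }
};
```

The controller then only has to replicate the small quorum state (nodes, disks, epoch), while topic metadata can live in whatever more scalable structure sits behind an interface like this.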
Adding some code links for reference, for when we get around to discussing this topic.
- A bit of consensus code that illustrates that we don't know whether we are talking to a node that is in cluster quorum. I ran into this message while trying to figure out why a test was flaky in CI.
IIUC, #5884 is related to the lack of cluster health invariant.
#5949 looks related per this comment.
Had a good (but short) discussion on this. One of the questions was about "split brain" scenarios and whether they exist as I claimed. It is a loose term, but one example came up during the full-disk work:
Take a cluster of nodes 0..7. The controller decides node 7 is flaky (e.g. link flapping) and excludes it from the group. Node 7's disk space is filling up, and producers are still writing to its raft groups. Since the controller doesn't consider this node's membership and/or health data, the disk fills up, bringing down part or all of the cluster (depending on whether or not it rejoins other groups).
IIUC, most users of health information may suffer from similar issues.
This is already being addressed in other efforts, so I'll resolve this issue. Feel free to reopen or annotate if you find it useful, of course. Thanks again for the discussion!