
gossip: abnormal CPU usage growth with increasing node count

Open Drahflow opened this issue 3 years ago • 22 comments

Is your feature request related to a problem? Please describe.

I am trying to estimate maximum reachable OLTP performance for a client of mine. To my frustration I was not able to scale a CockroachDB cluster to significantly more than 256 nodes, due to high CPU load when adding more nodes (most of which is taken up by gossip-protocol related functions according to profiling). My measurements suggest that the work done for gossiping on each node scales quadratically in the number of nodes, which puts an upper limit on the maximum cluster size at about 1000 nodes.

Describe the solution you'd like

The gossip protocol should only perform linear work in the number of nodes.

Describe alternatives you've considered

The gossip protocol intervals could be configurable so larger clusters could be achievable by trading away DDL and node failure detection speed. However, this would only add a small factor to the maximum size until the quadratic growth would have pushed failure detection times too high.

Jira issue: CRDB-4006

Drahflow avatar Jul 23 '20 15:07 Drahflow

Hello, I am Blathers. I am here to help you get the issue triaged.

I have CC'd a few people who may be able to assist you:

  • @tim-o (member of the technical support engineering team)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area, and CC them.

:owl: Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.

blathers-crl[bot] avatar Jul 23 '20 15:07 blathers-crl[bot]

Thanks @Drahflow for filing this. Could you provide a bit more data from your experiments? What analysis did you carry out and how did you compute your conclusions?

Looking at your experiments would inform (and accelerate) our own investigation. Thanks

knz avatar Jul 23 '20 16:07 knz

CRDB does not maintain n^2 gossip connections; it should be logarithmic. The gossip information disseminated cluster-wide should be linear in the number of nodes in the system, so it's a bit surprising to observe what you're observing. I'd be curious to see this data/profile as well.

irfansharif avatar Jul 23 '20 18:07 irfansharif

Cockroach version was v20.1.3-linux-amd64 downloaded from binaries.cockroachdb.com.

My experiments were as follows, always starting a fresh cluster; only init was ever run against it. I observed total CPU usage on the servers (which ran nothing else) for a few minutes after the last node joined.

Setup 1: 16 instances of CockroachDB per m5.16xlarge

  • 4 servers / 64 instances: 3.5% CPU usage (keep in mind this is 16 instances per server, so only 0.21% CPU per instance)
  • 8 servers / 128 instances: 6.5% CPU usage
  • 16 servers / 256 instances: 36.7% CPU usage (2.2% CPU per instance)

The cluster did not stabilize for 24 servers / 384 instances; so, what if I use fewer instances per server?

Setup 2: 8 instances of CockroachDB per m5.16xlarge

  • 4 servers / 32 instances: 1.7% CPU usage
  • 16 servers / 128 instances: 5.0% CPU usage
  • 32 servers / 256 instances: 22.0% CPU usage (2.7% CPU per instance)

The cluster did not stabilize for 48 servers / 384 instances (even though the available CPU per instance is now of course twice that of setup 1).

Free memory was plentiful during all runs (as reported by top). The only traffic to the cluster was my browser on the built-in UI "Overview" page.

I clicked the "Profile (with labels)" link in the built-in UI on both occasions when the cluster became unstable (which btw. is to say there were continuously some ranges unavailable, many underreplicated). In both cases the profile was >60% full of functions which had "Gossip" in their name and the rest looked like it might be related to I/O or network scheduling, but I didn't look very closely.

In both cases the jump from 128 to 256 nodes increased CPU consumption about 4-fold, which strongly suggests something is O(nodes^2).

In any case, it's easy enough to replicate: Try to bring up a 512 node cluster of 20.1.3 and marvel at the CPU consumption. If you only use one node per physical machine it could potentially even work, but even then the CPU profile should be perfectly clear.

Drahflow avatar Jul 23 '20 18:07 Drahflow

Re-reading the above, one correction:

always starting a fresh cluster,

is not correct. I set up a new cluster (including new machines) between experiment setups 1 and 2. However, within each setup I added more nodes to an existing cluster.

Drahflow avatar Jul 23 '20 19:07 Drahflow

In preparation for a larger-scale load test run, today I confirmed the same problem on 256 t3.medium instances hosting one cockroach instance each (so it's not some artifact of shared IPs). I also checked the CPU profile more closely; the non-gossip-related functions are Go garbage collection.

Drahflow avatar Aug 03 '20 16:08 Drahflow

I have looked at the code and I have found something "interesting":

// maxPeers returns the maximum number of peers each gossip node
// may connect to. This is based on maxHops, which is a preset
// maximum for number of hops allowed before the gossip network
// will seek to "tighten" by creating new connections to distant
// nodes.
func maxPeers(nodeCount int) int {
  // ...
  maxPeers := int(math.Ceil(math.Exp(math.Log(float64(nodeCount)) / float64(maxHops-2))))
  if maxPeers < minPeers {
    return minPeers
  }
  return maxPeers
}

So on each node we will have F(n) = ceil(exp(log(n)/3)) connections to other nodes. The division by 3 does not really alter the complexity of the formula. This means that we have O(n) connections to other nodes, per node.

So if we look at n nodes together, we have O(n * n) inter-node connections.

So the behavior is indeed quadratic (in the worst case).

knz avatar Aug 04 '20 13:08 knz

Sorry, I was wrong: the "/3" does matter; it turns the growth into a cube root.

So we have O(n^(1/3)) conns per node, and thus O(n^(4/3)) connections for the cluster as a whole. That's much better.
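
To make that concrete, here's a tiny standalone program evaluating the formula at a few cluster sizes (I'm assuming maxHops = 5 and minPeers = 3 here; treat those constants as assumptions rather than the exact values from gossip.go):

package main

import (
  "fmt"
  "math"
)

const (
  maxHops  = 5 // assumption; the "/3" above implies maxHops-2 == 3
  minPeers = 3 // assumption
)

func maxPeers(nodeCount int) int {
  maxPeers := int(math.Ceil(math.Exp(math.Log(float64(nodeCount)) / float64(maxHops-2))))
  if maxPeers < minPeers {
    return minPeers
  }
  return maxPeers
}

func main() {
  // Per-node peers grow like n^(1/3); total connections across the
  // cluster grow like n^(4/3).
  for _, n := range []int{64, 128, 256, 512, 1024} {
    fmt.Printf("n=%4d  peers/node=%d  total conns approx. %d\n", n, maxPeers(n), n*maxPeers(n))
  }
}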

knz avatar Aug 04 '20 14:08 knz

Hi @Drahflow, thanks again for your investigation. We've queued an analysis on our internal roadmap to check what is going on here and then take action.

Our current standpoint is that while the behavior you found is concerning, because it seems to violate our design objectives, the CPU overhead you're measuring on 200 nodes is not high enough yet for us to consider this issue as "urgent". We feel that users will be OK with 1-5% CPU overhead per CockroachDB node just for gossip up to 500 nodes.

Of course that does not mean we want to ignore the problem and that is why we queued an investigation. We plan to look at this in the v21.1 timeframe. Again thank you for your analysis.

knz avatar Aug 06 '20 11:08 knz

For the record: I saw a (superficially similar) instability today on a 365-node cluster (mix of c5.9xlarge / m5.8xlarge) which was not CPU (nor RAM) starved. Is the gossip handling happening in only one or two goroutines (and thus limited to one or two cores)?

The cluster initially appeared healthy, I started actual application load whereupon most nodes became "suspect". I then stopped the application load but >350 nodes remained in "suspect" status over the next 15 minutes (until I had to stop the experiment). Nodes occasionally moved to "healthy" for a short time, until again becoming "suspect". CPU utilization as reported in top was ~200% (i.e. two full cores) after the application load was stopped. Thus the low-number-of-goroutines hypothesis above.

Drahflow avatar Aug 20 '20 12:08 Drahflow

Which version were you running? How are you deploying the database (e.g. via k8s, systemd, etc.)?

Can you look at the Log Commit Latency on the Storage dashboard, and at the disk utilization metrics on the Hardware dashboard (dashboard screenshots are appreciated if you're willing)?

ajwerner avatar Aug 20 '20 13:08 ajwerner

Version: https://binaries.cockroachdb.com/cockroach-v20.1.3.linux-amd64.tgz Deployment: Terraform onto new EC2 instances (Ubuntu ami-0062c497b55437b01) and then started via

while sleep 1; do
  /usr/local/bin/cockroach start \
    --insecure \
    --advertise-addr="$LOCAL_IP":5432 \
    --listen-addr=0.0.0.0:5432 \
    --join=10.10.224.8:5432,10.10.224.9:5432,10.10.224.10:5432,10.10.224.11:5432,10.10.224.12:5432 \
    --cache=.25 \
    --max-sql-memory=.25 \
    --store=path=/mnt/cockroach \
    --max-offset=500ms \
    --background;
done

(with the understanding that further invocations will immediately terminate upon finding the running instance had locked the storage).

The exact same setup (but on "only" 265 instances) worked 100% fine under more than 50 times the load.

Drahflow avatar Aug 21 '20 12:08 Drahflow

I looked into this further today (via /_status/gossip/local):

  • The gossip key set grows linearly with the number of servers (distsql-draining:<node>, gossip-clients:<node>, store:<node>, liveness:<node>, and node:<node>).
  • There are constant updates (at least to origStamp) on gossip-clients:<node>, liveness:<node> and store:<node>.

(But maybe this is as expected.)

Drahflow avatar Nov 04 '20 15:11 Drahflow

I just started up a fresh 256-node cluster on V21.1.0-alpha.3 and found the CPU overhead of a totally idle cluster surprisingly high. Each node is consuming 200-300% CPU (about 23% of available CPU) and I haven't done anything with the cluster yet.

[screenshot: Screen Shot 2021-04-30 at 11 13 33 AM]

Graphviz version: cockroach-cpu.html.zip

dankinder avatar Apr 30 '21 15:04 dankinder

Thanks for checking. Is that more or less than in the previous release? (We're expecting about the same)

knz avatar Apr 30 '21 15:04 knz

I didn't compare to the previous release; I've been using this one for the past month or two since I need some of the changes in it.

dankinder avatar Apr 30 '21 16:04 dankinder

@dankinder, do you mind grabbing a profile from this cluster, if it's still running? You could do it using the following (with the real address, of course):

 go tool pprof 'http://localhost:8080/debug/pprof/heap?seconds=5&labels=true'

irfansharif avatar Apr 30 '21 16:04 irfansharif

This includes a 5-second CPU profile, heap, and heap delta: Archive.zip

dankinder avatar Apr 30 '21 20:04 dankinder

When I tried testing a large cluster (95 nodes) with a bit of load on it I already started seeing some bad behavior from gossip. Specifically I'd see a small number of nodes occasionally reporting a large fraction of other nodes in the cluster as unlive (going by their liveness_livenodes metric) and all of their replicas as unavailable (going by their ranges_unavailable metric). At this time no nodes were actually failing liveness heartbeats or having liveness epochs incremented, so it appeared relatively clear that a delay in processing gossip entries was at fault.

After seeing this I decided to see what was going on.


The first thing that stood out is how ridiculously often tightenNetwork gets called. When I used the logspy debug endpoint with vmodule="gossip=2" I'd generally hit my limit of 100 log messages within 350ms due to 100 separate calls to tightenNetwork() all logging the distantHops: ... VEventf. That suggests it's being called ~285 times per second, and it's very much worth noting that tightenNetwork() takes the Gossip struct's mutex lock each time (with a Lock(), not an RLock())...

This is almost certainly worth fixing. We currently call maybeTightenLocked every time we're sent data on a server connection or receive a response back on a client connection. Making things worse in terms of mutex contention, maybeTightenLocked already requires the mutex to be held but doesn't just do the check directly -- it sends a signal over a channel of size 1, and the recipient of the channel send has to retake the mutex to do its check. I see no reason why we'd ever need to check tightenNetwork hundreds of times a second.

While we could probably go further, a simple time-based check that ensures we only call it once a second (with an exception allowing it to run again sooner if the previous call actually did some work to tighten the network rather than deciding it wasn't needed) would go a long way. This change would be trivial and I'm not aware of any real downside.
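
A rough sketch of the shape I have in mind (the struct and names below are hypothetical stand-ins, not the actual fields in gossip.go):

package gossipsketch

import (
  "sync"
  "time"
)

// tightener is a made-up stand-in for the real Gossip struct, just to
// show a once-per-second guard around the tighten check.
type tightener struct {
  mu            sync.Mutex // assumed held by callers of maybeTightenLocked
  lastTightenAt time.Time
}

// maybeTightenLocked assumes t.mu is already held, mirroring the real
// code's convention. tighten reports whether it actually added a new
// connection (i.e. did real work).
func (t *tightener) maybeTightenLocked(now time.Time, tighten func() bool) {
  const minInterval = time.Second
  if now.Sub(t.lastTightenAt) < minInterval {
    return // checked recently and nothing needed tightening then
  }
  t.lastTightenAt = now
  if tighten() {
    // The check did real work, so allow an immediate re-check on the
    // next trigger rather than waiting out the full interval.
    t.lastTightenAt = time.Time{}
  }
}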


The other more frustrating thing I found when poking around was just how much of the gossip network's capacity is eaten up by "gossip-clients:" infos. They get gossiped more than twice as frequently as liveness information:

$ grep -o "gossip-clients:" logspy-gossip-client.txt | wc -l
    1443
$ grep -o "liveness:" logspy-gossip-client.txt | wc -l
     601
The data looks roughly the same if I check the gossip server logs:
$ grep -o "gossip-clients:" logspy-gossip-server.txt | wc -l
    1407
$ grep -o "liveness:" logspy-gossip-server.txt | wc -l
     689

The above logspy-gossip-client.txt came from collecting 2000 log lines from /debug/logspy?count=2000&duration=30s&grep=gossip&flatten=1&vmodule=client=2 and logspy-gossip-server.txt came from the same but with vmodule=server=2.

There's an old comment in gossip.go that mentions the gossip clients key has to be gossiped very frequently because "we want to detect gossip partitions faster than the node liveness timeout (9s)", but that looks like it was added by @petermattis in 18ce86b1e87b71aeb22bdaf7e39e97b448e3ca50 (#28963), and the main goal of that PR was to make it easier to debug gossip. It appears he also threw in the side goal of eventually getting rid of the sentinel key, but it looks like that never actually happened -- the sentinel key still appears to be what we use to detect partitions/stalls. The "gossip-clients" keys look like they're only used for logging the gossip graph, which is definitely not worth sending one info per node every two seconds. And I think it makes sense to keep it that way, given how much cheaper it is to frequently propagate a single sentinel key than one "gossip-clients" key per node.

It seems to me that we should be sending the gossip-clients keys much less frequently given their very limited use.
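
As a sketch of what "much less frequently" could mean mechanically (every name here is made up for illustration, not the real gossip package API), the clients key could be re-gossiped on its own slow timer instead of piggybacking on connection events:

package gossipsketch

import "time"

// clientsGossipInterval is a made-up interval; the point is only that
// it's much longer than the current effective rate.
const clientsGossipInterval = time.Minute

// startClientsGossipLoop re-gossips the local node's client connections
// once per interval. addInfo and encodeLocalClients stand in for the
// real mechanisms that publish a gossip info and serialize its value.
func startClientsGossipLoop(stop <-chan struct{}, addInfo func(key string, val []byte)) {
  ticker := time.NewTicker(clientsGossipInterval)
  defer ticker.Stop()
  for {
    select {
    case <-ticker.C:
      addInfo("gossip-clients:<local-node-id>", encodeLocalClients())
    case <-stop:
      return
    }
  }
}

func encodeLocalClients() []byte { return nil }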


There are also some other minor inefficiencies around the key names for each gossip info being longer and more verbose than is really needed -- the infos themselves are protocol buffers, but the keys aren't, so we pay for each key name on every send. And even pickier than that, we re-gossip all the fields in the store descriptor every minute (or less) even though only some of them can actually change.

But these are significantly more minor concerns -- it looks like tightenNetwork and the "gossip-clients" keys are the big opportunities for easy improvements.

a-robinson avatar Sep 22 '22 15:09 a-robinson

Dumb example fix for the tightenNetwork performance issue: https://github.com/cockroachdb/cockroach/pull/88472

a-robinson avatar Sep 22 '22 15:09 a-robinson

Looking at it a bit more, visitInfos is also burning a bunch of CPU. There are a couple issues there:

  • We store all infos in a single map regardless of their type, but a number of our visitInfos() calls only care about a particular type of info. If we split the infoStore's infoMap up into a separate map for each type of info and let callers of visitInfos specify the key prefix they want, we could avoid iterating over a bunch of stuff the caller doesn't care about.
  • We keep a few types of infos around for decommissioned nodes forever, it seems. So as more nodes in a cluster are decommissioned and replaced over time, the gossip network wastes more and more cpu iterating over these mostly pointless entries (and hits bottlenecks sooner as a result). In particular, I see values for each of the following key prefixes for all node IDs that have ever existed in a cluster even if they were decommissioned long ago -- distsql-draining:<nodeid>, distsql-version:<nodeid>, and liveness:<nodeid>. Ideally none of them would be needed, but FWIW it seems to me that the liveness key is much more reasonable than the distsql ones.

Splitting up the infos map would probably be a medium-sized change, but I wouldn't expect it to run into any big issues.
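
To sketch what I mean by splitting the map (the type and method names below are made up, not the actual infoStore code):

package gossipsketch

// info stands in for the real gossip info type.
type info struct {
  value []byte
}

// prefixedInfoStore keeps one map per key prefix (e.g. "liveness:",
// "store:", "gossip-clients:") so a visitor interested in only one
// family of keys doesn't have to walk every other entry too.
type prefixedInfoStore struct {
  byPrefix map[string]map[string]info // prefix -> full key -> info
}

func newPrefixedInfoStore() *prefixedInfoStore {
  return &prefixedInfoStore{byPrefix: map[string]map[string]info{}}
}

func (s *prefixedInfoStore) add(prefix, key string, i info) {
  m, ok := s.byPrefix[prefix]
  if !ok {
    m = map[string]info{}
    s.byPrefix[prefix] = m
  }
  m[key] = i
}

// visitPrefix calls visit only for infos under the given prefix,
// instead of iterating over the entire combined map.
func (s *prefixedInfoStore) visitPrefix(prefix string, visit func(key string, i info)) {
  for k, i := range s.byPrefix[prefix] {
    visit(k, i)
  }
}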

I don't know, however, whether there's some reason those old keys are needed for decommissioned node IDs for the lifetime of the cluster. It does seem relevant that while the liveness keys for decommissioned nodes are actively re-gossiped regularly, it looks like the distsql keys haven't been gossiped since the corresponding node was shut down and they're only still around because they got given a ttlStamp of MaxInt64.

a-robinson avatar Sep 22 '22 17:09 a-robinson

I suppose that given that they're never re-gossiped once the nodes are removed, something like the new crdb_internal.unsafe_clear_gossip_info("gossip-key") function from #85013 could be used to manually clear out the old distsql keys from decommissioned nodes. That's certainly less ideal than having them automatically dealt with (e.g. re-sending those gossip infos with a sooner TTL when decommissioning the given node ID) but it's at least a solvable problem for someone who cares.

a-robinson avatar Sep 22 '22 19:09 a-robinson

@kvoli @nvanbenschoten I'm also up for helping with a change that meaningfully changes the frequency of gossiping "gossip-clients" keys given that all they're used for is the periodic logging of the connectivity graph (and manual debugging of gossip issues I guess, although I'd be surprised if that comes up often). But I don't want to write up a change that you guys aren't interested in accepting/maintaining since that may be a bit more contentious than #88472.

For example I personally think a cluster setting to control whether we gossip the gossip-clients keys at all would be reasonable (instead of just gossiping them less frequently), but I don't know what you guys think.

Or if one of the other ideas is more enticing as something you'd want, let me know. But the gossip-clients keys appear to be the most egregious waste of cpu / mutex time.

a-robinson avatar Sep 23 '22 16:09 a-robinson

@kvoli @nvanbenschoten I'm also up for helping with a change that meaningfully changes the frequency of gossiping "gossip-clients" keys given that all they're used for is the periodic logging of the connectivity graph (and manual debugging of gossip issues I guess, although I'd be surprised if that comes up often).

Changing the frequency of gossip-clients appears strictly beneficial. A cluster setting sounds good.

crdb_internal.unsafe_clear_gossip_info("gossip-key") also sounds enticing but a bigger item.

kvoli avatar Sep 27 '22 18:09 kvoli

Changing the frequency of gossip-clients appears strictly beneficial. A cluster setting sounds good.

Which sounds good - a cluster setting to change the frequency or to stop gossiping it entirely? To be clear, if we stop gossiping it entirely then the only real consequences are:

  1. We can no longer include the connectivity graph in the periodic logging of (*Gossip).LogStatus.
  2. We can no longer use it to determine whether each gossiped node is alive in crdb_internal.gossip_nodes. But I don't see why that code couldn't just use the gossiped node liveness keys instead.
  3. The crdb_internal.gossip_network virtual table wouldn't work anymore (other than to display the local node's client connections, but that's obviously not the same thing as what the table currently does).

crdb_internal.unsafe_clear_gossip_info("gossip-key") also sounds enticing but a bigger item.

I'm not sure whether I'm misunderstanding your comment or you missed the link to #85013, but that function already exists as of #85505. I tried it out and it worked well for clearing out old decommissioned nodes' "distsql-version" and "distsql-draining" keys.

a-robinson avatar Sep 27 '22 18:09 a-robinson

It's fine if you aren't interested in a patch at the moment, but I thought I'd share some data from having completely ripped out the "gossip-clients" keys, just to give a bit more evidence of how wasteful they are -- the impact is even bigger than I expected.

I took v22.1.6, then applied two patches: #88472 and https://github.com/a-robinson/cockroach/commit/952c5011bb81e6315d677e1a160f44c90a5ae4e2

Testing on that 95 node cluster that had been fairly bogged down by gossip, deploying the change made a huge difference in the gossip_bytes_sent/received and gossip_infos_sent/received metrics as well as in CPU usage:

Before (aggregated across all nodes in the cluster):
  • Bytes sent: 650 MB/s
  • Bytes received: 310 MB/s
  • Infos sent: 23.9k infos/s
  • Infos received: 21.3k infos/s

After (aggregated across all nodes in the cluster):
  • Bytes sent: 31.3 MB/s
  • Bytes received: 16.2 MB/s
  • Infos sent: 6.5k infos/s
  • Infos received: 6.4k infos/s

CPU usage per node dropped by nearly a full core (by about 90 CPU seconds/second on the 95 node cluster).


The savings are also pretty dramatic at somewhat smaller sizes, with a 75-node cluster saving 40-50 cores and having a before/after of:

Before:
  • Bytes sent: 340 MB/s
  • Bytes received: 165 MB/s
  • Infos sent: 14.9k infos/s
  • Infos received: 13.8k infos/s

After:
  • Bytes sent: 10 MB/s
  • Bytes received: 6 MB/s
  • Infos sent: 3.8k infos/s
  • Infos received: 3.5k infos/s


The numbers were a bit less dramatic on a completely fresh 75-node cluster with no data in it, no load on it, and no history of decommissioned nodes, but it still cut out about two-thirds of the gossiped bytes/infos and saved about 10 cores.

a-robinson avatar Oct 04 '22 18:10 a-robinson

Testing on that 95 node cluster that had been fairly bogged down by gossip, deploying the change made a huge difference in the gossip_bytes_sent/received and gossip_infos_sent/received metrics as well as in CPU usage:

Before (aggregated across all nodes in the cluster):
  • Bytes sent: 650 MB/s
  • Bytes received: 310 MB/s
  • Infos sent: 23.9k infos/s
  • Infos received: 21.3k infos/s

After (aggregated across all nodes in the cluster):
  • Bytes sent: 31.3 MB/s
  • Bytes received: 16.2 MB/s
  • Infos sent: 6.5k infos/s
  • Infos received: 6.4k infos/s

That's a great improvement. I think it would be great to patch in. Thanks for taking a look into this further @a-robinson.

kvoli avatar Oct 05 '22 13:10 kvoli

I think it would be great to patch in.

Sorry for not being totally sure what you mean, but by "it" do you mean a complete rip out like in https://github.com/a-robinson/cockroach/commit/952c5011bb81e6315d677e1a160f44c90a5ae4e2 and accepting the minor changes called out in https://github.com/cockroachdb/cockroach/issues/51838#issuecomment-1259907334? Or something else like adding a cluster setting that disables it?

Personally I'm still in the camp of just ripping it out since I don't think I ever benefited from it even back when I worked on cockroach full time, but I don't have any visibility into how often it comes in handy for you guys these days.

a-robinson avatar Oct 05 '22 14:10 a-robinson

yo @a-robinson, https://github.com/a-robinson/cockroach/commit/952c5011bb81e6315d677e1a160f44c90a5ae4e2 LGTM. These internal vtables are not that useful, and definitely not worth the cost you're seeing. Some of them are being used in roachtests that have either already aged or could look at alternative sources instead. The connectivity graph has come in handy in some recent incidents, but again, I don't think it's worth the cost.

If you want to polish up https://github.com/a-robinson/cockroach/commit/952c5011bb81e6315d677e1a160f44c90a5ae4e2 for a PR, that SGTM. If there's roachtest fallout from it, we're happy to squash them ourselves.

I don't think these changes are going to be backportable -- even if we tack on release notes, it seems unwise to rip out vtables in already published releases. So if you care about backporting it, I'd do so by introducing a cluster setting to stop gossiping the gossip-clients key, left disabled by default. On master, and perhaps 22.2 (which will be out in a few weeks), we can enable the cluster setting. Does that sound all right to you? Also happy to take over backport duty to 22.2 since it's not out yet.

irfansharif avatar Oct 05 '22 16:10 irfansharif

These aged roachtests could also just explicitly enable this cluster setting if we want to keep them (== roachtests) around, but that's all just unnecessary hedginess -- this feels fine to rip out altogether on master.

irfansharif avatar Oct 05 '22 16:10 irfansharif