
grpc-proxy hangs under high load with no start revision

Open schallert opened this issue 6 years ago • 6 comments

Hey folks,

I'm running some benchmarks to validate etcd for a watch-heavy production workload and am running into an issue when incorporating the gRPC proxy into the test. The issue only occurs when clients of the proxy do not specify a start revision for their watch request, and only once there are roughly 3500 unique clients.

Cluster

The cluster consists of 5 nodes. 3 nodes are in region A (east), 2 are in region B (west).

There are 3 dedicated grpc proxy nodes in front of the cluster. All are running etcd 3.3.7 release binaries.

$ ./etcd --version
etcd Version: 3.3.7
Git SHA: 56536de55
Go Version: go1.9.6
Go OS/Arch: linux/amd64

Benchmark

The benchmark is as follows:

  1. Create 3500 etcd clients.
  2. For each of the 3500 clients, create a new clientv3.Watcher.
  3. For each Watcher, create 3000 watches on 3000 predefined keys.
  4. For each of 3000 predefined keys, put a value of 5000 bytes at that key (containing the nanosecond timestamp of the time the Put was issued).
  5. Measure the time for each (of the 10.5m total) watch to receive the update.

The benchmark is running from region B (west).

Results

Direct to cluster (fine)

The benchmark completes in a reasonable amount of time when pointed at just the 2 nodes in its local region. All watchers receive the updates they expect and the cluster remains stable. No special watch options are passed, the watches are created using watcher.Watch(context.Background(), key).

Cluster metrics:

[screenshot]

Proxy (error)

When I put 3 gRPC proxies in front of the cluster and point the benchmark at those 3, the proxies seem to hang during the test. They are fine with 3000 keys and 2500 clients, but at around 3000 clients and above they start hanging. The observed effect is that the cluster reports many slow watchers and the benchmark client stops receiving updates.

Cluster metrics:

[screenshot]

Proxy metrics:

[screenshot]

[screenshot]

I took a goroutine dump while the benchmark clients were still connected and noticed most of the goroutines were blocked on the stream.Recv() call in the proxy.

Another interesting behavior was that after I killed the clients, the "total watchers coalescing" gauges on the proxies were negative:

[screenshot]

I took a goroutine dump after killing the clients, and most of the goroutines were blocked acquiring the lock in the watchRanges.delete method.

Proxy with watch start revision (fine)

Finally, I've noticed that when the client explicitly sets a start revision in its watch request, the proxies do not hang. In this test I first query the cluster for the current store revision, then create the watch with watcher.Watch(context.Background(), key, v3.WithRev(currentRev)). This also seems to be the first case in which the proxies report that they are coalescing watchers.
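The pattern that avoids the hang boils down to "Get once to learn the current revision, then watch from it". Below is a minimal runnable sketch of that control flow, with the two clientv3 calls stubbed behind a tiny interface so it runs without a cluster (the interface and fake are hypothetical; with the real client the revision comes from resp.Header.Revision of a Get, and the watch is watcher.Watch(ctx, key, v3.WithRev(rev))):

```go
package main

import "fmt"

// kv abstracts the two clientv3 calls the pattern needs. These are
// hypothetical stand-ins for this sketch: CurrentRevision corresponds to
// a Get followed by reading resp.Header.Revision, and WatchFrom to
// watcher.Watch(ctx, key, v3.WithRev(rev)).
type kv interface {
	CurrentRevision() int64
	WatchFrom(key string, rev int64) string
}

type fakeKV struct{ rev int64 }

func (f fakeKV) CurrentRevision() int64 { return f.rev }
func (f fakeKV) WatchFrom(key string, rev int64) string {
	return fmt.Sprintf("watch %s from rev %d", key, rev)
}

// watchPinned pins the watch to the store revision observed by a Get, so
// the proxy sees an explicit start revision on the watch-create request.
func watchPinned(c kv, key string) string {
	rev := c.CurrentRevision()
	return c.WatchFrom(key, rev)
}

func main() {
	fmt.Println(watchPinned(fakeKV{rev: 42}, "/bench/key-0"))
}
```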

Cluster metrics:

[screenshot]

Proxy metrics:

[screenshot]

Is this a bug in the proxy, or is it expected that watches can only be coalesced if they specify a start revision? Also, is there some reason the proxies only start hanging at around 3000 clients?

Any guidance would be greatly appreciated. Thanks!

schallert avatar Jun 12 '18 23:06 schallert

I am seeing some other issues with the gRPC proxy now. Will try to track this down.

gyuho avatar Jun 19 '18 17:06 gyuho

Hey @gyuho, have you been able to track down the proxy issues?

schallert avatar Aug 01 '18 17:08 schallert

Sorry, we just resolved other outstanding client issues that might be related. We still have a few more issues with the client; once we resolve them, we will look into this.

gyuho avatar Aug 01 '18 17:08 gyuho

Same problem with the "total watchers coalescing" gauges on the proxies going negative. Any progress on this issue?

tomasliu-agora avatar Sep 24 '20 07:09 tomasliu-agora

Same problem here.

  • etcd version: 3.3.11
  • Proxy version: 3.5.2

I see etcd_debugging_mvcc_watcher_total=1916 on a proxy endpoint, but clients can't get any watch response. PUT, GET, etc. still work fine.

Hope this old bug can be resolved.

HtHuanChen avatar Mar 08 '22 03:03 HtHuanChen

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 21 '22 02:09 stale[bot]