etcd
grpc-proxy hangs under high load with no start revision
Hey folks,
I'm running some benchmarks to validate etcd for a watch-heavy production workload and am running into an issue when incorporating the gRPC proxy into the test. The issue only occurs when clients of the proxy do not specify a start revision for their watch requests, and only when there are roughly 3500 unique clients.
Cluster
The cluster consists of 5 nodes. 3 nodes are in region A (east), 2 are in region B (west).
There are 3 dedicated grpc proxy nodes in front of the cluster. All are running etcd 3.3.7 release binaries.
$ ./etcd --version
etcd Version: 3.3.7
Git SHA: 56536de55
Go Version: go1.9.6
Go OS/Arch: linux/amd64
Benchmark
The benchmark is as follows:
- Create 3500 etcd clients.
- For each of the 3500 clients, create a new clientv3.Watcher.
- For each Watcher, create 3000 watches on 3000 predefined keys.
- For each of the 3000 predefined keys, put a 5000-byte value at that key (containing the nanosecond timestamp of the time the Put was issued).
- Measure the time for each of the 10.5M total watches to receive the update.
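For reference, the client side of the benchmark looks roughly like the sketch below (heavily simplified; the endpoints, key names, and the way the timestamp is encoded in the value are placeholders of mine, not anything etcd-specific). It uses the 3.3 client at github.com/coreos/etcd/clientv3.

package main

import (
	"context"
	"fmt"
	"strconv"
	"sync"
	"time"

	"github.com/coreos/etcd/clientv3"
)

const (
	numClients = 3500
	numKeys    = 3000
	valueSize  = 5000
)

func main() {
	// Placeholder endpoints; the real test points at the proxies (or the cluster).
	endpoints := []string{"http://proxy-1:23790", "http://proxy-2:23790", "http://proxy-3:23790"}

	keys := make([]string, numKeys)
	for i := range keys {
		keys[i] = fmt.Sprintf("/bench/key-%d", i)
	}

	var wg sync.WaitGroup
	for c := 0; c < numClients; c++ {
		cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
		if err != nil {
			panic(err)
		}
		defer cli.Close()

		// One Watcher per client; no WithRev, so each watch starts from "now".
		w := clientv3.NewWatcher(cli)
		for _, key := range keys {
			ch := w.Watch(context.Background(), key)
			wg.Add(1)
			go func(ch clientv3.WatchChan) {
				defer wg.Done()
				for resp := range ch {
					for _, ev := range resp.Events {
						// The first 19 bytes of the value hold the send timestamp (my own encoding).
						sentNanos, _ := strconv.ParseInt(string(ev.Kv.Value[:19]), 10, 64)
						fmt.Println("watch latency:", time.Since(time.Unix(0, sentNanos)))
						return // one event per watch is enough for a single run
					}
				}
			}(ch)
		}
	}

	// Put a 5000-byte value, prefixed with the nanosecond send timestamp, at every key.
	putCli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer putCli.Close()
	padding := make([]byte, valueSize-19)
	for _, key := range keys {
		val := fmt.Sprintf("%019d", time.Now().UnixNano()) + string(padding)
		if _, err := putCli.Put(context.Background(), key, val); err != nil {
			panic(err)
		}
	}

	wg.Wait()
}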
The benchmark is running from region B (west).
Results
Direct to cluster (fine)
The benchmark completes in a reasonable amount of time when pointed at just the 2 nodes in its local region. All watchers receive the updates they expect and the cluster remains stable. No special watch options are passed; the watches are created using watcher.Watch(context.Background(), key).
Cluster metrics:
Proxy (error)
When I put the 3 gRPC proxies in front of the cluster and point the benchmark tool at them, the proxies seem to hang during the test. They are fine with 3000 keys and 2500 clients, but at around 3000 clients and above they start hanging. The observed effect is that the cluster sees many slow watchers and the benchmark clients no longer receive updates.
Cluster metrics:
Proxy metrics:
I took a goroutine dump while the benchmark clients were still connected and noticed most of the goroutines were hanging on the stream.Recv() in the proxy.
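For anyone trying to reproduce, one way to grab an equivalent dump is the standard net/http/pprof goroutine endpoint, assuming pprof is enabled on the proxy and reachable; the address below is a placeholder.

package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// Standard net/http/pprof goroutine dump with full stacks (?debug=2).
	// Assumes pprof is enabled on the proxy; the address is a placeholder.
	resp, err := http.Get("http://proxy-1:23790/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body)
}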
Another interesting behavior was that after I killed the clients, the "total watchers coalescing" gauges on the proxies were negative:
I took a goroutine dump after killing the clients, and most of the goroutines were hanging on acquiring the lock in the watchRanges.delete method.
Proxy with watch start revision (fine)
Finally, I've noticed that when the client explicitly sets a start revision in the watch request, the proxies do not hang. In this test I make a request to the cluster to find out the current store revision, then in my watch create request I use watcher.Watch(context.Background(), key, v3.WithRev(currentRev)). This also seems to be the first case where the proxies believe they are coalescing watchers.
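Concretely, the working variant looks roughly like the sketch below (the endpoint and key are placeholders; the revision comes from the header of an ordinary Get against the cluster).

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://proxy-1:23790"}, // placeholder; the revision lookup goes to the cluster in my test
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Any Get returns the current store revision in its response header.
	resp, err := cli.Get(context.Background(), "foo")
	if err != nil {
		panic(err)
	}
	currentRev := resp.Header.Revision

	// Watching with an explicit start revision is the case where the proxies do not hang.
	watcher := clientv3.NewWatcher(cli)
	ch := watcher.Watch(context.Background(), "foo", clientv3.WithRev(currentRev))
	for wresp := range ch {
		for _, ev := range wresp.Events {
			fmt.Printf("%s %q -> %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}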
Cluster metrics:
Proxy metrics:
Is this a bug in the proxy, or is the proxy expected to only be able to coalesce watches that specify a start revision? Also, is there some reason the proxies only start hanging at around 3000 clients?
Any guidance would be greatly appreciated. Thanks!
I am seeing some other issues with the gRPC proxy now. Will try to track this down.
Hey @gyuho, have you been able to track down the proxy issues?
Sorry, we just resolved other outstanding client issues that might be related. We still have a few more issues with the client. Once we resolve them we will look into this.
Same problem with the "total watchers coalescing" gauges on the proxies going negative; any progress on this issue?
Same problem.
- etcd version: 3.3.11
- Proxy version: 3.5.2
I got etcd_debugging_mvcc_watcher_total=1916 from the proxy's metrics endpoint, but clients don't get any responses (Put, Get, etc. work fine).
Hope this old bug can be resolved.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.