Support paginating by key-value size
What would you like to be added?
Allow RangeRequest to specify a size limit so that the result is paginated if the size of the result exceeds the limit.
Why is this needed?
This would allow clients to constrain the response size and avoid overwhelming the etcd server with expensive requests.
cc @serathius
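To make this concrete, here is a rough client-side sketch. Everything in it except `WithMaxBytes` is the existing clientv3 API; `WithMaxBytes` is a hypothetical option shown only to mark where a byte-size cap would plug into the usual count-based pagination loop.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// listPrefixPaged pages through a prefix with the count-based limit that
// exists today; the commented-out option marks where a byte-size cap
// (what this issue asks for) would go.
func listPrefixPaged(cli *clientv3.Client, prefix string, pageSize int64) error {
	key := prefix
	end := clientv3.GetPrefixRangeEnd(prefix)
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		resp, err := cli.Get(ctx, key,
			clientv3.WithRange(end),
			clientv3.WithLimit(pageSize), // existing: caps keys per page, not bytes
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
			// clientv3.WithMaxBytes(64<<20), // hypothetical: cap total KV bytes per page
		)
		cancel()
		if err != nil {
			return err
		}
		for _, kv := range resp.Kvs {
			fmt.Printf("%s: %d bytes\n", kv.Key, len(kv.Value))
		}
		if !resp.More || len(resp.Kvs) == 0 {
			return nil
		}
		// Resume just after the last key of this page. For a consistent snapshot
		// across pages you would also pin WithRev(resp.Header.Revision).
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	if err := listPrefixPaged(cli, "/registry/secrets/", 500); err != nil {
		log.Fatal(err)
	}
}
```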
Is this for Kubernetes use case or general case?
FYI, Kubernetes recently added `maxLimit = 10000`.
Thanks for the link! By default the object size is limited to 1.5MiB. The pageSize can be used to reduce memory spikes. In Kubernetes, it seems this can resolve the issue.
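As a rough back-of-the-envelope (my own numbers, just for intuition): with objects capped at the default 1.5MiB, a single page of `maxLimit = 10000` objects could in the worst case carry about 10000 × 1.5 MiB ≈ 14.6 GiB of values, so a count limit alone does not bound the memory etcd has to allocate to build a response; a byte-size limit would.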
> Is this for Kubernetes use case or general case?

I am thinking of the Kubernetes use case, but it definitely benefits general cases as well.
> FYI, Kubernetes recently added `maxLimit = 10000`

It seems to somewhat mitigate a similar problem. AIUI, its intention is actually to increase the response size to reduce the number of calls to etcd.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
@ahrtr can you please also try re-opening this?
Note, I'm not sure large objects are as much of a concern as large numbers of small objects. Response size can be controlled on the user side as long as they limit the maximal size of an object and the number of objects per response.
Context https://github.com/kubernetes/enhancements/issues/2340#issuecomment-1640185697
Small objects are much more disruptive for clients. See the difference in CPU usage between "Disabled Large Objects" and "Disabled small objects" for the apiserver. The difference is between 3 CPUs and 72 CPUs.
Overall I still think the better long-term solution is to introduce a watch cache to the etcd client to get the same awesome benefits (2-10 times CPU reduction, 20-50 times latency reduction) instead of creating a workaround.
As far as the problem this issue aims to address is concerned, the total size is most relevant, as it results in proportional memory usage on etcd. Fixing this is a measure to avoid a query-of-death (i.e. OOM) more than a performance improvement.
Besides that, I don't see a major conflict in having both a long-term and a short-term solution at the same time.

> Besides that, I don't see a major conflict in having both a long-term and a short-term solution at the same time.

I expect Kubernetes will have consistent reads from cache before etcd releases v3.6. Getting the watch cache out will not be far off, and the short-term solution would be immediately deprecated.
IMO, this pagination by size is still helpful after we have consistent reads in k8s, because k8s still needs to make non-streaming requests to etcd, which might be too large and blow up etcd memory usage, right?
No, the goal is to serve non-streaming requests from the watch cache.
You mean the apiserver's watch cache, right? What about when the apiserver starts and gets all objects from etcd into its own cache?
A single get-all-objects request is not a big problem; the problem is frequent fetch requests done by badly written Kubernetes controllers. Such requests require a lot of allocations and waste time on proto marshalling/unmarshalling, causing etcd memory to grow in an uncontrolled way.
Note that all Kubernetes requests are sharded by resource, and Kubernetes limits the size of a response to 2GB, meaning that an initial request for all resources allocates maybe a couple to tens of GB. It's big, but not horrible. Compare that to multiple controllers that fetch a single 1GB resource every second.
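For illustration, the anti-pattern described above looks roughly like this (a sketch with client-go, not code from any real controller):

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// A full relist on a timer instead of an informer/watch. With an empty
	// resourceVersion the list bypasses the apiserver watch cache and is
	// served as a quorum range over the whole secrets prefix in etcd.
	for range time.Tick(time.Second) {
		secrets, err := client.CoreV1().Secrets(metav1.NamespaceAll).List(
			context.TODO(), metav1.ListOptions{})
		if err != nil {
			log.Println(err)
			continue
		}
		log.Printf("relisted %d secrets", len(secrets.Items))
	}
}
```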
In #16300, the newly implemented maxbytes defaults to 0, which would not break any existing use case, and for cases where users do want to control the max size, it could be useful. Note that kube-apiserver's `maxLimit` limits the object count, but objects with large sizes and many revisions would still cause problems.
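To sketch the mechanism (the general shape of it, not the actual #16300 code): the server stops filling the page once the accumulated key+value bytes exceed the limit and sets `more`, with 0 meaning no limit, so something along the lines of:

```go
package sizelimit

import "go.etcd.io/etcd/api/v3/mvccpb"

// truncateByBytes cuts a range result once the accumulated key+value size
// exceeds maxBytes. maxBytes == 0 keeps today's behavior (no size limit),
// and at least one KV is always returned so a paginating client can make progress.
func truncateByBytes(kvs []*mvccpb.KeyValue, maxBytes int64) (page []*mvccpb.KeyValue, more bool) {
	if maxBytes <= 0 {
		return kvs, false
	}
	var total int64
	for i, kv := range kvs {
		total += int64(len(kv.Key) + len(kv.Value))
		if total > maxBytes && i > 0 {
			return kvs[:i], true
		}
	}
	return kvs, false
}
```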
I don't see a strong reason not to merge. @serathius Do you have any specific concerns?
I don't have any concerns about compatibility, more about feature bloat. Based on my understanding of the SIG api-machinery roadmap, this feature will never be used by Kubernetes.
I don't think etcd should implement a feature for Kubernetes without a prior design and approval from Kubernetes stakeholders. Before we add work for ourselves, let's get LGTMs from SIG api-machinery.
cc @jpbetz @deads @logicalhan @wojtek-t
@linxiulei I share the same understanding as https://github.com/etcd-io/etcd/issues/14809#issuecomment-1723136129. Please raise a KEP in Kubernetes and get K8s's approval first. I don't think the etcd community will reject a feature which is useful and approved by K8s.
It turns out we have a case where we ran into the response size limit because of the size of the objects. We got these logs from the apiserver:
```
I0821 23:03:24.662977 17 trace.go:205] Trace[419549052]: "List(recursive=true) etcd3" key:/secrets,resourceVersion:,resourceVersionMatch:, limit:10000,continue: (21-Aug-2023 23:03:23.767) (total time: 895 ms): Trace[419549052]: [895.681318ms] [895.681318ms] END
W0821 23:03:24.662996 17 reflector.go:324] storage/cacher.go:/secrets: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs. 2147483647)
E0821 23:03:24.663007 17 cacher.go:425] cacher (*core.Secret): unexpected ListAndWatch error: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs. 2147483647); reinitializing...
{"level":"warn","ts":"2023-08-17T23:03:24.662Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00357ae00/10.1.1.3:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs. 2147483647)"}
```
Imagine a case where there's just a load of big secrets. I'm still figuring out whether this is caused by badly written operators :)
We can't really change the gRPC limit, because it's already int32.Max: https://github.com/grpc/grpc-go/blob/8cb98464e5999aa2fd57bbf5b23bd5a634f4b2f5/server.go#L59
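For reference, the failed response above was 2169698338 bytes ≈ 2.02 GiB, i.e. only about 21 MiB over the 2147483647-byte (2^31 − 1) cap, so even a fairly coarse size-based page split would have kept each page comfortably under the limit.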
We should enable some kind of paging for range responses, but I agree that this needs to go the KEP route first.
As a start, I've quickly added a metric that allows us to alert on common large range requests (secrets / configmaps / image streams) with #16881.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.