
Support paginating by key-value size

Open linxiulei opened this issue 3 years ago • 19 comments

What would you like to be added?

Allow RangeRequest to specify a size limit to paginate the result if the size of result exceeds the limit.

Why is this needed?

This would allow clients to constrain the response size to avoid overwhelming etcd server with expensive requests.
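
In the absence of a server-side knob, a rough client-side sketch (Go, using go.etcd.io/etcd/client/v3; the endpoint, prefix, page size, and 1 MiB budget below are illustrative values, not part of this proposal) of approximating size-bounded pagination by fetching fixed-count pages and stopping once a byte budget is reached:

package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/api/v3/mvccpb"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// rangeWithByteBudget pages through keys under prefix in ascending key order and
// stops once the accumulated key+value size reaches maxBytes. It only approximates
// a server-side byte limit: each fetched page (up to pageSize entries) is still
// fully materialized by the server and the client before the budget check runs.
func rangeWithByteBudget(ctx context.Context, cli *clientv3.Client, prefix string, pageSize int64, maxBytes int) ([]*mvccpb.KeyValue, error) {
	var (
		out   []*mvccpb.KeyValue
		total int
	)
	key := prefix
	end := clientv3.GetPrefixRangeEnd(prefix)

	for {
		resp, err := cli.Get(ctx, key,
			clientv3.WithRange(end),
			clientv3.WithLimit(pageSize),
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
		)
		if err != nil {
			return nil, err
		}
		for _, kv := range resp.Kvs {
			out = append(out, kv)
			total += len(kv.Key) + len(kv.Value)
			if total >= maxBytes {
				return out, nil // byte budget reached; a caller could resume from the next key
			}
		}
		if !resp.More || len(resp.Kvs) == 0 {
			return out, nil
		}
		// Continue from just after the last returned key.
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	kvs, err := rangeWithByteBudget(context.Background(), cli, "/registry/secrets/", 1000, 1<<20)
	if err != nil {
		panic(err)
	}
	fmt.Printf("fetched %d key-values\n", len(kvs))
}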

linxiulei avatar Nov 19 '22 16:11 linxiulei

cc @serathius

linxiulei avatar Nov 19 '22 16:11 linxiulei

Is this for Kubernetes use case or general case?

FYI, Kubernetes recently added maxLimit = 10000

lavacat avatar Nov 19 '22 19:11 lavacat

Is this for Kubernetes use case or general case?

FYI, Kubernetes recently added maxLimit = 10000

Thanks for the link! By default the object size is limited to 1.5MiB. The pageSize can be used to reduce memory spikes. In Kubernetes, it seems this can resolve the issue.
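
For context, a rough sketch (Go, assuming k8s.io/client-go against the core v1 Secrets API; the kubeconfig path and the 500-item page size are illustrative) of how count-based pagination works on the Kubernetes side; it bounds the number of objects per page, not the bytes per page:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	total := 0
	opts := metav1.ListOptions{Limit: 500} // caps objects per page, not bytes per page
	for {
		list, err := cs.CoreV1().Secrets(metav1.NamespaceAll).List(context.Background(), opts)
		if err != nil {
			panic(err)
		}
		total += len(list.Items)
		if list.Continue == "" {
			break
		}
		opts.Continue = list.Continue // resume the list where the previous page ended
	}
	fmt.Printf("listed %d secrets\n", total)
}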

fuweid avatar Nov 20 '22 05:11 fuweid

Is this for Kubernetes use case or general case?

I am thinking of the Kubernetes use case, but it definitely benefits the general case as well.

FYI, Kubernetes recently added maxLimit = 10000

It seems to somewhat mitigate a similar problem. AIUI, its actual intention is to increase the response size in order to reduce the number of calls to etcd.

linxiulei avatar Nov 20 '22 10:11 linxiulei

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 18 '23 09:03 stale[bot]

@ahrtr can you please also try re-opening this?

linxiulei avatar Jul 25 '23 10:07 linxiulei

Note, I'm not sure large objects are as much of a concern as large numbers of small objects. Response size can be controlled on the user side as long as they limit the maximum size of an object and the number of objects per response.

Context https://github.com/kubernetes/enhancements/issues/2340#issuecomment-1640185697

Small objects are much more disruptive for clients. See the difference in CPU usage between "Disabled Large Objects" and "Disabled small objects" for the apiserver. The difference is between 3 CPUs and 72 CPUs.

Overall I still think that the better long-term solution is to introduce a watch cache in the etcd client to get the same awesome benefits (2-10 times CPU reduction, 20-50 times latency reduction) instead of creating a workaround.

serathius avatar Jul 26 '23 09:07 serathius

As far as the problem this issue addresses is concerned, the total size is most relevant, as it results in proportional memory usage on the etcd side. Fixing this is a measure to avoid query-of-death (i.e. OOM) more than a performance improvement.

Besides that, I don't see a major conflict in having both a long-term and a short-term solution at the same time.

linxiulei avatar Jul 26 '23 10:07 linxiulei

Besides that, I don't see a major conflict in having both a long-term and a short-term solution at the same time.

I expect Kubernetes will have consistent reads from cache before etcd releases v3.6. Getting the watch cache out will not be far behind. The short-term solution would be deprecated immediately.

serathius avatar Jul 26 '23 11:07 serathius

IMO, this pagination by size is still helpful after we have consistent reads in k8s, because k8s still needs to hit etcd with non-streaming requests, which might be too large and blow up etcd memory usage, right?

linxiulei avatar Sep 14 '23 10:09 linxiulei

No, the goal is to serve non-streaming requests from the watch cache.

serathius avatar Sep 14 '23 10:09 serathius

You mean the apiserver's watch cache, right? What about when the apiserver starts and loads all objects from etcd into its own cache?

linxiulei avatar Sep 14 '23 10:09 linxiulei

A single get-all-objects request is not a big problem; the problem is frequent fetch requests made by badly written Kubernetes controllers. Such requests require a lot of allocations and waste time on proto marshalling/unmarshalling, causing etcd memory to grow in an uncontrolled way.

Note that all Kubernetes requests are sharded by resource, and Kubernetes limits the response size to 2GB, meaning that an initial request for all resources allocates maybe a couple to tens of GB. It's big, but not horrible. Compare that to multiple controllers each fetching a single 1GB resource every second.

serathius avatar Sep 14 '23 11:09 serathius

In #16300, the newly implemented maxbytes defaults to 0, which would not break any existing use case, and for cases where users do want to control the max size, this could be useful. Note that kube-apiserver's maxLimit limits the object count, but objects with large sizes and many revisions would still cause problems.

I don't see a strong reason not to merge. @serathius Do you have any specific concerns?
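
To illustrate the mechanics only (this is not the actual #16300 implementation; the type and field names below are hypothetical), a rough Go sketch of how a byte cap could truncate a range result and signal, via a more flag, that the client should continue from the next key:

package main

import "fmt"

// Hypothetical, simplified stand-ins for etcd's key-value and range response types.
type keyValue struct {
	Key, Value []byte
}

type rangeResult struct {
	Kvs  []*keyValue
	More bool // true if the result was truncated by the count limit or the byte cap
}

// rangeWithMaxBytes appends key-values until either the count limit or the byte
// budget is hit; maxBytes == 0 means no byte cap, mirroring a 0 default.
func rangeWithMaxBytes(all []*keyValue, limit, maxBytes int64) *rangeResult {
	res := &rangeResult{}
	var size int64
	for i, kv := range all {
		if limit > 0 && int64(i) >= limit {
			res.More = true
			break
		}
		size += int64(len(kv.Key) + len(kv.Value))
		// Always return at least one key-value so the client can make progress,
		// even if that single entry alone exceeds the budget.
		if maxBytes > 0 && size > maxBytes && len(res.Kvs) > 0 {
			res.More = true
			break
		}
		res.Kvs = append(res.Kvs, kv)
	}
	return res
}

func main() {
	kvs := []*keyValue{
		{Key: []byte("a"), Value: make([]byte, 600)},
		{Key: []byte("b"), Value: make([]byte, 600)},
		{Key: []byte("c"), Value: make([]byte, 600)},
	}
	res := rangeWithMaxBytes(kvs, 0, 1000) // 1000-byte budget: only "a" fits
	fmt.Println(len(res.Kvs), res.More)    // prints: 1 true
}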

wenjiaswe avatar Sep 14 '23 18:09 wenjiaswe

I don't have any concerns about compatibility; my concern is more about feature bloat. Based on my understanding of the SIG api-machinery roadmap, this feature will never be used by Kubernetes.

I don't think etcd should implement a feature for Kubernetes without a prior design and approval from Kubernetes stakeholders. Before we add work for ourselves, let's get LGTMs from SIG api-machinery.

cc @jpbetz @deads @logicalhan @wojtek-t

serathius avatar Sep 18 '23 10:09 serathius

@linxiulei I share the same understanding as https://github.com/etcd-io/etcd/issues/14809#issuecomment-1723136129. Please raise a KEP in Kubernetes and get K8s's approval first. I don't think the etcd community will reject a feature that is useful and approved by K8s.

ahrtr avatar Sep 18 '23 11:09 ahrtr

It turns out we have a case where we ran into the size aspect of this problem. We got these logs from the apiserver:

I0821 23:03:24.662977      17 trace.go:205] Trace[419549052]: "List(recursive=true) etcd3" key:/secrets,resourceVersion:,resourceVersionMatch:, limit:10000,continue: (21-Aug-2023 23:03:23.767) (total time: 895 ms): Trace[419549052]: [895.681318ms] [895.681318ms] END
W0821 23:03:24.662996      17 reflector.go:324] storage/cacher.go:/secrets: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs.  2147483647)
E0821 23:03:24.663007      17 cacher.go:425] cacher (*core.Secret): unexpected ListAndWatch error: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs. 2147483647); reinitializing...

{"level":"warn","ts":"2023-08-17T23:03:24.662Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00357ae00/10.1.1.3:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs. 2147483647)"}

Imagine a case where there's just a load of big secrets. I'm still figuring out whether this is caused by badly written operators :)

We can't really change the gRPC limit, because it's already int32.Max: https://github.com/grpc/grpc-go/blob/8cb98464e5999aa2fd57bbf5b23bd5a634f4b2f5/server.go#L59

We should enable some kind of paging for range responses, but I agree that this needs to go the KEP route first.
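
For reference, a rough sketch (Go, using go.etcd.io/etcd/client/v3; the endpoint and sizes are illustrative) of the client-side knobs that exist today. They can be raised, but only up to the int32 ceiling that the response in the logs above already exceeded:

package main

import (
	"math"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// MaxCallSendMsgSize and MaxCallRecvMsgSize map to the gRPC per-call message
	// size options; the effective ceiling is math.MaxInt32 (2147483647 bytes),
	// which is exactly the limit the oversized Secret list in the logs above hit.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:          []string{"localhost:2379"},
		DialTimeout:        5 * time.Second,
		MaxCallSendMsgSize: 10 * 1024 * 1024, // 10 MiB for outgoing requests
		MaxCallRecvMsgSize: math.MaxInt32,    // cannot be raised any further
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
}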

tjungblu avatar Oct 27 '23 12:10 tjungblu

As a start, I've quickly added a metric that allows us to alert on common large range requests (secrets / configmaps / image streams) with #16881.

tjungblu avatar Nov 07 '23 14:11 tjungblu

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 18 '25 00:06 github-actions[bot]