kubectl uses far more memory than expected
We use kubectl in some jobs to check on the cluster. However, we have found that it can be rather costly in memory use on both small and large clusters. Even more surprising, getting the results as YAML costs more than twice the memory of getting the results as JSON (even though the YAML output is usually half the size of the JSON output).
What happened:
First, the amount of memory used has gone up over the versions of Kubernetes: a significant increase in 1.18 vs 1.15 (the prior time I looked).
However, something is clearly wrong with kubectl, since the amount of memory used is far beyond reasonable.
Using the /usr/bin/time command to measure process resource consumption, and wc to count the characters of output, against one of our clusters: the JSON output is 479,416,955 characters, while kubectl uses 4,143,732K of memory (4 gigabytes!), around 9 times the size of the output.
What is worse, doing the same with YAML produces noticeably smaller output, 233,672,375 characters (YAML being a far more compact representation), and yet consumes 14,683,072K (yes, 14 gigabytes!), over 60 times the size of the output and more than 3 times the memory of producing the same data in JSON!
$ /usr/bin/time -v kubectl get pods --all-namespaces --output json | wc --chars
Command being timed: "kubectl get pods --all-namespaces --output json"
User time (seconds): 74.95
System time (seconds): 1.75
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:17.59
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4143732
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 169
Minor (reclaiming a frame) page faults: 42114
Voluntary context switches: 48882
Involuntary context switches: 2159
Swaps: 0
File system inputs: 38976
File system outputs: 96
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
479416955
vs
$ /usr/bin/time -v kubectl get pods --all-namespaces --output yaml | wc --chars
Command being timed: "kubectl get pods --all-namespaces --output yaml"
User time (seconds): 166.93
System time (seconds): 16.21
Percent of CPU this job got: 121%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:30.86
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 14683072
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 61
Minor (reclaiming a frame) page faults: 1607011
Voluntary context switches: 44447
Involuntary context switches: 6393
Swaps: 0
File system inputs: 4680
File system outputs: 96
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
233672375
What you expected to happen: I would expect this output to be produced using at most around 1x the data size in memory, since the serialized form has a lot of repeated elements (field names) that should not need to be repeated in memory.
Even worse is the drastic jump in memory consumption based on output format: 14 gigabytes of peak RAM use for YAML output!
For contrast, see these two calls (same thing) to a much, much smaller cluster:
JSON output size: 3,385,403 characters, with 69,528K of RAM used.
YAML output size: 1,608,293 characters, with 182,472K of RAM used.
The YAML run uses nearly 3 times the memory of the JSON run (and, again, produces half as much output).
$ /usr/bin/time -v kubectl get pods --all-namespaces --output json | wc --chars
Command being timed: "kubectl get pods --all-namespaces --output json"
User time (seconds): 0.69
System time (seconds): 0.02
Percent of CPU this job got: 108%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.67
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 69528
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 7
Minor (reclaiming a frame) page faults: 1530
Voluntary context switches: 1115
Involuntary context switches: 81
Swaps: 0
File system inputs: 1064
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
3385403
vs
$ /usr/bin/time -v kubectl get pods --all-namespaces --output yaml | wc --chars
Command being timed: "kubectl get pods --all-namespaces --output yaml"
User time (seconds): 1.36
System time (seconds): 0.13
Percent of CPU this job got: 129%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.15
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 182472
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 7
Minor (reclaiming a frame) page faults: 2616
Voluntary context switches: 1058
Involuntary context switches: 72
Swaps: 0
File system inputs: 1224
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
1608293
How to reproduce it (as minimally and precisely as possible): See above. The cluster above is not tiny; this effect is much more impactful on larger clusters, but even on small clusters the memory consumption is far larger than the size of the data.
Anything else we need to know?: We have small periodic jobs that use kubectl to do some work. We like the abstraction this gives us over the underlying APIs for these higher-level jobs. They have also worked across Kubernetes versions, other than the resource requirements.
We had already switched to JSON output and processing due to kubectl's large memory footprint for YAML, but have again hit issues with the large memory footprint (over 4GB just for the kubectl process).
Environment:
- Kubernetes 1.18.6 and kubectl 1.18.6
- Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
- Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T19:08:46Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
- Running in Azure
- OS: Ubuntu 18.04.5 LTS (workstation and all cluster VMs)
@Michael-Sinz thanks for reporting!
I have a feeling this is related to https://github.com/kubernetes/kubectl/issues/866.
For triage's sake, could you please test this with v1.19.3 and the v1.20 beta, since you have access to a large cluster?
https://storage.googleapis.com/kubernetes-release/release/v1.19.3/bin/linux/amd64/kubectl https://storage.googleapis.com/kubernetes-release/release/v1.20.0-beta.1/bin/linux/amd64/kubectl
A quick test (will do more tomorrow) shows that 1.20.0-beta.1 actually uses slightly more memory (a very small difference), but I need to get access to the large production clusters to do a full verification, as it is after business hours now and things are scaling down from their peak usage. (We see 4x to 10x scale variation from low to high during the day.)
I am also still seeing significantly higher memory use for yaml output than for json output (multiples of the output size).
I have now run 1.20.0-beta.1 and 1.19.3 against a large cluster. Note that the cluster is rather dynamic, so the runs are not of the exact same state. I also could not complete the yaml run, as 1.20.0-beta.1 ran out of memory on my access VM (which has 16GB of RAM).
The json run does complete, but it still uses over 9x as much memory as the size of the output (7.3GB of RSS for 781MB of json output).
$ /usr/bin/time -v ./kubectl-1.20.0-beta.1 get pods --all-namespaces -o json | wc --chars
Command being timed: "./kubectl-1.20.0-beta.1 get pods --all-namespaces -o json"
User time (seconds): 111.33
System time (seconds): 4.89
Percent of CPU this job got: 93%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:04.61
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 7367652
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 253
Minor (reclaiming a frame) page faults: 74874
Voluntary context switches: 130416
Involuntary context switches: 3827
Swaps: 0
File system inputs: 68680
File system outputs: 1440
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
781719220
However, based on minor testing, it seems that 1.20 is better than 1.19:
$ /usr/bin/time -v ./kubectl-1.19.3 get pods --all-namespaces -o json | wc --chars
Command being timed: "./kubectl-1.19.3 get pods --all-namespaces -o json"
User time (seconds): 109.72
System time (seconds): 4.85
Percent of CPU this job got: 82%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:18.09
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 7662328
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 254
Minor (reclaiming a frame) page faults: 77365
Voluntary context switches: 117720
Involuntary context switches: 2963
Swaps: 0
File system inputs: 65032
File system outputs: 1440
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
779475036
Just for comparison, here is the kubectl 1.18.6 run against the cluster. Note that it seems to use less memory than 1.19 or 1.20 (but, again, conditions continue to change in the cluster; I would use output size vs RSS as the measure at this scale).
$ /usr/bin/time -v kubectl get pods --all-namespaces -o json | wc --chars
Command being timed: "kubectl get pods --all-namespaces -o json"
User time (seconds): 119.21
System time (seconds): 2.91
Percent of CPU this job got: 92%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:11.32
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 6952484
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 38
Minor (reclaiming a frame) page faults: 60523
Voluntary context switches: 81172
Involuntary context switches: 2872
Swaps: 0
File system inputs: 6184
File system outputs: 96
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
757897460
This came up in our sig-cli meeting today.
This might be addressed by not sorting the data, or by streaming the output as it comes in.
We currently toss everything into a kind: List, which is a type of mixed resources. This wasn't really intended to be used externally (#984).
/triage accepted /priority backlog
I do find it interesting that the memory use is so far beyond the size of the JSON (and even worse for the YAML; it is still unclear how we can use that much more memory just because the output is YAML).
While the kind: List may be a part of this, I think the bigger problem is more fundamental.
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
As far as I know, this issue is still live - the amount of memory used to get the complete pod status of the cluster is amazingly large for non-trivial clusters. Far more than expected given the size of the (rather verbose) json formatted output.
/remove-lifecycle stale
@Michael-Sinz turns out we have some profiling built-in to kubectl.
Can you please run the following and upload the dumps?
kubectl get pods -A --profile allocs --profile-output allocs-plain.pprof > /dev/null
kubectl get pods -A --profile heap --profile-output heap-plain.pprof > /dev/null
kubectl get pods -A --profile cpu --profile-output cpu-plain.pprof > /dev/null
kubectl get pods -A -o json --profile allocs --profile-output allocs-json.pprof > /dev/null
kubectl get pods -A -o json --profile heap --profile-output heap-json.pprof > /dev/null
kubectl get pods -A -o json --profile cpu --profile-output cpu-json.pprof > /dev/null
kubectl get pods -A -o yaml --profile allocs --profile-output allocs-yaml.pprof > /dev/null
kubectl get pods -A -o yaml --profile heap --profile-output heap-yaml.pprof > /dev/null
kubectl get pods -A -o yaml --profile cpu --profile-output cpu-yaml.pprof > /dev/null
If you want to take a peek at what you're uploading, you can run go tool pprof -top allocs-json.pprof.
I will have to see what I can poke at. I assume no details about the cluster will be in the profile reports, right? (I will take a look at the code in a bit)
I may not have the same size cluster available right now - we dynamically scale rather drastically all the time (up and down) so the overall scale of the cluster changes. (In addition to moving various workloads between different clusters)
But I should have some interesting files to post soon.
I assume no details about the cluster will be in the profile reports, right?
From what I can tell.
Here's the proto it writes out.
Note that our clusters are relatively dynamic so I can not assure you that the same state exists across all of the runs.
I did two clusters, a smaller one and a larger one. The larger one took a long time to complete, and I had to use a different machine to run the test, as the memory use was too high for my normal dev VM. (It was literally OOM-killed by the kernel on my regular dev VM.)
I hope the data provides what you need. A quick look at the allocs shows a major cost for yaml; it is almost as if it produces the json in memory and then produces the yaml from that. Both show a rather large amount of memory usage.
json:
Showing nodes accounting for 12527.31MB, 88.12% of 14216.50MB total
Dropped 221 nodes (cum <= 71.08MB)
Showing top 10 nodes out of 106
flat flat% sum% cum cum%
3682.55MB 25.90% 25.90% 4989.50MB 35.10% encoding/json.(*decodeState).objectInterface
2312.24MB 16.26% 42.17% 2312.24MB 16.26% encoding/json.(*Decoder).refill
2112.20MB 14.86% 57.03% 2112.20MB 14.86% bytes.makeSlice
865.84MB 6.09% 63.12% 865.84MB 6.09% encoding/json.(*RawMessage).UnmarshalJSON
833.08MB 5.86% 68.98% 833.08MB 5.86% reflect.mapiterinit
761.46MB 5.36% 74.33% 937.93MB 6.60% encoding/json.unquote
758.13MB 5.33% 79.66% 3121.75MB 21.96% encoding/json.mapEncoder.encode
436.69MB 3.07% 82.74% 2676.60MB 18.83% encoding/json.Marshal
390.51MB 2.75% 85.48% 390.51MB 2.75% reflect.copyVal
374.62MB 2.64% 88.12% 481.12MB 3.38% k8s.io/kubernetes/vendor/github.com/imdario/mergo.deepMerge
yaml:
Showing nodes accounting for 23816.33MB, 80.09% of 29735.36MB total
Dropped 233 nodes (cum <= 148.68MB)
Showing top 10 nodes out of 148
flat flat% sum% cum cum%
9192.17MB 30.91% 30.91% 9821.80MB 33.03% k8s.io/kubernetes/vendor/gopkg.in/yaml%2ev2.yaml_emitter_emit
3760.05MB 12.65% 43.56% 5028.68MB 16.91% encoding/json.(*decodeState).objectInterface
2518.07MB 8.47% 52.03% 2518.07MB 8.47% bytes.makeSlice
2270.52MB 7.64% 59.66% 2270.52MB 7.64% encoding/json.(*Decoder).refill
1653.68MB 5.56% 65.22% 1653.68MB 5.56% k8s.io/kubernetes/vendor/gopkg.in/yaml%2ev2.(*parser).node
1144.60MB 3.85% 69.07% 1144.60MB 3.85% reflect.mapiterinit
881.16MB 2.96% 72.04% 3468.13MB 11.66% encoding/json.mapEncoder.encode
876.60MB 2.95% 74.98% 876.60MB 2.95% encoding/json.(*RawMessage).UnmarshalJSON
760.70MB 2.56% 77.54% 918.10MB 3.09% encoding/json.unquote
758.77MB 2.55% 80.09% 758.77MB 2.55% reflect.mapassign
/cc
go tool pprof -png -nodecount 10000 k8s.profile.large/allocs-yaml.pprof
whoa! @eddiezane
cc @liggitt
Almost like it produces the json in memory and then produces the yaml.
This is correct: in Kubernetes, yaml is produced by translation from json. Ironic, for a system that gets so many complaints about yaml, that it never actually uses yaml internally.
Note that Go will typically make a heap twice as big as it needs for your data; you can adjust this with the GOGC environment variable, but only so far.
I tend to agree that #866 is the real underlying problem.
I think #866 is part of the problem, but there must be some serious inefficiencies when even the textual form of the json (which is rather verbose, indented, etc.) is many times smaller than the memory needed to produce it. Something very wrong is going on when that is the case. (Not to mention that the yaml is a translation of the json rather than a serialization of the same objects that produced the json; I don't really care, since I can consume the json just as well as the yaml.)
RAM used: 6,952,484K
Size of json output: 757,897K
Where is the extra memory being used? Even if the peak is 2x the actual live data, we are still at 3,476,242K of memory. Where did that extra roughly 2.7GB of RAM go, over and above the text form of the json data? How much additional data or structure could be needed, given that the json has so much redundant data in it due to the key:value syntax?
there must be some serious inefficiencies when even the textual form of json (which is rather verbose and indented/etc) is many times smaller than the memory needed to produce it
that's not surprising at all. a json map can be represented in two bytes: {}. a non-nil empty map[string]interface{} object takes much more memory than that.
First, we are not talking about empty sets; they are special cases and harder to measure in either representation. Once you have thousands of objects in there, things should look different.
With a few thousand pods, you would have the 'metadata' key (and all of the other keys in the maps) a few thousand times, but they don't need to be duplicated across the maps in memory. And yet, in the actual json output, they are duplicated due to the format.
When I wrote parsers and data storage for things like this, the keys were always mapped into unique entries. Only the customer data might need to be unique each time (and even that is often false in kubernetes, since there is a lot of duplication there too), so I would look at string tables or unique-string types of solutions. Strings are usually treated as immutable (you replace the whole thing rather than editing in place), so this works out amazingly well.
We should be able to get much closer to the actual information content (in the information-theory sense) in memory, and be faster and more efficient because of it, since we would not bust the caches so hard (or make the working set so large).
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
Any updates on this issue? I've run into this problem as well. Getting ~600 resources within a namespace uses 400MB; the resulting yaml is 22MB, the json 31MB. Only getting the names and then separately fetching each manifest would take too long, so that isn't an option.
/help
@pacoxu: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
In response to this:
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.