
kubectl uses far more memory than expected

Open Michael-Sinz opened this issue 4 years ago • 30 comments

We use kubectl in some jobs to check on the cluster. However, we have found that it can be rather costly in memory, on both small and large clusters. Even more surprising, getting the results as YAML costs more than twice the memory of getting them as JSON (even though the YAML output is usually half the size of the JSON output).

What happened:
First, the amount of memory used has gone up across Kubernetes versions - there is a significant increase in 1.18 compared to 1.15 (the last time I looked).

However, there is clearly something wrong in kubectl, since the amount of memory used is far beyond reasonable.

Using the /usr/bin/time command to measure process resource consumption for these commands against one of our clusters, and using wc to count the characters of output, we see that the JSON output is 479,416,955 characters while kubectl uses 4,143,732K of memory (over 4 gigabytes!) - around 9 times the size of the output!

What is worse, when doing the same with YAML we get noticeably smaller output - 233,672,375 characters (YAML being a far more compact representation) - and yet kubectl consumes 14,683,072K (yes, over 14 gigabytes!), which is more than 60 times the size of the output and over 3 times as much memory as producing the same data in JSON!

$ /usr/bin/time -v kubectl get pods --all-namespaces --output json | wc --chars
        Command being timed: "kubectl get pods --all-namespaces --output json"
        User time (seconds): 74.95
        System time (seconds): 1.75
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:17.59
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4143732
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 169
        Minor (reclaiming a frame) page faults: 42114
        Voluntary context switches: 48882
        Involuntary context switches: 2159
        Swaps: 0
        File system inputs: 38976
        File system outputs: 96
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
479416955

vs

$ /usr/bin/time -v kubectl get pods --all-namespaces --output yaml | wc --chars
        Command being timed: "kubectl get pods --all-namespaces --output yaml"
        User time (seconds): 166.93
        System time (seconds): 16.21
        Percent of CPU this job got: 121%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:30.86
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 14683072
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 61
        Minor (reclaiming a frame) page faults: 1607011
        Voluntary context switches: 44447
        Involuntary context switches: 6393
        Swaps: 0
        File system inputs: 4680
        File system outputs: 96
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
233672375

What you expected to happen: I would expect this output to be produced using at most around 1x the size of the data, since the serialized form has many repeated elements (field names) that should not need to be repeated in memory.

Even worse is the drastic jump in memory consumption based purely on output format - 14 gigabytes of peak RAM just to produce YAML output!

For contrast, see these two calls (same thing) to a much, much smaller cluster:

JSON output size: 3,385,403 characters, RAM used: 69,528K
YAML output size: 1,608,293 characters, RAM used: 182,472K

Producing the YAML output uses nearly 3 times the memory of producing the JSON output (while again being half the output size).

$ /usr/bin/time -v kubectl get pods --all-namespaces --output json | wc --chars
        Command being timed: "kubectl get pods --all-namespaces --output json"
        User time (seconds): 0.69
        System time (seconds): 0.02
        Percent of CPU this job got: 108%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.67
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 69528
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 7
        Minor (reclaiming a frame) page faults: 1530
        Voluntary context switches: 1115
        Involuntary context switches: 81
        Swaps: 0
        File system inputs: 1064
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
3385403

vs

$ /usr/bin/time -v kubectl get pods --all-namespaces --output yaml | wc --chars
        Command being timed: "kubectl get pods --all-namespaces --output yaml"
        User time (seconds): 1.36
        System time (seconds): 0.13
        Percent of CPU this job got: 129%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.15
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 182472
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 7
        Minor (reclaiming a frame) page faults: 2616
        Voluntary context switches: 1058
        Involuntary context switches: 72
        Swaps: 0
        File system inputs: 1224
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
1608293

How to reproduce it (as minimally and precisely as possible): See above. The cluster above is not tiny, and the effect is much more pronounced on larger clusters, but even on small clusters the memory consumption is far larger than the size of the data.

Anything else we need to know?: We have small periodic jobs that use kubectl to do some work. We like the abstraction it gives us over the underlying APIs for these higher-level jobs. The jobs have also worked across Kubernetes versions, other than the resource requirements.

We had already switched to JSON output and processing due to kubectl's large memory footprint for YAML, but have again hit issues with the large memory footprint (over 4GB just for the kubectl process).

Environment:

  • Kubernetes 1.18.6 and kubectl 1.18.6

    • Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
    • Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T19:08:46Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
  • Running in Azure

  • OS Ubuntu 18.04.5 LTS (workstation and all cluster VMs)

Michael-Sinz avatar Nov 10 '20 13:11 Michael-Sinz

@Michael-Sinz thanks for reporting!

I have a feeling this is related to https://github.com/kubernetes/kubectl/issues/866.

For triage's sake, could you please test this with v1.19.3 and the v1.20 beta, since you have access to a large cluster?

https://storage.googleapis.com/kubernetes-release/release/v1.19.3/bin/linux/amd64/kubectl https://storage.googleapis.com/kubernetes-release/release/v1.20.0-beta.1/bin/linux/amd64/kubectl

eddiezane avatar Nov 11 '20 01:11 eddiezane

A quick test (I will do more tomorrow) shows that 1.20.0-beta.1 actually uses slightly more memory (a very small difference), but I need to get access to the large production clusters to do a full verification, as it is after business hours now and things are scaling down from their peak usage. (We see 4x to 10x scale variation from low to high during the day.)

I am also still seeing significantly higher memory use when outputting YAML than JSON (multiples of the output size).

Michael-Sinz avatar Nov 12 '20 23:11 Michael-Sinz

I have run the 1.20.0-beta.1 and 1.19.3 now against a large cluster. Note that the cluster is rather dynamic so the runs are not of the exact same state across the runs. I also could not complete the yaml output as 1.20.0-beta.1 ran out of memory on my access VM (which has 16GBytes of RAM).

With JSON it does complete, but it still uses over 9x the memory of the output (7.3GB of RSS for 781MB of JSON output).

$ /usr/bin/time -v ./kubectl-1.20.0-beta.1 get pods --all-namespaces -o json | wc --chars
	Command being timed: "./kubectl-1.20.0-beta.1 get pods --all-namespaces -o json"
	User time (seconds): 111.33
	System time (seconds): 4.89
	Percent of CPU this job got: 93%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:04.61
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 7367652
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 253
	Minor (reclaiming a frame) page faults: 74874
	Voluntary context switches: 130416
	Involuntary context switches: 3827
	Swaps: 0
	File system inputs: 68680
	File system outputs: 1440
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
781719220

However, based on minor testing, it seems that 1.20 is better than 1.19:

$ /usr/bin/time -v ./kubectl-1.19.3 get pods --all-namespaces -o json | wc --chars
	Command being timed: "./kubectl-1.19.3 get pods --all-namespaces -o json"
	User time (seconds): 109.72
	System time (seconds): 4.85
	Percent of CPU this job got: 82%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:18.09
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 7662328
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 254
	Minor (reclaiming a frame) page faults: 77365
	Voluntary context switches: 117720
	Involuntary context switches: 2963
	Swaps: 0
	File system inputs: 65032
	File system outputs: 1440
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
779475036

Just for comparison, here is the kubectl 1.18.6 run against the same cluster. Note that it seems to use less memory than 1.19 or 1.20 (but, again, conditions keep changing in the cluster; I would use output size vs. RSS as the measure at this scale):

$ /usr/bin/time -v kubectl get pods --all-namespaces -o json | wc --chars
	Command being timed: "kubectl get pods --all-namespaces -o json"
	User time (seconds): 119.21
	System time (seconds): 2.91
	Percent of CPU this job got: 92%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:11.32
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 6952484
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 38
	Minor (reclaiming a frame) page faults: 60523
	Voluntary context switches: 81172
	Involuntary context switches: 2872
	Swaps: 0
	File system inputs: 6184
	File system outputs: 96
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
757897460

Michael-Sinz avatar Nov 16 '20 16:11 Michael-Sinz

This came up in our sig-cli meeting today.

This might be addressed by not sorting the data, or by streaming the output as it comes in (see the sketch below).

We currently toss everything into a kind: List, which is a type that holds mixed resources. This wasn't really intended to be used externally (#984).
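
As a rough, hedged sketch of the streaming idea (illustrative only - this is not the kubectl code path, and the helper here is made up): decode the List document item by item and re-emit each item immediately, so peak memory is roughly one item rather than the whole list.

// streamItems reads a large `kind: List` JSON document and re-emits each
// entry of its "items" array as soon as it is decoded, instead of first
// accumulating the entire list in memory. Simplified: it assumes the first
// "items" string token it sees is the top-level items key.
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

func streamItems(r io.Reader, w io.Writer) error {
	dec := json.NewDecoder(r)

	// Skip tokens until we hit the "items" key of the List object.
	for {
		tok, err := dec.Token()
		if err != nil {
			return err
		}
		if key, ok := tok.(string); ok && key == "items" {
			break
		}
	}
	// Consume the '[' that opens the items array.
	if _, err := dec.Token(); err != nil {
		return err
	}

	enc := json.NewEncoder(w)
	for dec.More() {
		var item map[string]interface{} // one resource at a time
		if err := dec.Decode(&item); err != nil {
			return err
		}
		if err := enc.Encode(item); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// e.g. kubectl get pods -A -o json | ./streamitems
	if err := streamItems(os.Stdin, os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}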

/triage accepted /priority backlog

eddiezane avatar Dec 10 '20 01:12 eddiezane

I do find it interesting that the memory use is so much more than the size of the JSON (and even worse for YAML - it is still unclear how we can use that much more memory just because the output is YAML).

While the kind: List may be a part of this, I think the bigger problem is more fundamental.

Michael-Sinz avatar Dec 10 '20 17:12 Michael-Sinz

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Mar 10 '21 17:03 fejta-bot

As far as I know, this issue is still live - the amount of memory used to get the complete pod status of the cluster is amazingly large for non-trivial clusters, far more than expected given the size of the (rather verbose) JSON-formatted output.

/remove-lifecycle stale

Michael-Sinz avatar Mar 10 '21 19:03 Michael-Sinz

@Michael-Sinz it turns out we have some profiling built into kubectl.

Can you please run the following and upload the dumps?

kubectl get pods -A --profile allocs --profile-output allocs-plain.pprof > /dev/null
kubectl get pods -A --profile heap --profile-output heap-plain.pprof > /dev/null
kubectl get pods -A --profile cpu --profile-output cpu-plain.pprof > /dev/null

kubectl get pods -A -o json --profile allocs --profile-output allocs-json.pprof > /dev/null
kubectl get pods -A -o json --profile heap --profile-output heap-json.pprof > /dev/null
kubectl get pods -A -o json --profile cpu --profile-output cpu-json.pprof > /dev/null

kubectl get pods -A -o yaml --profile allocs --profile-output allocs-yaml.pprof > /dev/null
kubectl get pods -A -o yaml --profile heap --profile-output heap-yaml.pprof > /dev/null
kubectl get pods -A -o yaml --profile cpu --profile-output cpu-yaml.pprof > /dev/null

If you want to take a peek at what you're uploading you can run go tool pprof -top allocs-json.pprof.

eddiezane avatar Mar 11 '21 20:03 eddiezane

I will have to see what I can poke at. I assume no details about the cluster will be in the profile reports, right? (I will take a look at the code in a bit)

I may not have the same size cluster available right now - we dynamically scale rather drastically all the time (up and down) so the overall scale of the cluster changes. (In addition to moving various workloads between different clusters)

But I should have some interesting files to post soon.

Michael-Sinz avatar Mar 11 '21 23:03 Michael-Sinz

I assume no details about the cluster will be in the profile reports, right?

From what I can tell.

Source.

Here's the proto it writes out.

eddiezane avatar Mar 11 '21 23:03 eddiezane

Note that our clusters are relatively dynamic so I can not assure you that the same state exists across all of the runs.

I did two clusters - a smaller one and a larger one. The larger one took a long time to complete, and I had to use a different machine to run the test, as the memory use was too high for my normal dev VM (it was literally OOM-killed by the kernel there).

I hope the data provides what you need. A quick look at the allocs shows a major cost for yaml. Almost like it produces the json in memory and then produces the yaml. Both show a rather large amount of memory usage.

k8s.profile.zip

json:

Showing nodes accounting for 12527.31MB, 88.12% of 14216.50MB total
Dropped 221 nodes (cum <= 71.08MB)
Showing top 10 nodes out of 106
      flat  flat%   sum%        cum   cum%
 3682.55MB 25.90% 25.90%  4989.50MB 35.10%  encoding/json.(*decodeState).objectInterface
 2312.24MB 16.26% 42.17%  2312.24MB 16.26%  encoding/json.(*Decoder).refill
 2112.20MB 14.86% 57.03%  2112.20MB 14.86%  bytes.makeSlice
  865.84MB  6.09% 63.12%   865.84MB  6.09%  encoding/json.(*RawMessage).UnmarshalJSON
  833.08MB  5.86% 68.98%   833.08MB  5.86%  reflect.mapiterinit
  761.46MB  5.36% 74.33%   937.93MB  6.60%  encoding/json.unquote
  758.13MB  5.33% 79.66%  3121.75MB 21.96%  encoding/json.mapEncoder.encode
  436.69MB  3.07% 82.74%  2676.60MB 18.83%  encoding/json.Marshal
  390.51MB  2.75% 85.48%   390.51MB  2.75%  reflect.copyVal
  374.62MB  2.64% 88.12%   481.12MB  3.38%  k8s.io/kubernetes/vendor/github.com/imdario/mergo.deepMerge

yaml:

Showing nodes accounting for 23816.33MB, 80.09% of 29735.36MB total
Dropped 233 nodes (cum <= 148.68MB)
Showing top 10 nodes out of 148
      flat  flat%   sum%        cum   cum%
 9192.17MB 30.91% 30.91%  9821.80MB 33.03%  k8s.io/kubernetes/vendor/gopkg.in/yaml%2ev2.yaml_emitter_emit
 3760.05MB 12.65% 43.56%  5028.68MB 16.91%  encoding/json.(*decodeState).objectInterface
 2518.07MB  8.47% 52.03%  2518.07MB  8.47%  bytes.makeSlice
 2270.52MB  7.64% 59.66%  2270.52MB  7.64%  encoding/json.(*Decoder).refill
 1653.68MB  5.56% 65.22%  1653.68MB  5.56%  k8s.io/kubernetes/vendor/gopkg.in/yaml%2ev2.(*parser).node
 1144.60MB  3.85% 69.07%  1144.60MB  3.85%  reflect.mapiterinit
  881.16MB  2.96% 72.04%  3468.13MB 11.66%  encoding/json.mapEncoder.encode
  876.60MB  2.95% 74.98%   876.60MB  2.95%  encoding/json.(*RawMessage).UnmarshalJSON
  760.70MB  2.56% 77.54%   918.10MB  3.09%  encoding/json.unquote
  758.77MB  2.55% 80.09%   758.77MB  2.55%  reflect.mapassign

Michael-Sinz avatar Mar 12 '21 13:03 Michael-Sinz

/cc

dougsland avatar Mar 12 '21 23:03 dougsland

go tool pprof -png -nodecount 10000 k8s.profile.large/allocs-yaml.pprof

(attached image: profile001, the rendered graph of allocs-yaml.pprof)

eddiezane avatar Mar 18 '21 00:03 eddiezane

whoa! @eddiezane

dims avatar Mar 18 '21 01:03 dims

cc @liggitt

dims avatar Mar 18 '21 01:03 dims

Almost like it produces the json in memory and then produces the yaml.

This is correct - in Kubernetes yaml is produced by translation from json. Ironic, for a system that has so many complaints about yaml, that it never actually uses it internally.
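
A minimal sketch of that translation path, assuming the sigs.k8s.io/yaml wrapper around the gopkg.in/yaml.v2 emitter that dominates the profile above (whether kubectl's printers call exactly these functions is not shown here): the JSON bytes, the generic structures re-decoded from them, and the final YAML bytes are all alive at the same time.

package main

import (
	"encoding/json"
	"fmt"

	"sigs.k8s.io/yaml"
)

func main() {
	// A stand-in for the object kubectl would print.
	obj := map[string]interface{}{
		"apiVersion": "v1",
		"kind":       "Pod",
		"metadata":   map[string]interface{}{"name": "example"},
	}

	// Step 1: serialize to JSON.
	jsonBytes, err := json.Marshal(obj)
	if err != nil {
		panic(err)
	}

	// Step 2: re-parse the JSON and emit it as YAML (internally this
	// decodes into generic maps again and then runs the yaml emitter).
	yamlBytes, err := yaml.JSONToYAML(jsonBytes)
	if err != nil {
		panic(err)
	}

	fmt.Printf("json: %d bytes, yaml: %d bytes\n", len(jsonBytes), len(yamlBytes))
}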

Note that Go will typically make a heap twice as big as it needs for your data; you can adjust this with the GOGC environment variable but only so far.

I tend to agree that #866 is the real underlying problem.

bboreham avatar Mar 18 '21 10:03 bboreham

I think #866 is part of the problem but there must be some serious inefficiencies when even the textual form of json (which is rather verbose and indented/etc) is many times smaller than the memory needed to produce it. Something very wrong is going on when that is the case. (Not to mention that the yaml is a translation of the json and not a serialization of the same objects that produced the json - I don't really care since I can consume the json just as well as the yaml)

RAM used: 6,952,484K
Size of JSON output: 757,897K

Where is the extra memory being used? Even if this is 2x the actual usage, we are still at 3,476,242K of memory used. Where did that extra ~2.7GB of RAM go, beyond the text form of the JSON data? How much additional data or structure would really be needed, given that the JSON contains so much redundant data due to its key:value syntax?

Michael-Sinz avatar Mar 18 '21 16:03 Michael-Sinz

there must be some serious inefficiencies when even the textual form of json (which is rather verbose and indented/etc) is many times smaller than the memory needed to produce it

that's not surprising at all. a json map can be represented in two bytes: {}. a non-nil empty map[string]interface{} object takes much more memory than that.
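
To make that concrete, here is a small self-contained sketch (not kubectl code; the document shape and repeat count are made up for illustration) that decodes a pod-list-like JSON document into generic map[string]interface{} values and compares the heap growth to the size of the JSON text. The exact ratio will vary by Go version and platform.

package main

import (
	"encoding/json"
	"fmt"
	"runtime"
	"strings"
)

// heapInUse forces a GC and reports the bytes currently allocated on the heap.
func heapInUse() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	// Build a JSON array of many small objects with repeated keys,
	// loosely imitating a pod list.
	item := `{"metadata":{"name":"pod","namespace":"default","labels":{"app":"x"}}}`
	doc := "[" + strings.Repeat(item+",", 9999) + item + "]"

	before := heapInUse()
	var decoded []interface{}
	if err := json.Unmarshal([]byte(doc), &decoded); err != nil {
		panic(err)
	}
	after := heapInUse()

	fmt.Printf("JSON text:          %d bytes\n", len(doc))
	fmt.Printf("decoded structures: ~%d bytes (~%.1fx the text)\n",
		after-before, float64(after-before)/float64(len(doc)))
	runtime.KeepAlive(decoded)
}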

liggitt avatar Mar 18 '21 16:03 liggitt

there must be some serious inefficiencies when even the textual form of json (which is rather verbose and indented/etc) is many times smaller than the memory needed to produce it

that's not surprising at all. a json map can be represented in two bytes: {}. a non-nil empty map[string]interface{} object takes much more memory than that.

First, we are not talking about empty objects. Those are special cases, and in both representations they are harder to measure. Once you have thousands of objects in there, things should look different.

With a few thousand pods, the 'metadata' key (and every other key in the maps) appears a few thousand times, but the keys do not need to be duplicated across the in-memory maps. And yet in the actual JSON output they are duplicated, because the format requires it.

When I wrote parsers and data storage for things like this, the keys were always mapped into unique entries. Only the customer data might need to be unique each time (and even that is not true in Kubernetes, since there is a lot of duplication there too), so I would look at string tables or unique-string types of solutions. Especially since strings are usually treated as immutable - you replace the whole thing rather than editing in place - this works out amazingly well.

We should be able to get much closer to the actual information content (in the information-theory sense) in memory, and be faster and more efficient because of it, since you will not bust the caches so hard (or make the working set so large).
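
A minimal sketch of the key-interning idea (an assumption about how a decoder could behave, not something kubectl does today; the type and names are purely illustrative): every decoded map key is looked up in a table so that strings like "metadata" exist once in memory instead of once per object. The payoff comes when keys arrive as freshly allocated strings from a decoder's byte buffer.

package main

import "fmt"

// interner hands back one canonical copy of each distinct string it sees.
type interner map[string]string

func (in interner) intern(s string) string {
	if canonical, ok := in[s]; ok {
		return canonical // reuse the existing allocation
	}
	in[s] = s
	return s
}

func main() {
	in := interner{}
	// Imagine these arriving, one fresh allocation each, from thousands
	// of decoded pod objects.
	keys := []string{"metadata", "name", "metadata", "namespace", "metadata"}
	for _, k := range keys {
		_ = in.intern(k)
	}
	fmt.Printf("saw %d keys, kept %d unique strings\n", len(keys), len(in))
}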

Michael-Sinz avatar Mar 31 '21 11:03 Michael-Sinz

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot avatar Jun 29 '21 12:06 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

k8s-triage-robot avatar Jul 29 '21 12:07 k8s-triage-robot

/remove-lifecycle rotten

Michael-Sinz avatar Aug 02 '21 14:08 Michael-Sinz

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 31 '21 14:10 k8s-triage-robot

/remove-lifecycle rotten

Michael-Sinz avatar Nov 01 '21 17:11 Michael-Sinz

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 31 '22 21:01 k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot avatar Feb 08 '23 21:02 k8s-triage-robot

Any updates on this issue? I've run into this problem as well. Getting ~600 resources within a namespace uses 400MB, while the resulting YAML is 22MB and the JSON is 31MB. Getting only the names and then fetching each manifest separately would take too long, so that isn't an option.

maartengo avatar Mar 30 '23 07:03 maartengo

/help

pacoxu avatar Nov 16 '23 14:11 pacoxu

@pacoxu: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Nov 16 '23 14:11 k8s-ci-robot