Performance improvements for `kustomize build`
We have a config repo which produces about 4000 Kubernetes resources, and we expect three times that number by September. `kustomize build` currently takes around 45 seconds on a developer laptop. In CI we run `kustomize build` against both master and the branch to find the impact of a change, and Flux also runs `kustomize build`. In CI this takes about 1.5 minutes per build, so in total we see about 7.5 minutes of compute time to roll out one change.
Since the output is just under 1.4 MB of YAML, I believe it should be possible to do this work much faster.
I have made four PRs with performance improvements. I have kept the PRs as small as possible, but I believe some of the changes might be cleaner with larger refactorings that make the new code more resilient to future changes (e.g. the resWrangler id map in PR https://github.com/kubernetes-sigs/kustomize/pull/5081). 5081 and 5082 might also need more tests; please provide feedback in each PR with suggested improvements, as I don't know the kustomize codebase.
Here's a summary of the proposed changes:
- https://github.com/kubernetes-sigs/kustomize/pull/5425 - add benchmark
- https://github.com/kubernetes-sigs/kustomize/pull/5079 - 40.68s to 27.41s
- https://github.com/kubernetes-sigs/kustomize/pull/5080 - 27.46s to 18.94s
- https://github.com/kubernetes-sigs/kustomize/pull/5081 - ~~18.94s to 4.79s~~ - will do this another way
- https://github.com/kubernetes-sigs/kustomize/pull/5427 - the other way
- https://github.com/kubernetes-sigs/kustomize/pull/5082 - 4.79s to 1.82s
pprof before changes:

*(pprof profile was attached as an image and is not reproduced here)*

After the changes, the remaining time is mostly YAML parsing and GC.
**Why is this needed?**

Faster linting of PRs, quicker reconciliation in Flux.

**Can you accomplish the motivating task without this feature, and if so, how?**

Splitting into smaller repos might help, but it would not allow us to analyze the whole service-mesh graph and the interactions between services/configurations.

**What other solutions have you considered?**

N/A

**Anything else we should know?**

No response
Feature ownership
- [X] I am interested in contributing this feature myself! 🎉
We will happily accept performance improvements. Since you've already opened some PRs, we can discuss specific changes on those PRs.
/triage accepted
Apologies for the delay - we will try to have people review your PRs soon!
Another thought here would be to join the lookups done by Resource.CurId(). It calls Resource.GetGvk (which calls RNode.GetApiVersion and RNode.GetKind), Resource.GetName, and Resource.GetNamespace.
That means there are four independent traversals at the top level (apiVersion, kind, and metadata twice). Then, in metadata, there are two independent traversals (name and namespace).
This flow could be optimized so that a single traversal finds apiVersion, kind, and metadata, followed by a single traversal within metadata that finds both name and namespace.
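A minimal, self-contained sketch of the single-pass idea described above. Note this is illustrative only: the real `Resource.CurId()` operates on kyaml `RNode` values, not plain maps, so the types and helper name here are assumptions, not kustomize's API.

```go
package main

import "fmt"

// resId extracts apiVersion, kind, name, and namespace with one pass
// over the top-level keys and one pass over metadata, instead of four
// independent top-level traversals plus two metadata traversals.
func resId(res map[string]any) (apiVersion, kind, name, namespace string) {
	for k, v := range res {
		switch k {
		case "apiVersion":
			apiVersion, _ = v.(string)
		case "kind":
			kind, _ = v.(string)
		case "metadata":
			meta, _ := v.(map[string]any)
			for mk, mv := range meta {
				switch mk {
				case "name":
					name, _ = mv.(string)
				case "namespace":
					namespace, _ = mv.(string)
				}
			}
		}
	}
	return
}

func main() {
	res := map[string]any{
		"apiVersion": "apps/v1",
		"kind":       "Deployment",
		"metadata": map[string]any{
			"name":      "my-app",
			"namespace": "default",
		},
	}
	av, k, n, ns := resId(res)
	fmt.Println(av, k, n, ns) // apps/v1 Deployment my-app default
}
```

Whether joining the traversals pays off in practice depends on how expensive each `RNode` field lookup is relative to the rest of the build, which is exactly what the follow-up measurement below addresses.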
> Another thought here would be to join the lookups done by Resource.CurId().
I took a little time to investigate this option, and the improvement was minor at best - not really worth the effort.
I've started to investigate what might be possible by caching CurId() on the Resource. Given that caching isn't already in place, I'm a little worried about complexities I may find with cache invalidation. I had hoped to lean on the code path that updates the list of previous ids, but there appear to be gaps. For example, api/krusty/component_test.go's TestComponent/multiple-components fails; I don't yet know whether that indicates a flaw in the existing code (a spot where the previous-id list should be updated but is not), or a flaw in my assumption that any change in CurId() should be accompanied by an update to the previous-id list. I will continue investigating.
Initial results for caching the Resource.CurId() return value are very promising. I hooked cache invalidation into Resource.setPreviousId() and resWrangler.appendReplaceOrMerge()'s case 1 for replace and merge, and that looks to cover the unit test cases. Note that there are a small number of unit tests that cannot run cleanly on my system, so I may have gaps there.
@natasha41575 (and others), before I consider moving forward with this change, do you know if there are reasons why caching Resource.CurId() could be problematic? I feel like there may be hidden pitfalls here. Is this sort of caching in line with kustomize coding patterns? In addition to resWrangler.appendReplaceOrMerge, are there other spots that might adjust a resource in a way that could alter its ResId, but do not issue a call to Resource.setPreviousId()? Anything other aspect I might be missing?
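The caching pattern under discussion can be sketched in miniature. This is not kustomize's implementation: the real `Resource` wraps a kyaml `RNode`, and the invalidation hooks would live in `setPreviousId()` and `resWrangler.appendReplaceOrMerge()`; the `fields` map, `ResId` shape, and `Set` method here are simplifications for illustration.

```go
package main

import "fmt"

// ResId is a simplified stand-in for kustomize's resource identifier.
type ResId struct {
	Kind, Name, Namespace string
}

// Resource caches its computed id; any mutation that could change the
// id must drop the cache, which is the crux of the invalidation concern.
type Resource struct {
	fields   map[string]string
	cachedId *ResId
	computes int // counts real computations, to demonstrate the cache
}

// CurId returns the cached id, computing it only when the cache is empty.
func (r *Resource) CurId() ResId {
	if r.cachedId == nil {
		r.computes++
		r.cachedId = &ResId{
			Kind:      r.fields["kind"],
			Name:      r.fields["name"],
			Namespace: r.fields["namespace"],
		}
	}
	return *r.cachedId
}

// Set mutates a field and invalidates the cached id - the moral
// equivalent of calling an invalidation hook from every id-affecting
// code path. Missing such a path yields stale ids, which is the risk
// raised above.
func (r *Resource) Set(k, v string) {
	r.fields[k] = v
	r.cachedId = nil
}

func main() {
	r := &Resource{fields: map[string]string{"kind": "Deployment", "name": "a", "namespace": "default"}}
	r.CurId()
	r.CurId() // served from cache, no recompute
	r.Set("name", "b")
	id := r.CurId() // recomputed after invalidation
	fmt.Println(id.Name, r.computes) // b 2
}
```

The design trades a small amount of bookkeeping for skipping repeated tree traversals; its correctness hinges entirely on every id-mutating code path calling the invalidation hook.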
I did some testing using the benchmark from #5425. However, I didn't want to wait forever so I adjusted the second level resource count from 100 down to 20.
```console
$ git log --oneline -1
e219b8864 (HEAD -> master, upstream/master, origin/master, origin/HEAD) Merge pull request #5421 from prashantrewar/swap-yaml-library
$ go test ./kustomize/commands/build -run nope -bench BenchmarkBuild -benchmem
goos: linux
goarch: amd64
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
cpu: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
BenchmarkBuild-2    1    149546683100 ns/op    2276899072 B/op    21421892 allocs/op
PASS
ok      sigs.k8s.io/kustomize/kustomize/v5/commands/build       149.598s
$ git checkout optimize-curid
M       go.work.sum
Switched to branch 'optimize-curid'
Your branch is ahead of 'master' by 1 commit.
  (use "git push" to publish your local commits)
$ go test ./kustomize/commands/build -run nope -bench BenchmarkBuild -benchmem
goos: linux
goarch: amd64
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
cpu: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
BenchmarkBuild-2    1    12183853800 ns/op    2280974424 B/op    21462373 allocs/op
PASS
ok      sigs.k8s.io/kustomize/kustomize/v5/commands/build       12.236s
$
```
> Initial results for caching the Resource.CurId() return value are very promising.
I went ahead and created #5481 for this effort. I still have some concerns about what other code paths might need to issue the call to invalidate the ID caches, but after discussion in https://github.com/kubernetes-sigs/kustomize/issues/5422#issuecomment-1845401025 I figured it was worth sharing the work. I don't know of any other spots in the code that would need the additions, so there's not much benefit in me keeping the PR private.
So far, no performance changes against v5.3.0 (multiple invocations scenario):
```
Starting kustomize benchmark on Linux x86_64
kustomize versions:
  5.2.1
  5.3.0
  PR-5481
iterations per test: 200
tests:
  1_no-patches
  2_patches-json6902
  3_patches-strategic-merge
  4_no-patches-unknown-kind
  5_component-no-base-no-patches
  6_component-json6902-over-base
  7_component-PSM-over-base
time unit: seconds
```

| version | test 1 | test 2 | test 3 | test 4 | test 5 | test 6 | test 7 |
|---------|--------|--------|--------|--------|--------|--------|--------|
| v5.2.1  | 1.37 | 1.52 | 11.96 | 1.37 | 2.25 | 3.10 | 13.40 |
| v5.3.0  | 1.33 | 1.47 | 12.29 | 1.34 | 1.54 | 1.92 | 12.34 |
| PR-5481 | 1.37 | 1.46 | 12.19 | 1.30 | 1.55 | 1.81 | 12.29 |
@shapirus, I'm not surprised #5481 left the PSM performance (#5422) as-is. The optimization in #5481 was aimed at #5425, and in that context it has dramatic improvements. If I can carve some time in the next few weeks, I'll take a close look at #5422 and update there with any useful information I can find.
This issue has not been updated in over 1 year, and should be re-triaged.

You can:
- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Close this issue with `/close`

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
> @shapirus, I'm not surprised #5481 left the PSM performance (#5422) as-is. The optimization in #5481 was aimed at #5425, and in that context it has dramatic improvements. If I can carve some time in the next few weeks, I'll take a close look at #5422 and update there with any useful information I can find.
@ephesused have you had any further time/enthusiasm to look into this? :) PSM performance still remains worse than in the old v3.5.4, whereas the rest of my test cases showed improvements, sometimes pretty significant.
If only we could improve PSM...
/remove-lifecycle rotten
> @ephesused have you had any further time/enthusiasm to look into this? :)
I have not had additional time, no. Assuming there haven't been major changes in this area, I suspect a key part of the performance continues to be the default schema load, as suspected here and investigated here.
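If the default schema load really is the dominant cost, one generic mitigation is to memoize the expensive parse behind `sync.Once` so it is paid at most once per process, no matter how many strategic-merge patches are applied. This is only a sketch of the pattern, not kustomize's actual loader (the real one lives in kyaml's openapi package and may already cache); `loadSchema` and the map-based schema type here are stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// loadSchema is a stand-in for an expensive one-time parse, such as
// loading a bundled default OpenAPI schema.
func loadSchema() map[string]bool {
	fmt.Println("parsing schema...") // expensive work, runs at most once
	return map[string]bool{"Deployment": true}
}

var (
	schemaOnce sync.Once
	schema     map[string]bool
)

// getSchema memoizes the load: callers after the first reuse the
// already-parsed result instead of paying the parse cost again.
func getSchema() map[string]bool {
	schemaOnce.Do(func() { schema = loadSchema() })
	return schema
}

func main() {
	for i := 0; i < 3; i++ {
		_ = getSchema() // "parsing schema..." prints only on the first call
	}
	fmt.Println(len(getSchema()))
}
```

Whether this applies depends on how the loader is structured; if the cost is one unavoidable parse per process rather than repeated parses, the remaining options are a cheaper schema encoding or lazier, partial loading.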
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale