Performance improvements for `kustomize build`
We have a config repo which produces about 4000 Kubernetes resources, and we expect three times that number by September. `kustomize build` currently takes around 45 seconds on a developer laptop. In CI we run `kustomize build` against both master and the branch to find the impact of a change, and Flux also runs `kustomize build`. In CI this takes about 1.5 minutes per build, so in total we see about 7.5 minutes of compute time to roll out one change.
Since the output is just under 1.4 MB of YAML, I believe it should be possible to do this work much faster.
I have made four PRs with performance improvements. I have kept the PRs as small as possible, but I believe some of the changes might be cleaner with larger refactorings that make the new code more resilient to future changes (e.g. the resWrangler id map in PR https://github.com/kubernetes-sigs/kustomize/pull/5081). 5081 and 5082 might also need more tests; please provide feedback in each PR with suggested improvements, as I don't know the kustomize codebase.
Here's a summary of the proposed changes:
- https://github.com/kubernetes-sigs/kustomize/pull/5425 - add benchmark
- https://github.com/kubernetes-sigs/kustomize/pull/5079 - 40.68s to 27.41s
- https://github.com/kubernetes-sigs/kustomize/pull/5080 - 27.46s to 18.94s
- https://github.com/kubernetes-sigs/kustomize/pull/5081 - ~~18.94s to 4.79s~~ - will do this another way
- https://github.com/kubernetes-sigs/kustomize/pull/5427 - the other way
- https://github.com/kubernetes-sigs/kustomize/pull/5082 - 4.79s to 1.82s
pprof before changes:

*(pprof profile was attached as an image and is not reproduced here)*

After the changes, the remaining time is mostly YAML parsing and GC.
**Why is this needed?**

Faster linting of PRs, quicker reconciliation in Flux.

**Can you accomplish the motivating task without this feature, and if so, how?**

Splitting into smaller repos might help, but it would not allow us to analyze the whole service-mesh graph and the interactions between services/configurations.

**What other solutions have you considered?**

N/A

**Anything else we should know?**

No response
Feature ownership
- [X] I am interested in contributing this feature myself! 🎉
We will happily accept performance improvements. Since you've already opened some PRs, we can discuss specific changes on those PRs.
/triage accepted
Apologies for the delay - we will try to have people review your PRs soon!
Another thought here would be to join the lookups done by Resource.CurId(). It calls Resource.GetGvk (which calls RNode.GetApiVersion and RNode.GetKind), Resource.GetName, and Resource.GetNamespace.
That means there are four independent traversals at the top level (apiVersion, kind, and metadata twice). Then, in metadata, there are two independent traversals (name and namespace).
This flow could be optimized so that a single traversal finds apiVersion, kind, and metadata, followed by a single traversal within metadata that finds both name and namespace.
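A minimal, self-contained sketch of the single-pass idea described above. Note this is illustrative only: the real `Resource.CurId()` operates on kyaml `RNode` values, not plain maps, so the types and helper name here are assumptions, not kustomize's API.

```go
package main

import "fmt"

// resId extracts apiVersion, kind, name, and namespace with one pass
// over the top-level keys and one pass over metadata, instead of four
// independent top-level traversals plus two metadata traversals.
func resId(res map[string]any) (apiVersion, kind, name, namespace string) {
	for k, v := range res {
		switch k {
		case "apiVersion":
			apiVersion, _ = v.(string)
		case "kind":
			kind, _ = v.(string)
		case "metadata":
			meta, _ := v.(map[string]any)
			for mk, mv := range meta {
				switch mk {
				case "name":
					name, _ = mv.(string)
				case "namespace":
					namespace, _ = mv.(string)
				}
			}
		}
	}
	return
}

func main() {
	res := map[string]any{
		"apiVersion": "apps/v1",
		"kind":       "Deployment",
		"metadata": map[string]any{
			"name":      "my-app",
			"namespace": "default",
		},
	}
	av, k, n, ns := resId(res)
	fmt.Println(av, k, n, ns) // apps/v1 Deployment my-app default
}
```

Whether joining the traversals pays off in practice depends on how expensive each `RNode` field lookup is relative to the rest of the build, which is exactly what the follow-up measurement below addresses.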
> Another thought here would be to join the lookups done by Resource.CurId().
I took a little time to investigate this option, and the improvement was minor at best - not really worth the effort.
I've started to investigate what might be possible by caching CurId() on the Resource. Given that caching isn't already in place, I'm a little worried about complexities I may find with cache invalidation. I had hoped to lean on the code path that updates the list of previous ids, but there appear to be gaps. For example, api/krusty/component_test.go's TestComponent/multiple-components fails; I don't yet know whether that indicates a flaw in the existing code (a spot where the previous-id list should be updated but is not), or a flaw in my assumption that any change in CurId() should be accompanied by an update to the previous-id list. I will continue investigating.
Initial results for caching the Resource.CurId() return value are very promising. I hooked cache invalidation into Resource.setPreviousId() and resWrangler.appendReplaceOrMerge()'s case 1 for replace and merge, and that looks to cover the unit test cases. Note that there are a small number of unit tests that cannot run cleanly on my system, so I may have gaps there.
@natasha41575 (and others), before I consider moving forward with this change, do you know if there are reasons why caching Resource.CurId() could be problematic? I feel like there may be hidden pitfalls here. Is this sort of caching in line with kustomize coding patterns? In addition to resWrangler.appendReplaceOrMerge, are there other spots that might adjust a resource in a way that could alter its ResId, but do not issue a call to Resource.setPreviousId()? Anything other aspect I might be missing?
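The caching pattern under discussion can be sketched in miniature. This is not kustomize's implementation: the real `Resource` wraps a kyaml `RNode`, and the invalidation hooks would live in `setPreviousId()` and `resWrangler.appendReplaceOrMerge()`; the `fields` map, `ResId` shape, and `Set` method here are simplifications for illustration.

```go
package main

import "fmt"

// ResId is a simplified stand-in for kustomize's resource identifier.
type ResId struct {
	Kind, Name, Namespace string
}

// Resource caches its computed id; any mutation that could change the
// id must drop the cache, which is the crux of the invalidation concern.
type Resource struct {
	fields   map[string]string
	cachedId *ResId
	computes int // counts real computations, to demonstrate the cache
}

// CurId returns the cached id, computing it only when the cache is empty.
func (r *Resource) CurId() ResId {
	if r.cachedId == nil {
		r.computes++
		r.cachedId = &ResId{
			Kind:      r.fields["kind"],
			Name:      r.fields["name"],
			Namespace: r.fields["namespace"],
		}
	}
	return *r.cachedId
}

// Set mutates a field and invalidates the cached id - the moral
// equivalent of calling an invalidation hook from every id-affecting
// code path. Missing such a path yields stale ids, which is the risk
// raised above.
func (r *Resource) Set(k, v string) {
	r.fields[k] = v
	r.cachedId = nil
}

func main() {
	r := &Resource{fields: map[string]string{"kind": "Deployment", "name": "a", "namespace": "default"}}
	r.CurId()
	r.CurId() // served from cache, no recompute
	r.Set("name", "b")
	id := r.CurId() // recomputed after invalidation
	fmt.Println(id.Name, r.computes) // b 2
}
```

The design trades a small amount of bookkeeping for skipping repeated tree traversals; its correctness hinges entirely on every id-mutating code path calling the invalidation hook.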
I did some testing using the benchmark from #5425. However, I didn't want to wait forever so I adjusted the second level resource count from 100 down to 20.
```console
$ git log --oneline -1
e219b8864 (HEAD -> master, upstream/master, origin/master, origin/HEAD) Merge pull request #5421 from prashantrewar/swap-yaml-library
$ go test ./kustomize/commands/build -run nope -bench BenchmarkBuild -benchmem
goos: linux
goarch: amd64
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
cpu: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
BenchmarkBuild-2    1    149546683100 ns/op    2276899072 B/op    21421892 allocs/op
PASS
ok      sigs.k8s.io/kustomize/kustomize/v5/commands/build       149.598s
$ git checkout optimize-curid
M       go.work.sum
Switched to branch 'optimize-curid'
Your branch is ahead of 'master' by 1 commit.
  (use "git push" to publish your local commits)
$ go test ./kustomize/commands/build -run nope -bench BenchmarkBuild -benchmem
goos: linux
goarch: amd64
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
cpu: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
BenchmarkBuild-2    1    12183853800 ns/op    2280974424 B/op    21462373 allocs/op
PASS
ok      sigs.k8s.io/kustomize/kustomize/v5/commands/build       12.236s
$
```
> Initial results for caching the Resource.CurId() return value are very promising.
I went ahead and created #5481 for this effort. I still have some concerns about what other code paths might need to issue the call to invalidate the ID caches, but after discussion in https://github.com/kubernetes-sigs/kustomize/issues/5422#issuecomment-1845401025 I figured it was worth sharing the work. I don't know of any other spots in the code that would need the additions, so there's not much benefit in me keeping the PR private.
So far, no performance changes against v5.3.0 (multiple invocations scenario):
```
Starting kustomize benchmark on Linux x86_64
kustomize versions:
  5.2.1
  5.3.0
  PR-5481
iterations per test: 200
tests:
  1_no-patches
  2_patches-json6902
  3_patches-strategic-merge
  4_no-patches-unknown-kind
  5_component-no-base-no-patches
  6_component-json6902-over-base
  7_component-PSM-over-base
time unit: seconds
```

| version | test 1 | test 2 | test 3 | test 4 | test 5 | test 6 | test 7 |
|---------|--------|--------|--------|--------|--------|--------|--------|
| v5.2.1  | 1.37 | 1.52 | 11.96 | 1.37 | 2.25 | 3.10 | 13.40 |
| v5.3.0  | 1.33 | 1.47 | 12.29 | 1.34 | 1.54 | 1.92 | 12.34 |
| PR-5481 | 1.37 | 1.46 | 12.19 | 1.30 | 1.55 | 1.81 | 12.29 |
@shapirus, I'm not surprised #5481 left the PSM performance (#5422) as-is. The optimization in #5481 was aimed at #5425, and in that context it has dramatic improvements. If I can carve some time in the next few weeks, I'll take a close look at #5422 and update there with any useful information I can find.
This issue has not been updated in over 1 year, and should be re-triaged.

You can:
- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Close this issue with `/close`

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
> @shapirus, I'm not surprised #5481 left the PSM performance (#5422) as-is. The optimization in #5481 was aimed at #5425, and in that context it has dramatic improvements. If I can carve some time in the next few weeks, I'll take a close look at #5422 and update there with any useful information I can find.
@ephesused have you had any further time/enthusiasm to look into this? :) PSM performance still remains worse than in the old v3.5.4, whereas the rest of my test cases showed improvements, sometimes pretty significant.
If only we could improve PSM...
/remove-lifecycle rotten
> @ephesused have you had any further time/enthusiasm to look into this? :)
I have not had additional time, no. Assuming there haven't been major changes in this area, I suspect a key part of the performance continues to be the default schema load, as suspected here and investigated here.
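If the default schema load really is the dominant cost, one generic mitigation is to memoize the expensive parse behind `sync.Once` so it is paid at most once per process, no matter how many strategic-merge patches are applied. This is only a sketch of the pattern, not kustomize's actual loader (the real one lives in kyaml's openapi package and may already cache); `loadSchema` and the map-based schema type here are stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// loadSchema is a stand-in for an expensive one-time parse, such as
// loading a bundled default OpenAPI schema.
func loadSchema() map[string]bool {
	fmt.Println("parsing schema...") // expensive work, runs at most once
	return map[string]bool{"Deployment": true}
}

var (
	schemaOnce sync.Once
	schema     map[string]bool
)

// getSchema memoizes the load: callers after the first reuse the
// already-parsed result instead of paying the parse cost again.
func getSchema() map[string]bool {
	schemaOnce.Do(func() { schema = loadSchema() })
	return schema
}

func main() {
	for i := 0; i < 3; i++ {
		_ = getSchema() // "parsing schema..." prints only on the first call
	}
	fmt.Println(len(getSchema()))
}
```

Whether this applies depends on how the loader is structured; if the cost is one unavoidable parse per process rather than repeated parses, the remaining options are a cheaper schema encoding or lazier, partial loading.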
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale