
perf: Add full program benchmark for kustomize build

Open chlunde opened this issue 1 year ago • 13 comments

This change introduces a benchmarking test that constructs a complete kustomization tree using various features of Kustomize.

This update aims to address several objectives:

  • Demonstrating current performance challenges in Kustomize in a reproducible manner.
  • Evaluating the effects of performance enhancements.
  • Guarding against potential performance setbacks and inadvertent quadratic behavior in the future.
  • Considering the possibility of incorporating profile-guided optimization (PGO) in future iterations.

Usage:

go test -run=x -bench=BenchmarkBuild ./kustomize/commands/build

# sigs.k8s.io/kustomize/kustomize/v5/commands/build.test
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
BenchmarkBuild-8   	       1	8523677542 ns/op
PASS
ok  	sigs.k8s.io/kustomize/kustomize/v5/commands/build	8.798s
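
For readers who want a feel for the overall shape of such a benchmark, here is a minimal, self-contained sketch against the public krusty API. The tiny in-memory kustomization, the benchmark name, and the option values are illustrative assumptions only, not the code added by this PR (which generates a much larger tree under kustomize/commands/build):

package build_test

import (
    "testing"

    "sigs.k8s.io/kustomize/api/krusty"
    "sigs.k8s.io/kustomize/kyaml/filesys"
)

// BenchmarkBuildTiny runs kustomize build in-process against a one-resource
// kustomization held in an in-memory filesystem. The real benchmark in this
// PR builds a far larger tree; this only illustrates the shape of the loop.
func BenchmarkBuildTiny(b *testing.B) {
    fSys := filesys.MakeFsInMemory()
    files := map[string]string{
        "/app/kustomization.yaml": "namePrefix: bench-\nresources:\n- deployment.yaml\n",
        "/app/deployment.yaml":    "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: demo\n",
    }
    for path, data := range files {
        if err := fSys.WriteFile(path, []byte(data)); err != nil {
            b.Fatal(err)
        }
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        k := krusty.MakeKustomizer(krusty.MakeDefaultOptions())
        if _, err := k.Run(fSys, "/app"); err != nil {
            b.Fatal(err)
        }
    }
}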

Currently, this benchmark requires 3000 seconds to run on my machine. In order to run it on master today, you need to add -timeout=30m to the go test command.

The dataset size was chosen because I believe it represents a real workload for which we should be able to achieve a runtime of less than 10 seconds.

Updates #5084

Notes on PGO:

Real-life profiles would be better, but creating one based on a benchmark should not hurt:

https://go.dev/doc/pgo#collecting-profiles

Will PGO with an unrepresentative profile make my program slower than no PGO? It should not. While a profile that is not representative of production behavior will result in optimizations in cold parts of the application, it should not make hot parts of the application slower. If you encounter a program where PGO results in worse performance than disabling PGO, please file an issue at https://go.dev/issue/new.

Collecting a profile:

go test -cpuprofile cpu1.pprof -run=^$ -bench ^BenchmarkBuild$ sigs.k8s.io/kustomize/kustomize/v5/commands/build

go build -pgo=./cpu1.pprof  -o kust-pgo ./kustomize
go build -o kust-nopgo ./kustomize

Compare PGO and non-PGO-builds:

./kust-pgo build -o /dev/null testdata/  21.88s user 2.00s system 176% cpu 13.505 total
./kust-nopgo build -o /dev/null testdata/  22.76s user 1.98s system 174% cpu 14.170 total

chlunde avatar Oct 30 '23 11:10 chlunde

Hi @chlunde. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 30 '23 11:10 k8s-ci-robot

One important aspect that I'd like to point out here is that the benchmarks should cover the following cases:

  • single invocation of kustomize to process a big and complex resource tree
  • multiple invocations of kustomize to process a small set of manifests on each run (simulates e.g. argocd handling many microservices, or a validation CI job running on the repository hosting a large set of manifests)

These cases are very different in terms of the performance impact that specific parts of code create, yet each of them is quite valid as far as practical usage is concerned.

shapirus avatar Oct 30 '23 13:10 shapirus

/ok-to-test
/triage accepted

natasha41575 avatar Oct 30 '23 15:10 natasha41575

In hindsight, it makes no sense to merge this with the current performance. Either we wait for a few of the pending performance patches to be merged, or we adjust some of the sizing parameters to get reasonable numbers (runtime <10m in CI).

I was a bit surprised at how slow the baseline was, as I had quite a few performance patches applied locally at the time I wrote the benchmark.

@shapirus yeah, we should have some benchmarks emulating argo/flux too. Do you think this affects how this PR is structured? I think we should look at the API surface used by argo/flux so we use the right entry points, and we should probably avoid using the file system, so I think it would be a different benchmark.

chlunde avatar Oct 30 '23 21:10 chlunde

In hindsight, it makes no sense to merge this with the current performance. Either we wait for a few of the pending performance patches to be merged, or we adjust some of the sizing parameters to get reasonable numbers (runtime <10m in CI).

I agree that it would be beneficial to limit the duration and scope of these performance tests. I really like the suggestion from @shapirus. Ideally, we can run meaningful benchmark tests as a PR check. This would help us understand if a specific commit significantly changed the performance.

ncapps avatar Oct 30 '23 21:10 ncapps

@shapirus yeah, we should have some benchmarks emulating argo/flux too. Do you think this affects how this PR is structured?

@chlunde As far as I can see, it is basically a separate Go program that imports kustomize as a library and then runs tests in its own context. This will only allow benchmarking the "single invocation, big resource tree" scenario, because it incurs the startup overhead only once (see https://github.com/kubernetes-sigs/kustomize/issues/5422#issuecomment-1783488275).

To emulate argocd/fluxcd we need to, well, do what they do: invoke an actual kustomize binary (which, in the CI flow, should already be built by the time the benchmarks run, if I understand correctly) a certain number of times and measure the total run time.

My set of benchmarks at https://github.com/shapirus/kustomize-benchmark-suite (the Dockerfile serves as a readme there) does exactly that. Probably some of its core code could be reused for the kustomize CI. Yes, it's somewhat crude, but also small enough to be understood and reused. If we're limited to Go (and e.g. can't use shell), that's no problem either: Go can spawn processes to run commands and measure their execution time just as well, and with cleaner code.

Unfortunately I lack the knowledge required to convert these benchmarks into a proper CI workflow step for kustomize, but it should be easy, if not trivial, for someone who has it.
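
To illustrate the spawn-and-measure approach in Go rather than shell, a benchmark could shell out to a prebuilt kustomize binary and let the testing package handle timing. The binary path and manifest directory below are placeholders, not anything that exists in this repository:

package build_test

import (
    "os/exec"
    "testing"
)

// BenchmarkBuildCLI simulates the argocd-style usage pattern: many short
// invocations of a standalone kustomize binary, each paying the full
// process startup cost. Both paths are placeholders for whatever CI provides.
func BenchmarkBuildCLI(b *testing.B) {
    const kustomizeBin = "./kustomize"   // prebuilt binary under test
    const manifestDir = "testdata/small" // one small kustomization per invocation
    for i := 0; i < b.N; i++ {
        out, err := exec.Command(kustomizeBin, "build", "-o", "/dev/null", manifestDir).CombinedOutput()
        if err != nil {
            b.Fatalf("kustomize build failed: %v\n%s", err, out)
        }
    }
}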

I think we should look at the API surface used by argo/flux so we use the right entry points

There is really no need to do this, I think. I can't speak for fluxcd, but argocd simply runs a standalone kustomize executable binary to build the resulting manifest for each Application it has been asked to render. This is easily simulated with a simple shell script. Let's try to avoid overengineering :).

and we should probably avoid using the file system, so I think it would be a different benchmark.

In the ideal world, yes, but in practice it doesn't make any discernible difference, unless the underlying block device is very slow and/or has high latency. On the other hand, it adds extra complications.

To reduce the effects of disk access overhead sufficiently for our specific practical use case, we can run find ./manifests-tree -type f -exec cat {} > /dev/null \; before running the benchmark to pre-heat the page cache by reading all the files we're going to need.

My bigger concern is not the file system, but the shared CPU on the machine where the benchmark runs. If there are other CPU-intensive jobs running on the same machine, they can affect the benchmark execution times quite significantly. There are possible workarounds however:

  • run the benchmark process at a raised scheduling priority, e.g. nice -n -19 (not always possible, requires privileges)
  • use a sufficiently high number of iterations to make the benchmark run at least a few tens of seconds to reduce the effects of small lags and fluctuations (i.e. you can't reliably benchmark when your test takes, say, a second or less, unless you have hardware dedicated to this test alone)
  • run the same benchmark several times and use the result of the run that completed in the shortest time (aka had most of the CPU available to itself alone); a minimal sketch of this approach appears after this list
  • use a sufficiently high threshold to mark the test as failed when a regression against a previous version is detected (say > 40%) -- however, this makes it necessary to test not only against that one version, but against several previous versions to detect the cumulative effect of small incremental regressions, and implementing this logic properly so that it detects regressions reliably and without false positives may be non-trivial. There are probably existing tools; someone with a QA background might have more ideas.
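
A minimal sketch of the best-of-N idea mentioned above, assuming the command under test is a prebuilt kustomize binary (the run count and paths are placeholders):

package main

import (
    "fmt"
    "os/exec"
    "time"
)

// bestOf runs the given command n times and returns the shortest wall-clock
// duration, on the assumption that the fastest run had the least interference
// from other processes sharing the CPU.
func bestOf(n int, name string, args ...string) (time.Duration, error) {
    var best time.Duration
    for i := 0; i < n; i++ {
        start := time.Now()
        if err := exec.Command(name, args...).Run(); err != nil {
            return 0, err
        }
        if elapsed := time.Since(start); best == 0 || elapsed < best {
            best = elapsed
        }
    }
    return best, nil
}

func main() {
    // Placeholder command and path; in CI this would be the freshly built binary.
    d, err := bestOf(5, "./kustomize", "build", "-o", "/dev/null", "testdata/")
    if err != nil {
        panic(err)
    }
    fmt.Println("best of 5:", d)
}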

shapirus avatar Oct 31 '23 09:10 shapirus

In poking around at this benchmark, I did notice some behavior that surprised me. I adjusted the value of genconfig[1].resources and the run time jump between 1 and 2 stands out dramatically. @chlunde (or others), do you see the same behavior?

Admittedly, I haven't taken the time to understand the test case well.

$ DEPTH_ONE_RESOURCE_COUNT=1 go test ./kustomize/commands/build -run nope -bench BenchmarkBuild
goos: windows
goarch: amd64                                         
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
cpu: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz          
BenchmarkBuild-8        1000000000               0.4780 ns/op
--- BENCH: BenchmarkBuild-8                                           
    build_benchmark_test.go:162: genconfig[1].resources redefined as 1
PASS                                                                  
ok      sigs.k8s.io/kustomize/kustomize/v5/commands/build       12.829s

$ DEPTH_ONE_RESOURCE_COUNT=2 go test ./kustomize/commands/build -run nope -bench BenchmarkBuild
goos: windows
goarch: amd64                                         
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
cpu: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz          
BenchmarkBuild-8               1        1639273100 ns/op
--- BENCH: BenchmarkBuild-8
    build_benchmark_test.go:162: genconfig[1].resources redefined as 2
PASS
ok      sigs.k8s.io/kustomize/kustomize/v5/commands/build       3.119s

$ DEPTH_ONE_RESOURCE_COUNT=3 go test ./kustomize/commands/build -run nope -bench BenchmarkBuild
goos: windows
goarch: amd64
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
cpu: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
BenchmarkBuild-8               1        2591509400 ns/op
--- BENCH: BenchmarkBuild-8
    build_benchmark_test.go:162: genconfig[1].resources redefined as 3
PASS
ok      sigs.k8s.io/kustomize/kustomize/v5/commands/build       3.841s

$ DEPTH_ONE_RESOURCE_COUNT=4 go test ./kustomize/commands/build -run nope -bench BenchmarkBuild
goos: windows
goarch: amd64
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
cpu: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
BenchmarkBuild-8               1        3785031200 ns/op
--- BENCH: BenchmarkBuild-8
    build_benchmark_test.go:162: genconfig[1].resources redefined as 4
PASS
ok      sigs.k8s.io/kustomize/kustomize/v5/commands/build       5.099s

$
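
For context, DEPTH_ONE_RESOURCE_COUNT in the runs above overrides one of the benchmark's sizing parameters (genconfig[1].resources) from the environment. A helper of roughly the following shape would do that; the function name and wiring are assumptions for illustration, not necessarily how the benchmark in this PR reads the variable:

package build_test

import (
    "os"
    "strconv"
    "testing"
)

// intFromEnv returns the integer value of the named environment variable,
// or def when it is unset, logging any override so that it shows up in the
// benchmark output (compare the "redefined as N" lines above).
func intFromEnv(tb testing.TB, name string, def int) int {
    v := os.Getenv(name)
    if v == "" {
        return def
    }
    n, err := strconv.Atoi(v)
    if err != nil {
        tb.Fatalf("invalid %s: %v", name, err)
    }
    tb.Logf("%s redefined as %d", name, n)
    return n
}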

ephesused avatar Nov 02 '23 19:11 ephesused

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 01 '24 15:02 k8s-triage-robot

/remove-lifecycle stale

chlunde avatar Feb 01 '24 21:02 chlunde

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: chlunde. Once this PR has been reviewed and has the lgtm label, please assign koba1t for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Apr 16 '24 19:04 k8s-ci-robot

I've added a few more code comments as requested, rebased on master and updated the code to use labels instead of commonLabels.

I've also reduced the dataset here so that it runs in less than one minute on my machine (48s). On top of PR #5427 it takes around 2.4 seconds. @ncapps is there anything else you would like to see?

chlunde avatar Apr 16 '24 19:04 chlunde

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 21 '24 00:07 k8s-triage-robot

Hi again, @chlunde!

Are you still interested in contributing these changes? I had a few open questions in my previous review that are still awaiting an answer.

/remove-lifecycle stale

stormqueen1990 avatar Jul 31 '24 21:07 stormqueen1990