establish a benchmarking method
Why
For youki, performance is one of the most important concerns. Until now we have relied on individual intuition to judge it, but we should provide an official, easy way to measure it more accurately.
Goal
Provide documentation and scripts to measure performance. Ideally, this should be incorporated into CI and measured on every PR.
An example: https://github.com/containers/youki/pull/447#issuecomment-958991982
I am looking for people who are willing to take on this challenge.
I had actually been fooling around with benchmarking Youki. I had previously used the Criterion crate to benchmark Rust projects for work and have really enjoyed it. I'm also aware of flamegraph-rs, which can produce performance profiles. I'm not sure how well these tools will work, since Youki behaves differently from many other applications (it forks processes, for example), but I currently have a mostly working Criterion setup on this branch.
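For illustration, a minimal sketch of such a Criterion harness (the lifecycle function is only a placeholder for the actual library calls, and the bench target would need `harness = false` in Cargo.toml):

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Placeholder for the real work: driving youki's library code to create,
// start, and delete a container so the whole lifecycle stays in a single
// process (which also makes flamegraph profiling easier).
fn run_container_lifecycle(id: &str) {
    let _ = id;
}

fn lifecycle_benchmark(c: &mut Criterion) {
    let mut n = 0u64;
    c.bench_function("create-start-delete", |b| {
        b.iter(|| {
            // Fresh container id per iteration so deletes don't collide.
            n += 1;
            run_container_lifecycle(&format!("bench-{n}"));
        })
    });
}

criterion_group!(benches, lifecycle_benchmark);
criterion_main!(benches);
```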
A couple of problems I've faced: I had to redo some of the setup logic, there is some general weirdness around relative paths, it's a real pain to run anything but rootless mode since I almost never have a system-wide install of the Rust toolchain, and I currently can't have the benchmarks clean up their containers, because the current delete routine exits the process, which means the test process exits whenever I try to delete a container.
This is just my adventure in trying to leverage Rust tooling to get some benchmarks; perhaps there are other tools that make more sense and can simply execute youki as a separate process. That said, benchmarks are only the most surface-level thing we can do, especially at such a coarse granularity. It would be really nice to get profiling so we can understand how things are performing and where the issues might be.
I'd be interested in diving in further here.
@tsturzl I agree with your point, so what I want to raise here is one part of the benchmarking effort.
I have some ideas for modifications to the libcontainer library, but I am afraid they could affect performance. So I want to open a PR that adds a relative benchmark as a GitHub Action, triggered by a label, which compares the benchmark results of each PR's build against the main branch's build.
Since the GitHub Action runs independently and remotely, it won't mess up anyone's local environment, so I think this could be added as a first step.
The comparison would be quite simple: use the tool @utam0k used and run a few commands with youki under the same conditions, such as echoing a string or listing a path. That's all.
If this is not an acceptable approach, please correct and guide me 🙇‍♂️
@tommady Your suggestion is a good first step. Looks good, I agree with you. Let's give it a try. I'm imagining something like https://github.com/TaKO8Ki/frum/pull/16.
@Furisto @yihuaf If you have any suggestions, please let me know.
@tommady In general I am not a fan of GitHub Actions or CI for performance benchmarks. A local machine, ideally a dedicated physical box, should be preferred. There are too many variables we can't control on GitHub Actions, and we run the risk of garbage in, garbage out.
In terms of benchmarking, we have two separate pieces.
The first is benchmarking youki as a binary. The goal here should be automation to benchmark a variety of different use cases and compare against runc and/or crun. hyperfine is a good choice. We can also explore whether we can use hyperfine as a library to produce something similar to our integration test crate.
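For illustration, a minimal sketch of that kind of automation, assuming a runc-compatible create/start/delete CLI and an already prepared bundle directory (the binary paths, bundle path, and sample count are all made up):

```rust
use std::process::Command;
use std::time::{Duration, Instant};

// Time one create-start-delete cycle for a runtime binary, assuming a
// runc-compatible CLI and a prepared OCI bundle directory.
fn lifecycle(runtime: &str, bundle: &str, id: &str) -> Duration {
    let start = Instant::now();
    for args in [
        vec!["create", "--bundle", bundle, id],
        vec!["start", id],
        vec!["delete", "--force", id],
    ] {
        let status = Command::new(runtime)
            .args(&args)
            .status()
            .expect("failed to spawn runtime");
        assert!(status.success(), "{runtime} {args:?} failed");
    }
    start.elapsed()
}

fn main() {
    // Compare runtimes under identical conditions (placeholder paths/counts).
    for runtime in ["./youki", "runc", "crun"] {
        let mut total = Duration::ZERO;
        for i in 0..10 {
            total += lifecycle(runtime, "./bundle", &format!("bench-{i}"));
        }
        println!("{runtime}: mean lifecycle {:?}", total / 10);
    }
}
```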
The second piece is benchmarking individual functions and/or modules. This will help us narrow down the places where we need optimization. It should be used in addition to a profiler that helps us understand things more deeply. I believe cargo and Rust have built-in benchmark support, similar to unit testing. This will likely give us a better signal on where to optimize.
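For illustration, a minimal sketch of the nightly-only built-in bench harness; the measured function is just a stand-in for whatever libcontainer code path we want to isolate (Criterion is the stable alternative):

```rust
// Runs with `cargo +nightly bench`, alongside the unit tests.
#![feature(test)]
extern crate test;

// Stand-in for a real libcontainer code path, e.g. spec parsing or cgroup setup.
fn code_path_under_test() -> usize {
    (0..1_000usize).map(|x| x * 2).sum()
}

#[cfg(test)]
mod benches {
    use super::*;
    use test::Bencher;

    #[bench]
    fn bench_code_path(b: &mut Bencher) {
        // black_box keeps the compiler from optimizing the work away;
        // the harness reports ns/iter.
        b.iter(|| test::black_box(code_path_under_test()));
    }
}
```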
Let me also reach out to a few friends who specialize in performance testing and see if they have better advice on how we can do this.
@yihuaf Absolutely, and I was about to say the same thing. GitHub Actions are simply not a good or stable way of performing benchmarks. We don't know how those VMs are set up, they could burst in performance, they could be suffocated by a noisy neighbor; the performance is too varied to expect reliable benchmarks.
Flamegraph actually has an excellent and detailed guide on analyzing performance, and I'd really recommend reading it. I think it's similar to what you're saying: benchmark at the broader scope, profile it, and use that profiling to select your optimization targets.
My Criterion-based benchmarks are admittedly quite hacky at the moment, but I was hoping to simplify things by having the benchmark run all the steps in a single process. The current hyperfine benchmarks already treat create-start-delete as one iteration, so this is no different in that respect. The benefit is that it's much easier to get flamegraph profiling over a single process rather than three. Also, Criterion is exactly what you're talking about: it builds on Rust's benchmarking capabilities. Both flamegraph and Criterion essentially extend Rust tooling, and my branch is basically achieving exactly what hyperfine would be doing.
I hope I'm being clear about what I'm trying to say. I definitely recommend looking at Criterion, as it's probably the most popular benchmarking tool for Rust. Starting with it may also provide a solid foundation for creating benchmarks around specific areas of concern.
As for a strategy for benchmarking regressions and improvements: I think benchmarks should be specific to the local machine you're benchmarking on, meaning you never check in your benchmark results and they are not tracked by git. Instead, you record the last benchmark to compare against; that way you can switch branches to compare performance against the main branch or a specific release. This does leave a lot for the user to do, but it's probably the only convenient way we can measure changes in performance reliably. It would also be up to you to communicate the performance change in your PR comment, and that change should be relatively easy for a reviewer to cross-examine.
Maybe once we have a reliable benchmarking strategy we can think about setting up some dedicated hardware, provided an individual or group would be kind enough to provide those resources, and then we could automatically benchmark each PR against main or the latest release/tag.
Hi @tsturzl @yihuaf, I totally agree with your points, but what I am thinking about is just what this article says:
What if we only want to detect regressions introduced in a PR?
Or in the current commit against the last release?
We do not need absolute measurements, just a way to compare two commits in a reliable way.
This is exactly what we meant by relative benchmarking.
The idea is to run the benchmarking suite in the same CI job for the two commits we are comparing.
This way, we guarantee we are using the exact same machine and configuration for both.
By doing this, we are essentially canceling out the background noise and can estimate the performance ratio of the two commits.
So let the CI do a relative benchmark (in my idea the benchmark should be fast and small) to provide a comparison matrix that gives everyone a quick view.
WDYT? 🙇🏻
@tommady I think even a relative benchmark can be inconsistent enough to produce misleading results. Sometimes a benchmark might not be heavily influenced by the system, but in this case, to get a reliable or useful benchmark you need to warm up and then run a number of iterations, which can be influenced by too many external factors. I'd be willing to see this tested, but my intuition is that a hyperfine or Criterion benchmark in CI will be more misleading than useful.
However, there might be some metric of performance we can use. This article is pretty interesting and possibly provides a solution. We could use some of the metrics from something like Cachegrind to analyze the performance characteristics of the program in CI. The only downside is that it's pretty low-level and certainly harder to interpret than execution time. Unfortunately, I don't think it's a good way to account for off-CPU time, but it may be a start.
Sorry, I didn't word that well. Of course, measuring on a local machine is preferable. However, I think comparing performance against the main branch in a PR's CI is still useful, even if only as a rough guide. I don't think it's realistic for reviewers to benchmark every PR on their local machine each time. However, I have never attempted this and would like to hear your insights. If it's easy to implement, rather than keep discussing it, we could just try hyperfine, look at the results, and then make a decision.
Maybe a good strategy here is to signal the need for benchmarks on a specific PR using something more deterministic for performance insight. For example, we could use something like Cachegrind to analyze a PR branch, and if the CPU instruction count moves beyond a certain threshold, prompt someone to run the benchmark tools against master. Cachegrind would then be used to decide whether a benchmark is even needed, while we still have a way to catch changes that greatly improve or degrade performance, and we could then reassess with a benchmark on a common piece of dedicated hardware (i.e. the reviewer's computer). We could even use CPU instruction count as the main metric to begin with, instead of execution time.
I have a hard time getting a benchmark to run with less than a 6% deviation on my machine, and I think that only gets worse in a GitHub Action. I think Cachegrind gives us a deterministic, high-confidence metric to gauge our performance by. I also think youki in general is subject to a high level of benchmark variance because it relies heavily on the kernel; it's not like benchmarking an algorithm. That's also why SQLite uses Cachegrind for its performance work rather than relying on wall-clock benchmarks: performance can vary too much for outside reasons. The more I read, the less I feel a benchmark is all that useful a metric on its own; it may only really be useful for comparing youki against other runtimes like runc and crun, or when evaluated together with a profiler that actually makes sense of the numbers.
Some relevant reading:
@tsturzl I'd like to see performance measurements using Cachegrind. Could you please try it? I think it would be good to see how messy performance measurement on GitHub Actions can be. We have already gotten hyperfine measurements with a simple command.