Add benchmark tooling
This adds a way to compare benchmarks, hopefully allowing for better comparisons when applying changes.
We use this in https://github.com/kubernetes/kube-state-metrics and it can be integrated with via Github Actions as well if desired.
Thanks for the PR.
it can be integrated with via Github Actions as well if desired.
Yes, it's desired!
/ok-to-test
Please take a look at the workflow failure.
Hey @ahrtr, I think there's value in enabling this with my suggestion due to the recent findings on #720. How would you suggest to proceed with this PR?
I think we should compare the performance between a PR and the main branch.
if PR and main is fine, i can remove the comparison to the previous release.
if PR and main is fine, i can remove the comparison to the previous release.
Yes, please.
Hi @mrueg, it looks like it's failing because the benchmark job is timing out. We could either just increase the timeout or also split the benchmark into two jobs. I think the latter option would be better because I feel it would optimize for runtime and generate the output faster. What do you think? I can do a PoC or continue the implementation if you don't have the capacity.
Edit: added bold for emphasis on what I initially meant :sweat_smile:
@ivanvc yes, please feel free to take over. I'm currently lacking time to work on this.
I have a branch with the functional code for this. However, benchstat needs at least six benchmark runs to generate a prediction with 95% confidence. Or at least two to give a prediction with around 50%. Each run takes about 20 minutes. IMHO, it's way too long to wait one hour for the result. @ahrtr, thoughts?
However,
benchstatneeds at least six benchmark runs to generate a prediction with 95% confidence.
In that case, can we add it as a nightly check?
However,
benchstatneeds at least six benchmark runs to generate a prediction with 95% confidence.In that case, can we add it as a nightly check?
What would it compare against? This would be running against a PR
However,
benchstatneeds at least six benchmark runs to generate a prediction with 95% confidence.In that case, can we add it as a nightly check?
What would it compare against? This would be running against a PR
Right, it's unrealistic if we need to compare PR vs main.
- If we run the workflow check on each PR, then we can try larger runner
ubuntu-latest-8-cores. - Not each PR needs to run performance comparison, such as just document changes or typos. We can also add conditions to control the workflow execution, i.e manually add label
performance/compareto trigger it?
Also it might make more sense to compare performance using the bbolt bench tool. I think we can add a separate workflow check for that.
Here's some discussion and progress regarding these last comments:
I currently managed to do three runs using ubuntu-latest-8-cores runners and tunning the running parameters with {ksize,vsize}=512, which runs under 30 minutes. Using a confidence=0.75, benchstat can generate a valid output. I formatted the result as a Markdown table and published it as a GitHub workflow job summary (we could also send a comment to the PR to make this more visible). See the result here: https://github.com/etcd-io/bbolt/actions/runs/8995575434.
However, I wonder if my original idea of splitting the benchmarks into two runners makes sense because if we have a job scheduled in a degraded host, it may impact the benchmark's result. Doing both runs on the same runner would mean that instead of 30 minutes, the run time would increase to one hour.
Also it might make more sense to compare performance using the bbolt bench tool.
That was my intention with #739, which makes sense. But we'll need to define a set of parameters to run the benchmarks. I'm not sure who would be the best to define them (I'm guessing you @ahrtr 😅), but I suggest keeping that conversation in that issue.
Another idea would be to run these benchmarks as Prow jobs in the Kubernetes infra. In that case, I understand that the max CPU we can request is 7 CPUs. So, I'm unsure if they will be any better than GitHub's ubuntu-latest-8-cores.
See the result here: https://github.com/etcd-io/bbolt/actions/runs/8995575434.
Look great, thank.
which runs under 30 minutes.
It seems that you run the test in parallel. If you run them sequentially, it will be longer. https://github.com/etcd-io/bbolt/pull/750#discussion_r1594281959
Doing both runs on the same runner
I think we should follow this approach to ensure both tests have exact the same environment.
It seems that you run the test in parallel. If you run them sequentially, it will be longer. #750 (comment)
Yes, if we run them sequentially, it will take about one hour to complete the job. Is that acceptable?
It seems that you run the test in parallel. If you run them sequentially, it will be longer. #750 (comment)
Yes, if we run them sequentially, it will take about one hour to complete the job. Is that acceptable?
+1 for running it in sequentially. Multiple fdatasync syscalls would impact each other. Sequence makes sense here.
In that case, I understand that the max CPU we can request is 7 CPUs. So, I'm unsure if they will be any better than GitHub's ubuntu-latest-8-cores.
I remembered that cncf provides 16 cores for us. You can try it with ubuntu-latest-16-cores.
I remembered that cncf provides 16 cores for us. You can try it with ubuntu-latest-16-cores.
That's good to know. I'll try it out with them.
I remembered that cncf provides 16 cores for us. You can try it with ubuntu-latest-16-cores.
That's good to know. I'll try it out with them.
Uh, it looks like we don't have 16 core runners. The job is stalled @ https://github.com/etcd-io/bbolt/actions/runs/9067796547/job/24913688424.
I could try tuning the running parameters to make it finish faster.
Uh, it looks like we don't have 16 core runners. The job is stalled @ https://github.com/etcd-io/bbolt/actions/runs/9067796547/job/24913688424.
Hmm, I don't have permission to check this in etcd-io org.
Org → Settings → Actions → Runner groups → Default Large Runners
- Select repositories
Maybe @ahrtr can help check it 😆
Uh, it looks like we don't have 16 core runners. The job is stalled @ https://github.com/etcd-io/bbolt/actions/runs/9067796547/job/24913688424.
Hmm, I don't have permission to check this in etcd-io org.
Org → Settings → Actions → Runner groups → Default Large Runners
Hey Team - I believe this will require raising issue to kubernetes as etcd-io org is now managed under Kubernetes GitHub Enterprise account, for context please read:https://github.com/kubernetes/org/issues/4590
@jmhbnz thanks for quick update. Last time, I remember that we have 3 large runners: 4 cores, 8 cores and 16 cores. Read that issue mentioned, the runners had been shutdown during migration. So I believe it has been deleted. If we still need 16 cores, we have to file cncf ticket to re-able it.
We only have ubuntu-latest-8-cores enabled for bbolt for now.
Yes, if we run them sequentially, it will take about one hour to complete the job. Is that acceptable?
Have you confirmed that (run them sequentially)? It takes about 50m for the existing tests/coverage test, so one hour might be accepted.
Have you confirmed that (run them sequentially)? It takes about 50m for the existing tests/coverage test, so one hour might be accepted.
But I have no any objection if anyone wants to request a 16 core runner :)
It takes about one hour to run the benchmarks sequentially: https://github.com/etcd-io/bbolt/actions/runs/9087376452
If this is acceptable, we can roll it out with 8-core runners. However, I still wonder if the Prow infra would better fit this.
If this is acceptable, we can roll it out with 8-core runners. However, I still wonder if the Prow infra would better fit this.
Thanks. It's accepted to me. We can add it into github workflow firstly, and consider migrating to Prow later.
I'll clean up the code and mark the other PR ready for review.
Thanks @mrueg for initiating the change, and the final implementation was completed by @ivanvc in https://github.com/etcd-io/bbolt/pull/750
Thanks @ivanvc for getting the change in!