bbolt icon indicating copy to clipboard operation
bbolt copied to clipboard

Add benchmark tooling

Open mrueg opened this issue 2 years ago • 13 comments

This adds a way to compare benchmarks, hopefully allowing for better comparisons when applying changes.

We use this in https://github.com/kubernetes/kube-state-metrics and it can be integrated with via Github Actions as well if desired.

mrueg avatar Feb 08 '24 22:02 mrueg

Thanks for the PR.

it can be integrated with via Github Actions as well if desired.

Yes, it's desired!

ahrtr avatar Feb 09 '24 09:02 ahrtr

/ok-to-test

jmhbnz avatar Feb 11 '24 18:02 jmhbnz

Please take a look at the workflow failure.

ahrtr avatar Feb 16 '24 08:02 ahrtr

Hey @ahrtr, I think there's value in enabling this with my suggestion due to the recent findings on #720. How would you suggest to proceed with this PR?

ivanvc avatar Apr 23 '24 18:04 ivanvc

I think we should compare the performance between a PR and the main branch.

ahrtr avatar Apr 23 '24 18:04 ahrtr

if PR and main is fine, i can remove the comparison to the previous release.

mrueg avatar Apr 25 '24 11:04 mrueg

if PR and main is fine, i can remove the comparison to the previous release.

Yes, please.

ahrtr avatar Apr 30 '24 15:04 ahrtr

Hi @mrueg, it looks like it's failing because the benchmark job is timing out. We could either just increase the timeout or also split the benchmark into two jobs. I think the latter option would be better because I feel it would optimize for runtime and generate the output faster. What do you think? I can do a PoC or continue the implementation if you don't have the capacity.

Edit: added bold for emphasis on what I initially meant :sweat_smile:

ivanvc avatar May 03 '24 02:05 ivanvc

@ivanvc yes, please feel free to take over. I'm currently lacking time to work on this.

mrueg avatar May 03 '24 16:05 mrueg

I have a branch with the functional code for this. However, benchstat needs at least six benchmark runs to generate a prediction with 95% confidence. Or at least two to give a prediction with around 50%. Each run takes about 20 minutes. IMHO, it's way too long to wait one hour for the result. @ahrtr, thoughts?

ivanvc avatar May 07 '24 04:05 ivanvc

However, benchstat needs at least six benchmark runs to generate a prediction with 95% confidence.

In that case, can we add it as a nightly check?

ahrtr avatar May 07 '24 09:05 ahrtr

However, benchstat needs at least six benchmark runs to generate a prediction with 95% confidence.

In that case, can we add it as a nightly check?

What would it compare against? This would be running against a PR

mrueg avatar May 07 '24 09:05 mrueg

However, benchstat needs at least six benchmark runs to generate a prediction with 95% confidence.

In that case, can we add it as a nightly check?

What would it compare against? This would be running against a PR

Right, it's unrealistic if we need to compare PR vs main.

  • If we run the workflow check on each PR, then we can try larger runner ubuntu-latest-8-cores.
  • Not each PR needs to run performance comparison, such as just document changes or typos. We can also add conditions to control the workflow execution, i.e manually add label performance/compare to trigger it?

Also it might make more sense to compare performance using the bbolt bench tool. I think we can add a separate workflow check for that.

ahrtr avatar May 07 '24 10:05 ahrtr

Here's some discussion and progress regarding these last comments:

I currently managed to do three runs using ubuntu-latest-8-cores runners and tunning the running parameters with {ksize,vsize}=512, which runs under 30 minutes. Using a confidence=0.75, benchstat can generate a valid output. I formatted the result as a Markdown table and published it as a GitHub workflow job summary (we could also send a comment to the PR to make this more visible). See the result here: https://github.com/etcd-io/bbolt/actions/runs/8995575434.

However, I wonder if my original idea of splitting the benchmarks into two runners makes sense because if we have a job scheduled in a degraded host, it may impact the benchmark's result. Doing both runs on the same runner would mean that instead of 30 minutes, the run time would increase to one hour.

Also it might make more sense to compare performance using the bbolt bench tool.

That was my intention with #739, which makes sense. But we'll need to define a set of parameters to run the benchmarks. I'm not sure who would be the best to define them (I'm guessing you @ahrtr 😅), but I suggest keeping that conversation in that issue.

Another idea would be to run these benchmarks as Prow jobs in the Kubernetes infra. In that case, I understand that the max CPU we can request is 7 CPUs. So, I'm unsure if they will be any better than GitHub's ubuntu-latest-8-cores.

ivanvc avatar May 08 '24 11:05 ivanvc

See the result here: https://github.com/etcd-io/bbolt/actions/runs/8995575434.

Look great, thank.

which runs under 30 minutes.

It seems that you run the test in parallel. If you run them sequentially, it will be longer. https://github.com/etcd-io/bbolt/pull/750#discussion_r1594281959

Doing both runs on the same runner

I think we should follow this approach to ensure both tests have exact the same environment.

ahrtr avatar May 08 '24 16:05 ahrtr

It seems that you run the test in parallel. If you run them sequentially, it will be longer. #750 (comment)

Yes, if we run them sequentially, it will take about one hour to complete the job. Is that acceptable?

ivanvc avatar May 09 '24 11:05 ivanvc

It seems that you run the test in parallel. If you run them sequentially, it will be longer. #750 (comment)

Yes, if we run them sequentially, it will take about one hour to complete the job. Is that acceptable?

+1 for running it in sequentially. Multiple fdatasync syscalls would impact each other. Sequence makes sense here.

In that case, I understand that the max CPU we can request is 7 CPUs. So, I'm unsure if they will be any better than GitHub's ubuntu-latest-8-cores.

I remembered that cncf provides 16 cores for us. You can try it with ubuntu-latest-16-cores.

fuweid avatar May 13 '24 11:05 fuweid

I remembered that cncf provides 16 cores for us. You can try it with ubuntu-latest-16-cores.

That's good to know. I'll try it out with them.

ivanvc avatar May 13 '24 18:05 ivanvc

I remembered that cncf provides 16 cores for us. You can try it with ubuntu-latest-16-cores.

That's good to know. I'll try it out with them.

Uh, it looks like we don't have 16 core runners. The job is stalled @ https://github.com/etcd-io/bbolt/actions/runs/9067796547/job/24913688424.

I could try tuning the running parameters to make it finish faster.

ivanvc avatar May 13 '24 23:05 ivanvc

Uh, it looks like we don't have 16 core runners. The job is stalled @ https://github.com/etcd-io/bbolt/actions/runs/9067796547/job/24913688424.

Hmm, I don't have permission to check this in etcd-io org.

Org → Settings → Actions → Runner groups → Default Large Runners

  • Select repositories

Maybe @ahrtr can help check it 😆

fuweid avatar May 14 '24 05:05 fuweid

Uh, it looks like we don't have 16 core runners. The job is stalled @ https://github.com/etcd-io/bbolt/actions/runs/9067796547/job/24913688424.

Hmm, I don't have permission to check this in etcd-io org.

Org → Settings → Actions → Runner groups → Default Large Runners

Hey Team - I believe this will require raising issue to kubernetes as etcd-io org is now managed under Kubernetes GitHub Enterprise account, for context please read:https://github.com/kubernetes/org/issues/4590

jmhbnz avatar May 14 '24 05:05 jmhbnz

@jmhbnz thanks for quick update. Last time, I remember that we have 3 large runners: 4 cores, 8 cores and 16 cores. Read that issue mentioned, the runners had been shutdown during migration. So I believe it has been deleted. If we still need 16 cores, we have to file cncf ticket to re-able it.

fuweid avatar May 14 '24 06:05 fuweid

We only have ubuntu-latest-8-cores enabled for bbolt for now.

Yes, if we run them sequentially, it will take about one hour to complete the job. Is that acceptable?

Have you confirmed that (run them sequentially)? It takes about 50m for the existing tests/coverage test, so one hour might be accepted.

ahrtr avatar May 14 '24 07:05 ahrtr

Have you confirmed that (run them sequentially)? It takes about 50m for the existing tests/coverage test, so one hour might be accepted.

But I have no any objection if anyone wants to request a 16 core runner :)

ahrtr avatar May 14 '24 07:05 ahrtr

It takes about one hour to run the benchmarks sequentially: https://github.com/etcd-io/bbolt/actions/runs/9087376452

If this is acceptable, we can roll it out with 8-core runners. However, I still wonder if the Prow infra would better fit this.

ivanvc avatar May 15 '24 00:05 ivanvc

If this is acceptable, we can roll it out with 8-core runners. However, I still wonder if the Prow infra would better fit this.

Thanks. It's accepted to me. We can add it into github workflow firstly, and consider migrating to Prow later.

ahrtr avatar May 15 '24 09:05 ahrtr

I'll clean up the code and mark the other PR ready for review.

ivanvc avatar May 15 '24 17:05 ivanvc

Thanks @mrueg for initiating the change, and the final implementation was completed by @ivanvc in https://github.com/etcd-io/bbolt/pull/750

ahrtr avatar Jun 28 '24 09:06 ahrtr

Thanks @ivanvc for getting the change in!

mrueg avatar Jun 28 '24 11:06 mrueg