kbench
Add more test modes and implement a metric exporter
- Deprecate the flag `QUICK_MODE` and replace it with `MODE`
- Add more test modes
- Implement a metric exporter. When the metric exporter is deployed together with the main test container, it exports the test data as Prometheus metrics for collection. This is useful when users run a long kbench test and want to observe the performance of the volume over time. For example, users can deploy multiple kbench pods with the metric exporter enabled, observe the performance of each pod's volume over time, and see how performance is impacted as the number of kbench pods in the cluster increases. (A rough sketch of such an exporter is shown below.)
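To make the idea more concrete, here is a minimal sketch (not the actual code in this PR) of a sidecar that reads a fio result from a shared file and exposes it as a Prometheus gauge. The metric name, label, file path, and port are illustrative assumptions:

```go
// Hypothetical sidecar exporter sketch: the kbench container is assumed to
// write its latest result (e.g. random read IOPS) to a shared file, and this
// process exposes it on /metrics for Prometheus to scrape.
package main

import (
	"log"
	"net/http"
	"os"
	"strconv"
	"strings"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var readIOPS = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kbench_random_read_iops", // hypothetical metric name
		Help: "Latest random read IOPS measured by kbench.",
	},
	[]string{"volume"},
)

func main() {
	prometheus.MustRegister(readIOPS)

	// Periodically pick up the latest result written by the test container.
	go func() {
		for {
			if b, err := os.ReadFile("/shared/random-read-iops.txt"); err == nil { // assumed path
				if v, err := strconv.ParseFloat(strings.TrimSpace(string(b)), 64); err == nil {
					readIOPS.WithLabelValues(os.Getenv("VOLUME_NAME")).Set(v)
				}
			}
			time.Sleep(30 * time.Second)
		}
	}()

	// Expose the metrics endpoint.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```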
Ref:
- https://github.com/longhorn/longhorn/pull/7905
- https://github.com/longhorn/longhorn/pull/7616
- https://github.com/longhorn/longhorn/issues/2598
I will provide a sample report to demonstrate the use case for this PR in more detail.
For example, this is a test run on a cluster of 3 worker nodes, each with 4 CPUs and 8 GiB of RAM, using the following test parameters:
- Longhorn RWO volumes: each kbench pod runs the test on a Longhorn RWO volume attached to it
- The test mode and run settings are:
  - `MODE: random-read-iops`: kbench runs the `random-read-iops` job only
  - `LONG_RUN: true`: kbench runs the `random-read-iops` job repeatedly
- The `metric-exporter` container is deployed in the same pod as the `kbench` container to export the test results as Prometheus metrics
- The user scales the number of workloads (the number of kbench StatefulSets) up from 1 and observes the performance over time (a rough sketch of the MODE/LONG_RUN loop follows below)
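To illustrate how `MODE` and `LONG_RUN` drive the test loop, here is a rough Go sketch; `runJob`, the env-var handling, and the `QUICK_MODE` mapping are placeholders and assumptions, not kbench's actual implementation:

```go
// Sketch of the test loop: MODE selects the job, LONG_RUN repeats it,
// and the deprecated QUICK_MODE flag is mapped onto MODE for compatibility.
package main

import (
	"log"
	"os"
)

func main() {
	mode := os.Getenv("MODE") // e.g. "random-read-iops"
	if mode == "" && os.Getenv("QUICK_MODE") == "true" {
		// Deprecated flag: fall back to the old quick behavior.
		mode = "quick"
	}

	longRun := os.Getenv("LONG_RUN") == "true"
	for {
		log.Printf("running job %q", mode)
		runJob(mode) // hypothetical helper that would invoke the fio job for this mode
		if !longRun {
			break // single run unless LONG_RUN: true
		}
	}
}

func runJob(mode string) {
	// Placeholder: the real kbench invokes fio with the parameters for this mode.
}
```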
This is the Grafana graph for this test:
From the graph we can see that:
- When there is only 1 workload pod, the volume achieves about 10.7k random IOPS
- When the user scales the number of workload pods to 3:
  - Each volume achieves around 7.2k random IOPS
  - The total random IOPS across all volumes is about 21.6k
- When the user scales the number of workload pods to 6:
  - Each volume achieves around 3.4k random IOPS
  - The total random IOPS across all volumes is about 20.4k
We can see that there is an upper bound on the total random IOPS achievable in this cluster (~21K). The number of IOPS each volume can achieve (x) and the number of volumes (y) roughly form a reciprocal relation: x * y ≈ 21000. For example, with 6 volumes the relation predicts about 21000 / 6 = 3500 IOPS per volume, close to the observed 3.4k. If the user keeps scaling up the number of pods, eventually this reciprocal relation will no longer hold as CPU contention and other factors kick in (i.e. x * y becomes smaller and smaller), or the Longhorn instance-manager pods may even crash, which crashes the Longhorn volumes.
Users can use this test result to estimate answers to questions such as:
- Longhorn can push up to about 21K random IOPS in this cluster
- If each of your workload pods does about 1K random IOPS on average, you can run an estimated 18 pods (not 21, because of the overhead of running more pods)
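As a back-of-the-envelope check of that estimate, here is a tiny sketch; the 14% headroom factor is an assumed figure chosen only to illustrate why the estimate lands below 21 pods:

```go
// Rough pod-count estimate from the measured numbers above: cluster ceiling
// ~21K random IOPS, ~1K random IOPS per workload pod, minus assumed headroom
// for per-pod overhead.
package main

import "fmt"

func main() {
	const clusterIOPSCeiling = 21000.0
	const perPodIOPS = 1000.0
	const overheadHeadroom = 0.14 // assumption, not a measured value

	estimatedPods := int(clusterIOPSCeiling * (1 - overheadHeadroom) / perPodIOPS)
	fmt.Printf("estimated max workload pods: %d\n", estimatedPods) // prints 18
}
```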
We are building the Longhorn performance report using this PR: https://github.com/longhorn/longhorn/pull/7905
@PhanLe1010 LGTM in general. Two things:
- We should use a newer Golang version than 1.17.
- We need to update the docs.
Also, I decided not to use Dapper before because I wanted this to stay vendor-neutral. However, if we decide to go that route, it's easy to transfer the ownership of this project to Longhorn. @innobead what do you think?
Integrating a metrics exporter is a good addition to this project, as it gives users a better way to monitor and report on IO benchmarks.
Dapper is just one of the build solutions, so for me, the change here is still vendor-neutral. However, it would be beneficial to transfer this project to Longhorn, as we could build the multi-arch image automatically in Longhorn CI for our needs. Besides, there could be some potential feature integration with Longhorn, such as benchmarking at runtime.
If @yasker is willing to transfer this project, then sounds good to me.
@innobead Sounds good. I will transfer it over then.
Thanks all! I will create the PR into the new Longhorn repo
This PR is ready for review cc @derekbit @shuo-wu @ejweber @james-munson
For the TODO items: I will update the docs and upgrade the Golang version tomorrow.
I will open a new PR for this