kbench
Add more test modes and implement a metric exporter
- Deprecate the flag `QUICK_MODE` and replace it with `MODE`
- Add more test modes
- Implement a metric exporter. When the metric exporter is deployed together with the main test container, it exports the test data as Prometheus metrics for collection. This is useful when users run a long kbench test and want to observe the performance of the volume over time. For example, users can deploy multiple kbench pods with the metric exporter enabled, observe the performance of each pod's volume over time, and see how performance is impacted as the number of kbench pods in the cluster increases. (A rough sketch of such an exporter is shown below.)
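To make the idea more concrete, here is a minimal sketch (not the actual code in this PR) of a sidecar that reads a fio result from a shared file and exposes it as a Prometheus gauge. The metric name, label, file path, and port are illustrative assumptions:

```go
// Hypothetical sidecar exporter sketch: the kbench container is assumed to
// write its latest result (e.g. random read IOPS) to a shared file, and this
// process exposes it on /metrics for Prometheus to scrape.
package main

import (
	"log"
	"net/http"
	"os"
	"strconv"
	"strings"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var readIOPS = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kbench_random_read_iops", // hypothetical metric name
		Help: "Latest random read IOPS measured by kbench.",
	},
	[]string{"volume"},
)

func main() {
	prometheus.MustRegister(readIOPS)

	// Periodically pick up the latest result written by the test container.
	go func() {
		for {
			if b, err := os.ReadFile("/shared/random-read-iops.txt"); err == nil { // assumed path
				if v, err := strconv.ParseFloat(strings.TrimSpace(string(b)), 64); err == nil {
					readIOPS.WithLabelValues(os.Getenv("VOLUME_NAME")).Set(v)
				}
			}
			time.Sleep(30 * time.Second)
		}
	}()

	// Expose the metrics endpoint.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```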
Ref:
- https://github.com/longhorn/longhorn/pull/7905
- https://github.com/longhorn/longhorn/pull/7616
- https://github.com/longhorn/longhorn/issues/2598
I will provide a sample report to demonstrate the use case for this PR in more detail.
For example, this is a test run on a cluster of 3 worker nodes, each with 4 CPUs and 8 GiB of RAM, using the following test parameters:
- Longhorn RWO volumes: each kbench pod runs the test on a Longhorn RWO volume attached to it
- The test mode and run settings are:
  - `MODE: random-read-iops`: kbench runs the `random-read-iops` job only
  - `LONG_RUN: true`: kbench runs the `random-read-iops` job repeatedly
- The `metric-exporter` container is deployed in the same pod as the `kbench` container to export the test results as Prometheus metrics
- The user scales the number of workloads (the number of kbench StatefulSets) up from 1 and observes the performance over time (a rough sketch of the MODE/LONG_RUN loop follows below)
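To illustrate how `MODE` and `LONG_RUN` drive the test loop, here is a rough Go sketch; `runJob`, the env-var handling, and the `QUICK_MODE` mapping are placeholders and assumptions, not kbench's actual implementation:

```go
// Sketch of the test loop: MODE selects the job, LONG_RUN repeats it,
// and the deprecated QUICK_MODE flag is mapped onto MODE for compatibility.
package main

import (
	"log"
	"os"
)

func main() {
	mode := os.Getenv("MODE") // e.g. "random-read-iops"
	if mode == "" && os.Getenv("QUICK_MODE") == "true" {
		// Deprecated flag: fall back to the old quick behavior.
		mode = "quick"
	}

	longRun := os.Getenv("LONG_RUN") == "true"
	for {
		log.Printf("running job %q", mode)
		runJob(mode) // hypothetical helper that would invoke the fio job for this mode
		if !longRun {
			break // single run unless LONG_RUN: true
		}
	}
}

func runJob(mode string) {
	// Placeholder: the real kbench invokes fio with the parameters for this mode.
}
```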
This is the Grafana graph for this test:
From the graph we can see that:
- When there is only 1 workload pod, the volume achieves about 10.7k random IOPS
- When the user scales the number of workload pods to 3:
  - Each volume achieves around 7.2k random IOPS
  - The total random IOPS across all volumes is about 21.6k
- When the user scales the number of workload pods to 6:
  - Each volume achieves around 3.4k random IOPS
  - The total random IOPS across all volumes is about 20.4k
We can see that there is an upper bound on the total random IOPS achievable in this cluster (~21K). The number of IOPS each volume can achieve (x) and the number of volumes (y) roughly form a reciprocal relation: x * y ≈ 21000. For example, with 6 volumes the relation predicts about 21000 / 6 = 3500 IOPS per volume, close to the observed 3.4k. If the user keeps scaling up the number of pods, eventually this reciprocal relation will no longer hold as CPU contention and other factors kick in (i.e. x * y becomes smaller and smaller), or the Longhorn instance-manager pods may even crash, which crashes the Longhorn volumes.
Users can use this test result to estimate answers to questions such as:
- Longhorn can push up to about 21K random IOPS in this cluster
- If each of your workload pods does about 1K random IOPS on average, you can run an estimated 18 pods (not 21, because of the overhead of running more pods)
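As a back-of-the-envelope check of that estimate, here is a tiny sketch; the 14% headroom factor is an assumed figure chosen only to illustrate why the estimate lands below 21 pods:

```go
// Rough pod-count estimate from the measured numbers above: cluster ceiling
// ~21K random IOPS, ~1K random IOPS per workload pod, minus assumed headroom
// for per-pod overhead.
package main

import "fmt"

func main() {
	const clusterIOPSCeiling = 21000.0
	const perPodIOPS = 1000.0
	const overheadHeadroom = 0.14 // assumption, not a measured value

	estimatedPods := int(clusterIOPSCeiling * (1 - overheadHeadroom) / perPodIOPS)
	fmt.Printf("estimated max workload pods: %d\n", estimatedPods) // prints 18
}
```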
We are building the Longhorn performance report using this PR: https://github.com/longhorn/longhorn/pull/7905
@PhanLe1010 LGTM in general. Two things:
- We should use a newer Golang version than 1.17.
- We need to update the docs.
Also, I decided not to use Dapper before because I wanted this to stay vendor-neutral. However, if we decide to go that route, it's easy to transfer the ownership of this project to Longhorn. @innobead what do you think?
Integrating a metrics exporter is a good addition to this project, as it gives users a better way to monitor and report on IO benchmarks.
Dapper is just one of the build solutions, so for me, the change here is still vendor-neutral. However, it would be beneficial to transfer this project to Longhorn, as we could build the multi-arch image automatically in Longhorn CI for our needs. Besides, there could be some potential feature integration with Longhorn, such as benchmarking at runtime.
If @yasker is willing to transfer this project, then sounds good to me.
@innobead Sounds good. I will transfer it over then.
Thanks all! I will create the PR into the new Longhorn repo
This PR is ready for review cc @derekbit @shuo-wu @ejweber @james-munson
For the TODO items: I will update the docs and upgrade the Golang version tomorrow.
I will open a new PR for this