
KAFKA-17730: Fix ReplicaFetcherThreadBenchmark

Open · wernerdv opened this issue · 2 comments

Running the benchmark throws a NotLeaderOrFollowerException here: https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/RemoteLeaderEndPoint.scala#L188

The current fix is to catch and ignore NotLeaderOrFollowerException.
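The catch-and-ignore approach can be sketched in plain Java as below. This is not the code from this PR: `fetchOffset` and `fetchAll` are hypothetical, and `NotLeaderOrFollowerException` is a local stand-in for Kafka's `org.apache.kafka.common.errors.NotLeaderOrFollowerException`, so the example compiles without Kafka on the classpath. It shows that only that one exception type is swallowed, so any other failure still surfaces.

```java
import java.util.ArrayList;
import java.util.List;

public class IgnoreNotLeaderSketch {

    // Hypothetical local stand-in for org.apache.kafka.common.errors.NotLeaderOrFollowerException,
    // defined here so the sketch compiles without Kafka.
    static class NotLeaderOrFollowerException extends RuntimeException {
        NotLeaderOrFollowerException(String message) {
            super(message);
        }
    }

    // Hypothetical per-partition operation: throws for odd partitions
    // to simulate partitions this broker does not lead.
    static long fetchOffset(int partition) {
        if (partition % 2 == 1) {
            throw new NotLeaderOrFollowerException("not leader for partition " + partition);
        }
        return partition * 10L;
    }

    // Catch-and-ignore: only NotLeaderOrFollowerException is swallowed;
    // any other exception still propagates and fails the run.
    static List<Long> fetchAll(int partitionCount) {
        List<Long> results = new ArrayList<>();
        for (int p = 0; p < partitionCount; p++) {
            try {
                results.add(fetchOffset(p));
            } catch (NotLeaderOrFollowerException e) {
                // Deliberately ignored, mirroring the proposed fix.
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(4)); // prints [0, 20]
    }
}
```

Catching the narrowest exception type keeps unrelated errors visible; it does not answer whether the exception should occur in the benchmark setup in the first place.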

Local benchmark results:

./jmh-benchmarks/jmh.sh ReplicaFetcherThreadBenchmark
running gradlew :jmh-benchmarks:clean :jmh-benchmarks:shadowJar

> Configure project :
Starting build with version 4.0.0-SNAPSHOT (commit id 94c7ede7) using Gradle 8.10, Java 17 and Scala 2.13.15
Build properties: ignoreFailures=false, maxParallelForks=6, maxScalacThreads=6, maxTestRetries=0

Deprecated Gradle features were used in this build, making it incompatible with Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.

For more on this, please refer to https://docs.gradle.org/8.10/userguide/command_line_interface.html#sec:command_line_warnings in the Gradle documentation.

BUILD SUCCESSFUL in 22s
96 actionable tasks: 23 executed, 73 up-to-date
gradle build done
running JMH with args: ReplicaFetcherThreadBenchmark
# JMH version: 1.37
# VM version: JDK 17.0.12, OpenJDK 64-Bit Server VM, 17.0.12+7-Ubuntu-1ubuntu222.04
# VM invoker: /usr/lib/jvm/java-17-openjdk-amd64/bin/java
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 15 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.kafka.jmh.fetcher.ReplicaFetcherThreadBenchmark.testFetcher
# Parameters: (partitionCount = 100)

# Run progress: 0,00% complete, ETA 00:13:20
# Fork: 1 of 1
# Warmup Iteration   1: [2024-10-10 13:36:57,811] WARN The new 'consumer' rebalance protocol is only supported in KRaft cluster with the new group coordinator. (kafka.server.KafkaConfig:70)
OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
1929,906 ns/op
# Warmup Iteration   2: 1860,040 ns/op
# Warmup Iteration   3: 1879,765 ns/op
# Warmup Iteration   4: 1884,042 ns/op
# Warmup Iteration   5: 1875,712 ns/op
Iteration   1: 1877,666 ns/op
Iteration   2: 1885,357 ns/op
Iteration   3: 1876,356 ns/op
Iteration   4: 1874,775 ns/op
Iteration   5: 1875,129 ns/op
Iteration   6: 1872,721 ns/op
Iteration   7: 1876,337 ns/op
Iteration   8: 1890,266 ns/op
Iteration   9: 1870,369 ns/op
Iteration  10: 1885,525 ns/op
Iteration  11: 1989,414 ns/op
Iteration  12: 1912,892 ns/op
Iteration  13: 1922,298 ns/op
Iteration  14: 1902,687 ns/op
Iteration  15: 1906,352 ns/op


Result "org.apache.kafka.jmh.fetcher.ReplicaFetcherThreadBenchmark.testFetcher":
  1894,543 ±(99.9%) 32,917 ns/op [Average]
  (min, avg, max) = (1870,369, 1894,543, 1989,414), stdev = 30,790
  CI (99.9%): [1861,626, 1927,460] (assumes normal distribution)


# JMH version: 1.37
# VM version: JDK 17.0.12, OpenJDK 64-Bit Server VM, 17.0.12+7-Ubuntu-1ubuntu222.04
# VM invoker: /usr/lib/jvm/java-17-openjdk-amd64/bin/java
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 15 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.kafka.jmh.fetcher.ReplicaFetcherThreadBenchmark.testFetcher
# Parameters: (partitionCount = 500)

# Run progress: 25,00% complete, ETA 00:10:12
# Fork: 1 of 1
# Warmup Iteration   1: [2024-10-10 13:40:22,069] WARN The new 'consumer' rebalance protocol is only supported in KRaft cluster with the new group coordinator. (kafka.server.KafkaConfig:70)
OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
8464,782 ns/op
# Warmup Iteration   2: 8192,703 ns/op
# Warmup Iteration   3: 8162,707 ns/op
# Warmup Iteration   4: 8122,797 ns/op
# Warmup Iteration   5: 8169,713 ns/op
Iteration   1: 8057,133 ns/op
Iteration   2: 8053,061 ns/op
Iteration   3: 8077,125 ns/op
Iteration   4: 8039,068 ns/op
Iteration   5: 8024,524 ns/op
Iteration   6: 8035,134 ns/op
Iteration   7: 8013,353 ns/op
Iteration   8: 8018,225 ns/op
Iteration   9: 8021,750 ns/op
Iteration  10: 8053,567 ns/op
Iteration  11: 8047,978 ns/op
Iteration  12: 8515,976 ns/op
Iteration  13: 8523,523 ns/op
Iteration  14: 8521,076 ns/op
Iteration  15: 8524,231 ns/op


Result "org.apache.kafka.jmh.fetcher.ReplicaFetcherThreadBenchmark.testFetcher":
  8168,382 ±(99.9%) 236,112 ns/op [Average]
  (min, avg, max) = (8013,353, 8168,382, 8524,231), stdev = 220,860
  CI (99.9%): [7932,269, 8404,494] (assumes normal distribution)


# JMH version: 1.37
# VM version: JDK 17.0.12, OpenJDK 64-Bit Server VM, 17.0.12+7-Ubuntu-1ubuntu222.04
# VM invoker: /usr/lib/jvm/java-17-openjdk-amd64/bin/java
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 15 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.kafka.jmh.fetcher.ReplicaFetcherThreadBenchmark.testFetcher
# Parameters: (partitionCount = 1000)

# Run progress: 50,00% complete, ETA 00:06:56
# Fork: 1 of 1
# Warmup Iteration   1: [2024-10-10 13:43:53,904] WARN The new 'consumer' rebalance protocol is only supported in KRaft cluster with the new group coordinator. (kafka.server.KafkaConfig:70)
OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
16887,223 ns/op
# Warmup Iteration   2: 16481,102 ns/op
# Warmup Iteration   3: 16141,360 ns/op
# Warmup Iteration   4: 16114,730 ns/op
# Warmup Iteration   5: 16072,493 ns/op
Iteration   1: 15944,404 ns/op
Iteration   2: 16098,280 ns/op
Iteration   3: 15944,495 ns/op
Iteration   4: 16056,134 ns/op
Iteration   5: 15999,214 ns/op
Iteration   6: 16086,102 ns/op
Iteration   7: 16064,142 ns/op
Iteration   8: 16058,817 ns/op
Iteration   9: 16059,667 ns/op
Iteration  10: 16082,960 ns/op
Iteration  11: 16037,771 ns/op
Iteration  12: 15971,635 ns/op
Iteration  13: 15983,740 ns/op
Iteration  14: 15946,546 ns/op
Iteration  15: 16033,504 ns/op


Result "org.apache.kafka.jmh.fetcher.ReplicaFetcherThreadBenchmark.testFetcher":
  16024,494 ±(99.9%) 58,477 ns/op [Average]
  (min, avg, max) = (15944,404, 16024,494, 16098,280), stdev = 54,699
  CI (99.9%): [15966,017, 16082,971] (assumes normal distribution)


# JMH version: 1.37
# VM version: JDK 17.0.12, OpenJDK 64-Bit Server VM, 17.0.12+7-Ubuntu-1ubuntu222.04
# VM invoker: /usr/lib/jvm/java-17-openjdk-amd64/bin/java
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 15 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.kafka.jmh.fetcher.ReplicaFetcherThreadBenchmark.testFetcher
# Parameters: (partitionCount = 5000)

# Run progress: 75,00% complete, ETA 00:03:32
# Fork: 1 of 1
# Warmup Iteration   1: [2024-10-10 13:47:35,385] WARN The new 'consumer' rebalance protocol is only supported in KRaft cluster with the new group coordinator. (kafka.server.KafkaConfig:70)
OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
90793,927 ns/op
# Warmup Iteration   2: 87362,350 ns/op
# Warmup Iteration   3: 86543,760 ns/op
# Warmup Iteration   4: 86226,549 ns/op
# Warmup Iteration   5: 87073,001 ns/op
Iteration   1: 87169,335 ns/op
Iteration   2: 87936,322 ns/op
Iteration   3: 87299,357 ns/op
Iteration   4: 88002,498 ns/op
Iteration   5: 86985,872 ns/op
Iteration   6: 87270,992 ns/op
Iteration   7: 86482,991 ns/op
Iteration   8: 86770,086 ns/op
Iteration   9: 85933,541 ns/op
Iteration  10: 85948,712 ns/op
Iteration  11: 87414,002 ns/op
Iteration  12: 87037,239 ns/op
Iteration  13: 87365,601 ns/op
Iteration  14: 87367,632 ns/op
Iteration  15: 87196,675 ns/op


Result "org.apache.kafka.jmh.fetcher.ReplicaFetcherThreadBenchmark.testFetcher":
  87078,724 ±(99.9%) 640,395 ns/op [Average]
  (min, avg, max) = (85933,541, 87078,724, 88002,498), stdev = 599,026
  CI (99.9%): [86438,329, 87719,119] (assumes normal distribution)


# Run complete. Total time: 00:16:42

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

NOTE: Current JVM experimentally supports Compiler Blackholes, and they are in use. Please exercise
extra caution when trusting the results, look into the generated code to check the benchmark still
works, and factor in a small probability of new VM bugs. Additionally, while comparisons between
different JVMs are already problematic, the performance difference caused by different Blackhole
modes can be very significant. Please make sure you use the consistent Blackhole mode for comparisons.

Benchmark                                  (partitionCount)  Mode  Cnt      Score     Error  Units
ReplicaFetcherThreadBenchmark.testFetcher               100  avgt   15   1894,543 ±  32,917  ns/op
ReplicaFetcherThreadBenchmark.testFetcher               500  avgt   15   8168,382 ± 236,112  ns/op
ReplicaFetcherThreadBenchmark.testFetcher              1000  avgt   15  16024,494 ±  58,477  ns/op
ReplicaFetcherThreadBenchmark.testFetcher              5000  avgt   15  87078,724 ± 640,395  ns/op
JMH benchmarks done
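To see how the fetcher cost scales with `partitionCount`, the average scores from the summary table can be normalized per partition. The sketch below only does arithmetic on the reported numbers (written with a dot decimal separator instead of the locale commas in the JMH output); it does not rerun the benchmark.

```java
import java.util.Locale;

public class PerPartitionCost {

    // Average time per operation divided by the number of partitions fetched.
    static double nsPerPartition(double scoreNs, int partitionCount) {
        return scoreNs / partitionCount;
    }

    public static void main(String[] args) {
        int[] partitionCounts = {100, 500, 1000, 5000};
        // Average scores in ns/op, copied from the JMH summary table.
        double[] scoresNs = {1894.543, 8168.382, 16024.494, 87078.724};
        for (int i = 0; i < partitionCounts.length; i++) {
            // Locale.ROOT forces a '.' decimal separator regardless of the system locale.
            System.out.println(String.format(Locale.ROOT, "%5d partitions: %.1f ns per partition",
                    partitionCounts[i], nsPerPartition(scoresNs[i], partitionCounts[i])));
        }
    }
}
```

This gives roughly 18.9, 16.3, 16.0 and 17.4 ns per partition, so the cost per fetch grows approximately linearly with the partition count in this local run.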

Committer Checklist (excluded from commit message)

  • [ ] Verify design and implementation
  • [ ] Verify test coverage and CI build status
  • [ ] Verify documentation (including upgrade notes)

wernerdv · Oct 10 '24 08:10

@mimaison @chia7712 Could you please tell me whether ignoring the exception is correct, or whether we need to investigate its root cause?

wernerdv · Oct 10 '24 09:10

@mimaison @chia7712 Could you please review this and share your feedback?

wernerdv · Oct 17 '24 17:10