kotlinx-benchmark

Gradle tasks executing benchmarks succeed even if some benchmarks fail

Open fzhinkin opened this issue 1 year ago • 0 comments

Currently, Gradle tasks executing benchmarks don't fail if some benchmarks fail. That might not be a problem when benchmarks are executed within the IDE, as the failure status is reported explicitly. In other scenarios, however, failures may go unnoticed: the generated reports contain no hint of a failure, and the only way to find out that something went wrong is to inspect the logs.

For example, if benchmarks are executed in CI then, most likely, nobody will check the logs until a build fails. Since the benchmarking task succeeds in any case and still produces a report containing every benchmark except the failed one, it may take a long time before somebody notices the failure.

Here's a reproducer: https://github.com/fzhinkin/kotlinx-benchmark-success-on-benchmark-failure

./gradlew benchmark
> Task :jvmBenchmark
...
<failure>

java.lang.RuntimeException
        at org.example.FaultyBenchmark.thisOneIsNoBetter(FaultyBenchmark.kt:14)
        at org.example.generated.FaultyBenchmark_thisOneIsNoBetter_jmhTest.thisOneIsNoBetter_thrpt_jmhStub(FaultyBenchmark_thisOneIsNoBetter_jmhTest.java:121)
        at org.example.generated.FaultyBenchmark_thisOneIsNoBetter_jmhTest.thisOneIsNoBetter_Throughput(FaultyBenchmark_thisOneIsNoBetter_jmhTest.java:83)
...
> Task :macosArm64Benchmark
...
… org.example.FaultyBenchmark.faulty
  EXCEPTION: kotlin.RuntimeException
0   macosArm64Benchmark.kexe            0x102b6fc73        kfun:org.example.FaultyBenchmark#faulty(){} + 99 
1   macosArm64Benchmark.kexe            0x102b71edb        kfun:kotlinx.benchmark.generated.org.example.FaultyBenchmark_Descriptor.$faulty$FUNCTION_REFERENCE$5.invoke#internal + 23 
...
BUILD SUCCESSFUL in 4m 25s

The build is successful, and the reports contain some results (there's one non-failing benchmark in the demo project), so without inspecting the logs it's hard to detect failures. Even with the logs at hand, one may conclude that everything is fine because the task succeeded.

I suggest failing the Gradle tasks if at least one benchmark fails.
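
As a stopgap for CI, the captured task output could be scanned for the failure markers visible in the log above. Here's a minimal sketch, not part of kotlinx-benchmark: it assumes the Gradle output was saved to a file, and that `<failure>` (JVM target) and `EXCEPTION:` (native target) are the only markers that matter. Other targets or plugin versions may print something different. The function names `find_benchmark_failures` and `check_log` are made up for illustration.

```python
import re

# Markers copied from the log excerpt above. This list is an assumption:
# the JVM run printed "<failure>", the native run printed "EXCEPTION: <class>".
FAILURE_MARKERS = re.compile(r"<failure>|^\s*EXCEPTION:")


def find_benchmark_failures(log_text):
    """Return the (stripped) log lines that look like benchmark failures."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if FAILURE_MARKERS.search(line)
    ]


def check_log(path):
    """Return 1 if the saved Gradle log contains failure markers, else 0."""
    with open(path, encoding="utf-8", errors="replace") as log_file:
        failures = find_benchmark_failures(log_file.read())
    for line in failures:
        print(f"benchmark failure marker: {line}")
    return 1 if failures else 0
```

A CI step could save the output with something like `./gradlew benchmark 2>&1 | tee benchmark.log` and then exit with `sys.exit(check_log("benchmark.log"))`. This only works around the problem; matching on log text is brittle, which is why the tasks should fail on their own.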

fzhinkin, Feb 06 '24 10:02