kotlinx-benchmark
Gradle tasks executing benchmarks succeed even if some benchmarks fail
Currently, the Gradle tasks that execute benchmarks don't fail when some benchmarks fail. That may not be a problem when benchmarks are run from the IDE, where the failure status is reported explicitly. In other scenarios, failures can go unnoticed: the generated reports contain no hint of a failure, and the only way to find out that something went wrong is to inspect the logs.
For example, when benchmarks run in CI, most likely nobody checks the logs unless the build fails. Since the benchmarking task succeeds in any case and the report still contains every benchmark except the failed one, it may take a long time before anybody notices the failure.
Here's a reproducer: https://github.com/fzhinkin/kotlinx-benchmark-success-on-benchmark-failure
./gradlew benchmark
> Task :jvmBenchmark
...
<failure>
java.lang.RuntimeException
at org.example.FaultyBenchmark.thisOneIsNoBetter(FaultyBenchmark.kt:14)
at org.example.generated.FaultyBenchmark_thisOneIsNoBetter_jmhTest.thisOneIsNoBetter_thrpt_jmhStub(FaultyBenchmark_thisOneIsNoBetter_jmhTest.java:121)
at org.example.generated.FaultyBenchmark_thisOneIsNoBetter_jmhTest.thisOneIsNoBetter_Throughput(FaultyBenchmark_thisOneIsNoBetter_jmhTest.java:83)
...
> Task :macosArm64Benchmark
...
… org.example.FaultyBenchmark.faulty
EXCEPTION: kotlin.RuntimeException
0 macosArm64Benchmark.kexe 0x102b6fc73 kfun:org.example.FaultyBenchmark#faulty(){} + 99
1 macosArm64Benchmark.kexe 0x102b71edb kfun:kotlinx.benchmark.generated.org.example.FaultyBenchmark_Descriptor.$faulty$FUNCTION_REFERENCE$5.invoke#internal + 23
...
BUILD SUCCESSFUL in 4m 25s
The build is successful, and the reports contain some results (there's one non-failing benchmark in the demo project), so it's hard to detect failures without inspecting the logs. And even with the logs at hand, one may decide that everything is fine because the task succeeded.
I suggest making the Gradle tasks fail if at least one benchmark fails.
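Until the tasks report failures themselves, a CI job could scan the captured build output as a stopgap. The sketch below is a rough illustration only: it assumes the `<failure>` and `EXCEPTION:` markers seen in the log excerpt above reliably indicate a failed benchmark, which is an assumption based on that excerpt and not a documented output format. The function names `find_failures` and `exit_code_for` are hypothetical.

```python
# Markers copied from the log excerpt in this issue. Treating them as reliable
# failure indicators is an assumption, not a documented kotlinx-benchmark format.
FAILURE_MARKERS = ("<failure>", "EXCEPTION:")


def find_failures(log_text):
    """Return (line_number, line) pairs for log lines containing a failure marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        stripped = line.strip()
        if any(marker in stripped for marker in FAILURE_MARKERS):
            hits.append((number, stripped))
    return hits


def exit_code_for(log_text):
    """Return 1 if the log contains at least one failure marker, otherwise 0."""
    return 1 if find_failures(log_text) else 0
```

A CI step could save the output of `./gradlew benchmark` to a file, pass its contents to `exit_code_for`, and use the result as the step's exit status, so that a failed benchmark fails the job even though the build reports success.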