Bug: the LLVM evaluation is broken
For quite a while now, the LLVM Evaluation CI job seems to have been broken and has been posting logs like the following to PRs:
```
bv_decide solved 0 theorems.
bitwuzla solved 0 theorems.
bv_decide found 0 counterexamples.
bitwuzla found 0 counterexamples.
bv_decide only failed on 0 problems.
bitwuzla only failed on 0 problems.
both bitwuzla and bv_decide failed on 0 problems.
In total, bitwuzla saw 0 problems.
In total, bv_decide saw 0 problems.
ran rg 'LeanSAT provided a counter' | wc -l, this file found 0, rg found 0, SUCCESS
ran rg 'Bitwuzla provided a counter' | wc -l, this file found 0, rg found 0, SUCCESS
ran rg 'LeanSAT proved' | wc -l, this file found 0, rg found 0, SUCCESS
ran rg 'Bitwuzla proved' | wc -l, this file found 0, rg found 0, SUCCESS
The InstCombine benchmark contains 4520 theorems in total.
Saved dataframe at: /home/runner/work/lean-mlir/lean-mlir/bv-evaluation/raw-data/InstCombine/instcombine_ceg_data.csv
all_files_solved_bitwuzla_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_rw_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_bb_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_sat_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_lratt_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_lratc_times_stddev avg: nan | stddev: nan
mean of percentage stddev/av: nan%
```
I looked into this a little bit and noticed that the files run in the LLVM evaluation test suite now all contain `extract_goals`, which I suspect is related. @bollu, I presume you put those there; could you have a look?
More broadly, I think posting these logs as comments on PRs just adds noise to the discussion and makes it harder to follow, and the fact that this has been broken for quite a while without anyone complaining too loudly suggests (to me) that the log-posting approach doesn't work well. I previously changed the AliveStatements test so that, instead of posting the number of solved goals, it raises a CI failure when that number deviates from the expected value, and I would love for us to do something similar here. It's a bit trickier in this case because a certain percentage of the problems time out, but a few options come to mind:
- set a low time-out so results are less sensitive to machine differences,
- split the dataset and only run those files in CI which we always expect to be solved, or
- have the checking code account for noisy results and only fail CI when the values fall too far outside an expected range (see the sketch below).
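To illustrate the last option, here is a minimal sketch of what such a check could look like. The column names (`bv_decide_solved`, `bitwuzla_solved`), the expected counts, and the tolerance are all made up; a real version would read whatever the evaluation already writes to `instcombine_ceg_data.csv` and use baselines we agree on.

```python
#!/usr/bin/env python3
# Sketch of a tolerance-based CI check (hypothetical column names, counts, and
# thresholds; the real numbers would come from the CSV the evaluation writes).
import sys
import pandas as pd

# Expected number of solved problems per solver, plus the relative drift we
# tolerate before failing CI (to absorb machine differences and flaky time-outs).
EXPECTED_SOLVED = {"bv_decide": 2000, "bitwuzla": 2100}  # placeholder values
TOLERANCE = 0.05  # 5%

def main(csv_path: str) -> int:
    df = pd.read_csv(csv_path)
    failures = []
    for solver, expected in EXPECTED_SOLVED.items():
        # Assumes the CSV has a per-solver column marking solved problems.
        solved = int(df[f"{solver}_solved"].sum())
        lo = expected * (1 - TOLERANCE)
        hi = expected * (1 + TOLERANCE)
        if not (lo <= solved <= hi):
            failures.append(
                f"{solver}: solved {solved}, expected {expected} ±{TOLERANCE:.0%}"
            )
    if failures:
        print("Solved counts outside expected range:\n" + "\n".join(failures))
        return 1
    print("Solved counts within expected range.")
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```

This would replace the PR comment entirely: CI stays green and silent as long as the counts stay within the band, and fails loudly when something like the current regression (everything at 0) happens.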
I would love to refactor these tests, but that first needs some consensus that we are willing to rework the InstCombine test suite. Given that consensus, I am happy to do the refactoring.
I also wonder how much of this we want to keep running at all; it may make sense to drop tests from CI where the cost/benefit ratio no longer adds up.