feat: `#evaluation in ...` command for easier evaluation
This PR adds an `#evaluation in $cmd` command, similar to `#guard_msgs in $cmd`, which collects all messages generated by an arbitrary command and serializes them into a JSON object. The idea is that this will be a building block for any kind of evaluation we'd like to do, so we can easily have existing tactics output JSONL without modifying those tactics.
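For illustration, a usage sketch (the syntax mirrors `#guard_msgs in`; the exact JSON shape shown in the comment is my assumption, not the implemented format):

```lean
-- Wrap any command; every message it produces is collected and
-- serialized into a single JSON object, i.e. one line of JSONL.
#evaluation in
#eval 1 + 1
-- conceivably emits something like:
-- {"messages": [{"severity": "information", "data": "2"}]}
```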
I was originally planning to use this to convert the existing LLVM evaluation to JSONL, but it seems that evaluation has fundamentally changed. @bollu is the LLVM evaluation superseded by your evaluation?
TODO:
- [ ] add a CI job which builds & tests `EvaluationHarness`
@bollu I added the options you mentioned; could you have a look to see whether this is usable in its current form, or whether we need more?
```
bv_decide solved 0 theorems.
bitwuzla solved 0 theorems.
bv_decide found 0 counterexamples.
bitwuzla found 0 counterexamples.
bv_decide only failed on 0 problems.
bitwuzla only failed on 0 problems.
both bitwuzla and bv_decide failed on 0 problems.
In total, bitwuzla saw 0 problems.
In total, bv_decide saw 0 problems.
ran rg 'LeanSAT provided a counter' | wc -l, this file found 0, rg found 0, SUCCESS
ran rg 'Bitwuzla provided a counter' | wc -l, this file found 0, rg found 0, SUCCESS
ran rg 'LeanSAT proved' | wc -l, this file found 0, rg found 0, SUCCESS
ran rg 'Bitwuzla proved' | wc -l, this file found 0, rg found 0, SUCCESS
The InstCombine benchmark contains 4525 theorems in total.
Saved dataframe at: /home/runner/work/lean-mlir/lean-mlir/bv-evaluation/raw-data/InstCombine/instcombine_ceg_data.csv
all_files_solved_bitwuzla_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_rw_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_bb_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_sat_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_lratt_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_lratc_times_stddev avg: nan | stddev: nan
mean of percentage stddev/av: nan%
```
@alexkeizer No, the LLVM evaluation has not been changed by me; what's in the repo is the "latest". I attempted to improve it using snakemake in #1549, but that covers only the Hacker's Delight portion, not the InstCombine portion.
So when I open, say, `gzext_proof.lean`, I see the following:
```lean
theorem test_sext_zext_thm (e : IntW 16) : sext 64 (zext 32 e) ⊑ zext 64 e := by
  simp_alive_undef
  simp_alive_ops
  simp_alive_case_bash
  simp_alive_split
  extract_goals
  all_goals sorry
```
I assumed the `extract_goals` there was your doing, and I wonder if that is what broke the evaluation CI; notice how it now just reports 0 everywhere:
```
bv_decide solved 0 theorems.
bitwuzla solved 0 theorems.
bv_decide found 0 counterexamples.
bitwuzla found 0 counterexamples.
bv_decide only failed on 0 problems.
bitwuzla only failed on 0 problems.
```
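That would be consistent with how the harness appears to count results: the `rg 'LeanSAT proved'` lines in the CI output suggest it greps the build logs, and a proof ending in `all_goals sorry` neither proves the goal nor produces a counterexample, so every counter stays at 0. A sketch of what I'd expect the generated proof scripts to end with instead (assuming `bv_decide` is the intended closing tactic, as the CI output suggests):

```lean
theorem test_sext_zext_thm (e : IntW 16) : sext 64 (zext 32 e) ⊑ zext 64 e := by
  simp_alive_undef
  simp_alive_ops
  simp_alive_case_bash
  simp_alive_split
  -- Closing each goal with `bv_decide` (rather than `extract_goals`
  -- followed by `all_goals sorry`) is what gives the harness a proof
  -- or a counterexample to count.
  all_goals bv_decide
```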