Metatheory.jl Update benchmarks to report the egraph size for each case

... and check he return value assertion only once.

This is useful to compare the behaviour of MT to egg additionally to the runtime.

Feb 18 '25 13:02 gkronber

:warning: Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 81.42%. Comparing base (32786c0) to head (641212a).

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@           Coverage Diff            @@
##           ale/3.0     #265   +/-   ##
========================================
  Coverage    81.42%   81.42%           
========================================
  Files           19       19           
  Lines         1491     1491           
========================================
  Hits          1214     1214           
  Misses         277      277

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Feb 18 '25 14:02 codecov-commenter

Benchmark Results

	egg-sym	egg-cust	MT@641212a8e55...	MT@32786c06710...	egg-sym/MT@641...	egg-cust/MT@64...	MT@32786c06710...
egraph_addexpr	1.45 ms		6.16 ms	6.33 ms	0.236		1.03
basic_maths_simpl2	14 ms	5.16 ms	25.8 ms	25.6 ms	0.543	0.2	0.992
prop_logic_freges_theorem	2.55 ms	1.56 ms	2.43 ms	2.41 ms	1.05	0.642	0.991
calc_logic_demorgan	60.8 μs	34.4 μs	81.2 μs	80.2 μs	0.748	0.423	0.988
calc_logic_freges_theorem	24 ms	11.6 ms	49 ms	47.3 ms	0.49	0.237	0.964
basic_maths_simpl1	6.51 ms	2.89 ms	4.93 ms	4.89 ms	1.32	0.587	0.992
egraph_constructor	0.0846 μs		0.106 μs	0.11 μs	0.8		1.04
prop_logic_prove1	37.5 ms	14.6 ms	45.4 ms	43 ms	0.827	0.321	0.949
prop_logic_demorgan	79.9 μs	45.5 μs	98.8 μs	95.5 μs	0.809	0.46	0.967
while_superinterpreter_while_10			19.1 ms	19.9 ms			1.04
prop_logic_rewrite			117 μs	114 μs			0.973
time_to_load			132 ms	128 ms			0.971

Benchmark Plots

A plot of the benchmark results have been uploaded as an artifact to the workflow run for this PR. Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

Feb 18 '25 15:02 github-actions[bot]

Using the updated scripts from https://github.com/nmheim/egg-benchmark/pull/10 this should also produce the egraph size. Probably, we still need to update the CI script to extract the second table.

It should look similar to the following. Values in the second table are (num_classes, num_nodes, length(memo)). Edit: updated for 641212a (after fixing inconsistencies in the benchmarks)

|                                 | egg-sym | egg-cust | MT@641212a | MT@ale/3.0 | egg-sym/MT@641... | egg-cust/MT@64... | MT@ale/3.0/MT@... |
|:--------------------------------|:-------:|:--------:|:----------:|:----------:|:-----------------:|:-----------------:|:-----------------:|
| egraph_addexpr                  | 1.62 ms |          | 5.38 ms    | 5.05 ms    | 0.3               |                   | 0.937             |
| basic_maths_simpl2              | 12 ms   | 5.05 ms  | 27.5 ms    | 24.7 ms    | 0.438             | 0.183             | 0.898             |
| prop_logic_freges_theorem       | 2.56 ms | 1.61 ms  | 2.42 ms    | 2.31 ms    | 1.06              | 0.664             | 0.955             |
| calc_logic_demorgan             | 64.1 μs | 36.8 μs  | 80.2 μs    | 78.8 μs    | 0.799             | 0.458             | 0.983             |
| calc_logic_freges_theorem       | 23 ms   | 11.7 ms  | 44.3 ms    | 42.3 ms    | 0.518             | 0.263             | 0.953             |
| basic_maths_simpl1              | 5.77 ms | 2.9 ms   | 4.99 ms    | 4.6 ms     | 1.16              | 0.582             | 0.922             |
| egraph_constructor              | 0.13 μs |          | 0.1 μs     | 0.1 μs     | 1.3               |                   | 1                 |
| prop_logic_prove1               | 38 ms   | 14.6 ms  | 44.7 ms    | 39 ms      | 0.851             | 0.327             | 0.873             |
| prop_logic_demorgan             | 89.9 μs | 49.7 μs  | 98.8 μs    | 97 μs      | 0.91              | 0.503             | 0.982             |
| while_superinterpreter_while_10 |         |          | 19.7 ms    | 18.8 ms    |                   |                   | 0.954             |
| prop_logic_rewrite              |         |          | 131 μs     | 116 μs     |                   |                   | 0.891             |
| time_to_load                    |         |          | 121 ms     | 111 ms     |                   |                   | 0.92              |

[ Info: Saving table at output.md
|                                 | egg-sym          | egg-cust         | MT@641212a       | MT@ale/3.0       | egg-sym/MT@641... | egg-cust/MT@64... | MT@ale/3.0/MT@... |
|:--------------------------------|:----------------:|:----------------:|:----------------:|:----------------:|:-----------------:|:-----------------:|:-----------------:|
| egraph_addexpr                  | 6762 6762 6762   |                  | 6663 6663 6663   | 6640 6640 6640   |                   |                   |                   |
| basic_maths_simpl2              | 440 2235 2836    | 440 2235 2839    | 7092 14530 15001 | 7092 14530 15001 |                   |                   |                   |
| prop_logic_freges_theorem       | 316 1197 2315    | 316 1197 2322    | 297 899 1313     | 297 899 1313     |                   |                   |                   |
| calc_logic_demorgan             | 16 33 35         | 16 33 35         | 15 30 32         | 15 30 32         |                   |                   |                   |
| calc_logic_freges_theorem       | 1072 4289 17394  | 1072 4289 17280  | 1527 5402 18092  | 1527 5402 18092  |                   |                   |                   |
| basic_maths_simpl1              | 368 1910 2543    | 368 1910 2567    | 528 1815 3042    | 528 1815 3042    |                   |                   |                   |
| egraph_constructor              |                  |                  |                  |                  |                   |                   |                   |
| prop_logic_prove1               | 5510 17371 27976 | 4668 13644 18522 | 3061 9176 15637  | 3061 9176 15637  |                   |                   |                   |
| prop_logic_demorgan             | 16 35 42         | 16 35 42         | 15 30 33         | 15 30 33         |                   |                   |                   |
| while_superinterpreter_while_10 |                  |                  | 39 205 499       | 39 205 499       |                   |                   |                   |
| prop_logic_rewrite              |                  |                  |                  |                  |                   |                   |                   |
| time_to_load                    |                  |                  |                  |                  |                   |                   |                   |

~One issue that I couldn't resolve yet, is that for three benchmarks we get a reported size for ale/3.0. This cannot be correct.~ Edit: misunderstanding on my side, we are using simplify from the new benchmark.jl script, and therefore also get size results for other branches.

~I also need to double check whether the hyperparameters and benchmark implementation for egg and MT are equal.~ Edit: done

Feb 18 '25 16:02 gkronber

Nice, thanks a lot! Can we keep the old prove interface by renaming the function that returns also the egraph to something like saturate_prove, then have prove to check if the condition is met and return just the boolean? Also, for the changes to be picked up by the Github CI, we need to merge the benchmarks.jl file into master. I will do it after 3.0 merge

Feb 18 '25 21:02 0x0f0f0f

Can we keep the old prove interface by renaming the function that returns also the egraph to something like saturate_prove, then have prove to check if the condition is met and return just the boolean?

Yes, separating the two prove functions (for tests and benchmarks) is better because then we can always use the prove function from benchmarks.jl to calculate sizes even for other branches.

Feb 19 '25 07:02 gkronber

Can we keep the old prove interface by renaming the function that returns also the egraph to something like saturate_prove, then have prove to check if the condition is met and return just the boolean?

Yes, separating the two prove functions (for tests and benchmarks) is better because then we can always use the prove function from benchmarks.jl to calculate sizes even for other branches.

I think they should call each other (prove should call prove_saturate and return the bool)

Feb 19 '25 07:02 0x0f0f0f

I think they should call each other (prove should call prove_saturate and return the bool)

I tried at first, but it is problematic because the benchmarks.jl script is taken from the reference revision and examples/prove.jl is taken from the benchmarked branch. I think it is better to say that benchmarks.jl has its own prove which is then the same for all branches. (We would also need to move EGraphSize and its methods into examples/prove.jl.)

Feb 19 '25 08:02 gkronber

@gkronber is this ready to merge?

Feb 24 '25 19:02 0x0f0f0f

@0x0f0f0f yes

Feb 25 '25 07:02 gkronber

Metatheory.jl Metatheory.jl copied to clipboard

Update benchmarks to report the egraph size for each case

Codecov Report

Benchmark Results

Benchmark Plots

Metatheory.jl
Metatheory.jl copied to clipboard