benchmark `use_ite_for_select`
In `interpret.ml`, it is set to `true` for now; if both settings prove useful, a CLI flag could be offered.
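For context, the option roughly chooses between encoding the Wasm `select` instruction as a single solver ite term and forking the execution on the condition (at least that is the assumption behind the sketch below); the code is illustrative only, not Owi's actual implementation:

```ocaml
(* Illustrative sketch only, not Owi's actual code. *)
let use_ite_for_select = ref true

type sym_expr =
  | Concrete of int32
  | Sym of string
  | Ite of sym_expr * sym_expr * sym_expr (* condition, then, else *)

(* [select ~fork cond v_then v_else] models a symbolic [select]:
   either delegate the choice to the solver as an ite term (one path),
   or fork the symbolic execution on the condition (two paths). *)
let select ~fork cond v_then v_else =
  if !use_ite_for_select then Ite (cond, v_then, v_else)
  else fork cond ~if_true:v_then ~if_false:v_else
```

The rough trade-off is bigger formulas for the solver versus more paths to explore, which may be why neither setting clearly dominates in the results below.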
So I ran some benchmarks (Z3, 30-second timeout, at commit d6fabb2) and here are the results. Current version (`use_ite_for_select` set to `true`):
| Nothing | Reached | Timeout | Other | Killed | Total |
|:-------:|:-------:|:-------:|:-----:|:------:|:-----:|
| 1 | 672 | 501 | 37 | 0 | 1211 |
Time stats (in seconds):
| Total | Mean | Median | Min | Max |
|:-----:|:----:|:------:|:---:|:---:|
| 16808.30 | 13.88 | 4.01 | 0.23 | 30.00 |
With `use_ite_for_select` set to `false`:
| Nothing | Reached | Timeout | Other | Killed | Total |
|:-------:|:-------:|:-------:|:-----:|:------:|:-----:|
| 1 | 671 | 503 | 36 | 0 | 1211 |
Time stats (in seconds):
| Total | Mean | Median | Min | Max |
|:-----:|:----:|:------:|:---:|:---:|
| 16755.74 | 13.84 | 3.87 | 0.23 | 30.00 |
So the second configuration is marginally faster, but there are some differences in which goals are solved, for example (first result with `use_ite_for_select` set to `true`, second with it set to `false`):
- testcomp/sv-benchmarks/c/eca-rers2012/Problem18_label00.c solved in 13s & timeout.
- testcomp/sv-benchmarks/c/hardware-verification-bv/btor2c-lazyMod.adding.2.prop1-back-serstep.c solved in 10s & timeout.
- testcomp/sv-benchmarks/c/seq-mthreaded/pals_lcr.5_overflow.ufo.UNBOUNDED.pals.c solved in 20s & timeout.
- testcomp/sv-benchmarks/c/loops/insertion_sort-1-2.c solved in 7s & timeout.
- testcomp/sv-benchmarks/c/loops/invert_string-1.c timeout & solved in 0.5s.
- testcomp/sv-benchmarks/c/hardware-verification-bv/btor2c-lazyMod.adding.6.prop1-back-serstep.c timeout & solved in 16s.
There are also tests in which one configuration reports `other` while the other one does not:
- testcomp/sv-benchmarks/c/hardware-verification-bv/btor2c-lazyMod.arbitrated_top_n3_w16_d128_e0.c timeout & other.
- testcomp/sv-benchmarks/c/hardware-verification-bv/btor2c-lazyMod.arbitrated_top_n3_w8_d128_e0.c timeout & other.
- testcomp/sv-benchmarks/c/nla-digbench-scaling/ps4-ll_unwindbound100.c other & solved in 1s
- testcomp/sv-benchmarks/c/nla-digbench-scaling/ps5-ll_unwindbound100.c other & solved in 1s
- testcomp/sv-benchmarks/c/product-lines/email_spec11_productSimulator.cil.c other & solved in 3.5s
There is a slight complementarity between the two. I don't know if it's worth adding a flag for it; it probably needs a more thorough investigation (and an understanding of why these differences appear). But I can add the flag, what do you think @redianthus? Note: I can't reproduce locally (with 1 worker) the results I got from symbocalype; I'm not sure whether the differences come from parallelism or from some particular settings/command-line options that change the behavior of Owi.
Yes, I think adding the flag is worth it.
To get a better idea of the results, could you run this tool on the two result files you got? In particular, I am interested in the execution time variation restricted to goals solved by both tools. For now I'd say we should set the option to `true` by default, but I may change my mind after seeing the diff results.
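Concretely, what I mean is time statistics restricted to the tasks reached by both runs; a minimal sketch of that computation (parsing of the result files is omitted and all names are invented):

```ocaml
(* Illustrative sketch: keep only the tasks reached by both runs.
   [r1] and [r2] are (task, time in seconds) pairs parsed from the two
   result files; the parsing itself is left out. *)
let common_times (r1 : (string * float) list) (r2 : (string * float) list) =
  List.filter_map
    (fun (task, t1) ->
      match List.assoc_opt task r2 with
      | Some t2 -> Some (task, t1, t2)
      | None -> None)
    r1

(* Mean over a list of times, e.g. the t1 (or t2) component above. *)
let mean l = List.fold_left ( +. ) 0. l /. float_of_int (List.length l)
```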
I think there are two interesting questions (to which I don't have an answer :-)):
- Could we write some heuristics to decide if `use_ite_for_select` is interesting, either on a given program or on a given `select` instruction? (A rough sketch of what such a heuristic could look like is below.)
- Could we use an `ite` in other places? (This one is probably easier to answer).
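Regarding the first question, here is a purely illustrative sketch of what a per-`select` heuristic could look like; nothing in it exists in Owi, and the expression type, `size`, and the threshold are all invented for the example:

```ocaml
(* Invented example of a per-select heuristic: use an ite term only when both
   operands are syntactically small, and fall back to forking otherwise. *)
type sym_expr =
  | Concrete of int32
  | Sym of string
  | Binop of string * sym_expr * sym_expr
  | Ite of sym_expr * sym_expr * sym_expr

let rec size = function
  | Concrete _ | Sym _ -> 1
  | Binop (_, a, b) -> 1 + size a + size b
  | Ite (c, a, b) -> 1 + size c + size a + size b

(* Arbitrary threshold: small operands keep the ite formula cheap,
   bigger ones are handled by path forking instead. *)
let use_ite_for_this_select v_then v_else = size v_then + size v_else <= 16
```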
Ok! Here's the output of the diff tool:
tool1 had 672 reached and tool2 had 671 reached
tool1 had 501 timeout and tool2 had 503 timeout
tool1 had 001 nothing and tool2 had 001 nothing
tool1 had 037 other and tool2 had 036 other
tool1 had 000 killed and tool2 had 000 killed
tools have 665 reached tasks in common
tool1 had 007 tasks tool2 did not found
tool2 had 006 tasks tool1 did not found
on commonly reached tasks, tool 1 took 1595.700267 sec. (mean 2.399549, median 0.548621, min 0.306081, max 27.683100) and tool 2 took 1575.438922 sec. (mean 2.369081, median 0.542383, min 0.321276, max 29.703400)
on *not commonly* reached tasks, tool 1 took 131.851030 sec. (mean 18.835861, median 20.340000, min 6.878930, max 28.258600) and tool 2 took 40.464773 sec. (mean 6.744129, median 9.951390, min 0.592530, max 17.833700)
among tasks reached only by tool 1, tool 2 replied 000 nothing, 007 timeout, 00 other and 00 killed
among tasks reached only by tool 2, tool 1 replied 000 nothing, 003 timeout, 03 other and 00 killed
tasks solved only by tool 1:
testcomp/sv-benchmarks/c/xcsp/CostasArray-12.c
testcomp/sv-benchmarks/c/xcsp/AllInterval-016.c
testcomp/sv-benchmarks/c/seq-mthreaded/pals_lcr.5_overflow.ufo.UNBOUNDED.pals.c
testcomp/sv-benchmarks/c/loops/insertion_sort-1-2.c
testcomp/sv-benchmarks/c/hardware-verification-bv/btor2c-lazyMod.bakery.6.prop1-back-serstep.c
testcomp/sv-benchmarks/c/hardware-verification-bv/btor2c-lazyMod.adding.2.prop1-back-serstep.c
testcomp/sv-benchmarks/c/eca-rers2012/Problem18_label00.c
tasks solved only by tool 2:
testcomp/sv-benchmarks/c/xcsp/AllInterval-013.c
testcomp/sv-benchmarks/c/product-lines/email_spec11_productSimulator.cil.c
testcomp/sv-benchmarks/c/nla-digbench-scaling/ps5-ll_unwindbound100.c
testcomp/sv-benchmarks/c/nla-digbench-scaling/ps4-ll_unwindbound100.c
testcomp/sv-benchmarks/c/loops/invert_string-1.c
testcomp/sv-benchmarks/c/hardware-verification-bv/btor2c-lazyMod.adding.6.prop1-back-serstep.c
Which one of tool1/tool2 was with ite/without ite? You can open a PR with the flag (it can just set a global variable, no need to propagate the value through all functions...). Depending on the result, we'll try to choose the best default value.
I added two other issues for the heuristics/other places and we can have a look at them later.
> Which one of tool1/tool2 was with ite/without ite?
Tool1 is the default behavior (with ite) and tool2 is without.
> You can open a PR with the flag (it can just set a global variable, no need to propagate the value through all functions...). Depending on the result, we'll try to choose the best default value.
https://github.com/OCamlPro/owi/pull/812
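For reference, a minimal sketch of the suggested approach (a global flag set once from the command line instead of being threaded through every function), assuming a Cmdliner-based CLI; the flag name and wiring are illustrative and not necessarily what the PR actually does:

```ocaml
(* Illustrative sketch, not the actual PR: one mutable flag, set once during
   CLI setup and read directly where [select] is interpreted. *)
let use_ite_for_select = ref true

(* Hypothetical Cmdliner option; the flag name is made up. *)
let use_ite_term : bool Cmdliner.Term.t =
  let doc = "Encode the select instruction as a solver ite term." in
  Cmdliner.Arg.(value & opt bool true & info [ "use-ite-for-select" ] ~doc)

(* Combined with the rest of the CLI, e.g.
   Cmdliner.Term.(const (fun b -> use_ite_for_select := b) $ use_ite_term) *)
let set_use_ite b = use_ite_for_select := b
```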
OK, so I'd say without ite is better, then.
Since the time difference between the two is not very large (although tool2 seems slightly faster) and since tool1 reaches 672 goals while tool2 reaches 671, it does not seem clear-cut to me, but you know better :)