ultimate
ultimate copied to clipboard
Performance Improvements around hashCode and equals
Consistently use reference equality for transitions, events and conditions:
- for performance reasons -- comparing sets is expensive;
- and because in practice two different instances representing "equal" transitions / events / conditions should never be created.
Also, do not use consecutive numbers (e.g. serialNumber
) as hash codes; this can cause frequent hash collisions for sets, maps or lists of transitions / events / conditions.
For the comparison I used by laptop, the materialistic benchmark set, and only differencePairwise.epf
First of all I did three runs on the same commit to demonstrate the usual fluctuation between different benchmark runs.
1st run on dev aa1133877f641a781b40cccc91a0d94af80389c9 Runtime 488s CsvAutomataOperationStatistics9983638_2022-08-26_17-40-29-914.csv
2nd run on dev aa1133877f641a781b40cccc91a0d94af80389c9 Runtime 470s CsvAutomataOperationStatistics1540219992_2022-08-26_17-51-35-185.csv
3rd run on dev aa1133877f641a781b40cccc91a0d94af80389c9 Runtime 462s CsvAutomataOperationStatistics1540219992_2022-08-26_18-04-16-041.csv
Run on 1st commit 273a740b18e21df64684e64ccbd3850467443978 Runtime 336s CsvAutomataOperationStatistics1540219992_2022-08-26_18-16-26-112.csv
Run on 2nd commit 1d81b42cc5a43d74dc5159719fd866fc795d1bb3 Runtime 339s CsvAutomataOperationStatistics1540219992_2022-08-26_18-37-56-775.csv
Run on 3rd commit fa7f963c4f30bd571ddec19f9c087b0c5ed1628c Runtime 343s CsvAutomataOperationStatistics2008907751_2022-08-26_19-36-31-697.csv
Run on 4th commit 0701f203130b18eb03d65f75545292fcbdb27292 Runtime 392s (maybe already slower) CsvAutomataOperationStatistics559559860_2022-08-26_19-52-19-260.csv
Run on 5th commit 138d689f235eb087f136badc52e3485edd6c3b58 Runtime 4045s (!!! no typo) CsvAutomataOperationStatistics1540219992_2022-08-26_21-05-25-416.csv
Run on 6th commit 95b21d5a1843a0e80fed0d417808493c3a870515 Runtime 4249s (!!! no typo) CsvAutomataOperationStatistics2008907751_2022-08-26_21-23-45-603.csv
For the comparison I used by laptop, the materialistic benchmark set, and only differencePairwise.epf
Runtime on 3e9ae16 471s (before transition hashcode bug) CsvAutomataOperationStatistics9983638_2022-08-29_21-56-18-250.csv
Runtime on 8e5ba81897 481s (fixed transition hashcode bug) CsvAutomataOperationStatistics1540219992_2022-08-29_22-12-48-267.csv
Runtime on 0034fd1df9 331s (reference equality Event) CsvAutomataOperationStatistics2008907751_2022-08-29_22-21-58-397.csv
Runtime on a69b4bd23d 333s CsvAutomataOperationStatistics1540219992_2022-08-29_22-49-16-569.csv
Runtime on 0612a5f3b2 332s CsvAutomataOperationStatistics1540219992_2022-08-29_23-03-19-951.csv
Runtime on 1a063de8a0 390s (HashJenkins for Conditions) CsvAutomataOperationStatistics903064416_2022-08-29_23-17-32-749.csv
Runtime on 01280f3148 389s CsvAutomataOperationStatistics559559860_2022-08-29_23-28-39-629.csv
Runtime on c3615b72dd 465s CsvAutomataOperationStatistics1540219992_2022-08-29_23-48-56-993.csv
Runtime on c5bc13a0fd 457s CsvAutomataOperationStatistics1540219992_2022-08-30_00-10-39-906.csv
Runtime on 3902331 446s CsvAutomataOperationStatistics1540219992_2022-08-30_01-42-59-379.csv
I am puzzled and I do not yet have an explanation.
Further analysis part 1: ordinary hashing vs. hashJenkins If I change die computation of hashcodes in 39023316d06d0419909ad6061035b121ff7c46d3 for Transition, Event, and Condition from hashJenkins to a multiplication by 17 the runtime is 463s. (roughly similar to hashJenkins)
Further analysis part 2: Are there hash collisions? I wanted to find out if using serial numbers resp. IDs avoids hash collisions in HashSets and HashMaps. I don't really know how I can do this. Newer Eclipse versions changed the presentation of variable values for Sets/Maps. In the past one could navigate through the original data structures (usually not userfriendly for debugging) now one can only see a list of all elements (usually great but disastrous for my purpose). I randomly watched some lager sets and maps while debugging. Elements are not always perfectly ordered. But sometimes the objects with hash k are at position k which, I guess, hints that there a fewer hash collisions. It's probably confirm this assumption by some experiment.
Further analysis part 3: Are hash collisions costly?
I implemented the following loop.
If DO_HASH
is set the runtime is 209s, otherwise the runtime is 54s.
private static final boolean DO_HASH = false;
public static void main(final String[] args) {
for (int i=0; i<1_000; i++) {
final HashSet<Integer> theSet = new HashSet<>();
for (int j=0; j<1_000_000; j++) {
if (DO_HASH) {
theSet.add(HashUtils.hashJenkins(17, j));
} else {
theSet.add(j);
}
}
}
System.out.println("Done.");
}
I also redid the experiment with LinkedHashSets.
If DO_HASH
is set the runtime is 276s, otherwise the runtime is 86s.
Further analysis part 4: Does the hash code has an effect on the performance in verification?
I run the Svcomp20AutomizerConcurrentSpeedBenchmarks
and compared 8e5ba81 , 0612a5f , and 3902331.
In the following file I interleaved the lines of the IncrementalLogWithBenchmarkResults. interleaved.txt
One cannot see a difference in the runtime.
The benchmark is however only helpful for finding substantial differences in the runtime for the following reasons.
- Slowest solved benchmark takes 16s
- Largest Petri net in solved benchmark has only around 3500 flow.
- Different numbers of iterations hint that different counterexamples were taken.
I ran the SV-COMP Concurrency benchmarks on commits 0612a5f3 (before hashJenkins
) and 39023316 (after hashJenkins
). The results can be found here and do show as significant a difference.
The changes between these commits do not simply introduce hashJenkins
, they also change some things slightly (somewhat bigger Condition
, hashJenkins that does not consider other fields anymore in Transition
).
This benchmark seems to be most affected.
OK, I've investigated one of the benchmarks further (weaver/parallel-min-max-1
). The hash code changes seem to affect both the number of iterations and the time spent per iteration. I tested all combinations of "serial number" hashcodes (s
) and hashJenkins (j
) for transitions, conditions and events:
transition / condition / event / result
-------------------------------------------
s / s / s 11 iterations, 29s
s / s / j 11 iterations, 31s
s / j / s >= 16 iterations, >7min
s / j / j >= 15 iterations, >4min
j / s / s 9 iterations, 15s
j / s / j 9 iterations, 26s
j / j / s 11 iterations, 33s
j / j / j 9 iterations, 85s
The different numbers of iterations seem to suggest that some code, perhaps in the unfolding & selection of counterexamples, is iterating over a HashSet
of events or conditions, perhaps? I'll debug further after lunch.
OK, one problem is that PetriNetUnfolder::constructRun(Event, ConditionMarking)
iterates over event.getPredecessorConditions()
, which is a HashSet
and thus the iteration order depends on the hash codes. As a result, different counterexamples are returned, changing the refinement iterations.
Though that is not the only problem: Even with a fix, I get different counterexamples in later iterations. In this case, the events leading to an accepting condition already have different serial numbers. I did not debug further to find out why.
For now, I'll merge this PR as-is. I'll leave it to you, @Heizmann, if you want to follow up on these issues in the future.