[Bug-Candidate]: Poor scaling performance with high worker count - 80 workers on a 96-core CPU slower than 10 workers on a 12-core Mac
Describe the issue:
Issue Summary
Echidna appears to show limited performance gains when increasing the number of workers on high-core-count systems. In our tests, running 80 workers on 96-core AMD EPYC hardware resulted in slower execution than running 10 workers on a Mac M2, despite the server offering significantly more cores and comparable single-thread performance.
Performance Results
- Local (Mac M2, 10 workers): 72.35 seconds real time, user time 8m59s
- Remote (AMD EPYC, 80 workers): 104.12 seconds real time, user time 90m18s
These results indicate that scaling is far from linear and may be limited by internal bottlenecks, potentially related to garbage collection, as noted in prior discussions with @gustavo-grieco.
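One rough way to quantify the gap: user time divided by real time approximates the average number of cores kept busy. A minimal sketch using the figures above (the helper name is illustrative; on the EPYC box the user time includes GC and spin overhead, so this is an upper bound on useful work):

```haskell
-- user time / real time ~ average number of busy cores during the run
effectiveParallelism :: Double -> Double -> Double
effectiveParallelism userSec realSec = userSec / realSec

main :: IO ()
main = do
  let mac  = effectiveParallelism 539  72.35   -- Mac M2: user 8m59s = 539 s
      epyc = effectiveParallelism 5418 104.12  -- EPYC: user 90m18s = 5418 s
  putStrLn ("Mac M2,   10 workers: ~" ++ show mac  ++ " busy cores")
  putStrLn ("AMD EPYC, 80 workers: ~" ++ show epyc ++ " busy cores")
```

So the Mac keeps roughly 7.4 of its 10 workers busy, while the EPYC box burns ~52 cores' worth of CPU time yet still finishes slower, which is what suggests contention rather than lack of parallelism.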
Configuration
- testLimit: 10,000,000
- Contract: Simple dummy test contract
- Identical Echidna configuration across both environments, differing only in worker count
Expected Behavior
With 8x more workers and similar single-core performance, we anticipated faster execution on the AMD EPYC server, not a run that is 44% slower.
Actual Behavior
Instead, increasing the worker count on high core-count CPUs appears to negatively impact performance. This may point to parallelization inefficiencies or Haskell runtime limitations when scaling across many cores.
Environment
- Remote: AMD EPYC 9655 (96 cores) - 80 workers, 1.18 TB RAM
- Local: Apple M2 (12 cores) - 10 workers, 16 GB RAM
- Echidna version: 2.2.6
Code example to reproduce the issue:
contract DummyTest {
    uint256 public storedValue;
    address public owner;

    constructor() payable {
        owner = msg.sender;
    }

    function setValues(uint256 _val, address _addr) public {
        storedValue = _val;
        owner = _addr;
    }
}
// Echidna relevant config
testLimit: 10000000
seqLen: 100
shrinkLimit: 1500
coverage: false
format: "text"
Version:
2.2.6
Relevant log output:
Can you experiment with different values for the +RTS commands? You will need to run echidna normally but append +RTS ... at the end, with different options. For instance (+RTS -A64m -n4m).
Some options to try:
- Different values of -A and -n. According to this PR, the default options are -A64m -N.
- -qc
- --numa (in case your hardware supports that)
Check the documentation about these flags here: https://downloads.haskell.org/ghc/latest/docs/users_guide/runtime_control.html#rts-options-to-control-the-garbage-collector
If none of these cause any positive effect, we can investigate other causes.
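As a quick way to confirm the +RTS options are actually taking effect, you can query the RTS from a small standalone Haskell program (this is a minimal probe, not part of Echidna):

```haskell
-- Minimal RTS probe: prints the capability count set by +RTS -N and the
-- logical processor count the RTS sees.
-- Build with:  ghc -threaded -rtsopts Probe.hs
-- Run e.g.:    ./Probe +RTS -N96 -A64m -n4m -RTS
import GHC.Conc (getNumCapabilities, getNumProcessors)

main :: IO ()
main = do
  caps  <- getNumCapabilities  -- the effective value of +RTS -N
  procs <- getNumProcessors    -- logical processors visible to the RTS
  putStrLn ("capabilities (+RTS -N): " ++ show caps)
  putStrLn ("processors reported:    " ++ show procs)
```

If the capability count printed does not match the -N value you passed, the flags are not reaching the RTS (e.g. the binary was built without -rtsopts).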
Will try this and report back here, thanks
Results of the benchmark runs with different +RTS configurations, using trailofbits/echidna:testing-dev-hashmap on the same setup (80 and 128 workers):
80 workers:
- v2.2.6 release: real 1m42.355s user 89m34.071s sys 68m53.060s
- default: real 2m34.985s user 172m46.688s sys 82m45.823s
- +RTS -N96 -A64m: real 2m14.713s user 71m4.811s sys 80m50.053s
- +RTS -N192 -A128m: real 2m10.418s user 113m29.807s sys 86m25.891s
- +RTS -N192 -A256m -n4m: real 1m49.104s user 65m27.358s sys 89m56.903s
- +RTS -N192 -A256m -n4m -c: real 2m18.046s user 18m15.003s sys 93m5.549s
- +RTS -N192 -A256m -n4m -qg: real 2m19.840s user 17m28.714s sys 96m33.390s
- +RTS -N192 -A4g -n128m: real 1m33.985s user 17m47.409s sys 86m7.803s
128 workers:
- +RTS -N128 -A4g -n128m: real 1m34.714s user 19m57.739s sys 146m36.802s
- +RTS -N192 -A2g -n256m -G2 -RTS: real 1m34.225s user 26m39.562s sys 149m56.185s
- +RTS -N192 -A4g -n128m -qn24: real 1m34.670s user 16m33.739s sys 145m55.784s
- +RTS -N192 -A4g -n128m -qn48: real 1m34.846s user 17m16.901s sys 145m13.717s
- +RTS -N192 -A4g -n128m -qn96: real 1m36.991s user 18m42.528s sys 148m22.465s
Output after passing -s for extra GC information +RTS -N128 -A4g -n128m -qn96 -S:
1,843,537,402,088 bytes allocated in the heap
2,095,867,968 bytes copied during GC
103,270,536 bytes maximum residency (4 sample(s))
26,740,600 bytes maximum slop
541309 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 65 colls, 65 par 144.097s 5.375s 0.0827s 0.1001s
Gen 1 4 colls, 3 par 2.689s 0.720s 0.1799s 0.3644s
Parallel GC work balance: 80.18% (serial 0%, perfect 100%)
TASKS: 387 (1 bound, 386 peak workers (386 total), using -N128)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 2.574s ( 2.568s elapsed)
MUT time 10273.174s ( 94.291s elapsed)
GC time 146.786s ( 6.094s elapsed)
EXIT time 0.499s ( 0.015s elapsed)
Total time 10423.034s (102.969s elapsed)
Alloc rate 179,451,581 bytes per MUT second
Productivity 98.6% of total user, 91.6% of total elapsed
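For what it's worth, the derived figures in that summary can be re-checked from the raw numbers; a small sketch with the constants copied from the output above:

```haskell
-- Sanity-check of the derived numbers in the -s summary above.
-- Headline point: GC elapsed time is only ~6 s of ~103 s wall time, so
-- with -A4g the garbage collector is no longer the dominant cost.

allocatedBytes, mutUserSec, totalUserSec, mutElapsedSec, totalElapsedSec :: Double
allocatedBytes  = 1843537402088  -- bytes allocated in the heap
mutUserSec      = 10273.174      -- MUT time (user)
totalUserSec    = 10423.034      -- Total time (user)
mutElapsedSec   = 94.291         -- MUT time (elapsed)
totalElapsedSec = 102.969        -- Total time (elapsed)

allocRate :: Double
allocRate = allocatedBytes / mutUserSec          -- ~179,451,589 bytes/MUT s

productivityUser, productivityElapsed :: Double
productivityUser    = 100 * mutUserSec / totalUserSec        -- ~98.6 %
productivityElapsed = 100 * mutElapsedSec / totalElapsedSec  -- ~91.6 %

main :: IO ()
main = mapM_ print [allocRate, productivityUser, productivityElapsed]
```

Since GC accounts for so little elapsed time here, while the 128-worker benchmark lines show sys time (145m+) far exceeding user time, the remaining bottleneck may lie in kernel/scheduler overhead rather than in the GC itself.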
So far, no big improvement has been observed despite the different configurations.
@vnmrtz have you experimented with multiple --workers values on that box? Around which value do you see that adding more workers stops improving performance (or reduces it)?