[Bug-Candidate]: Poor scaling performance with high worker count - 80 workers on a 96-core CPU slower than 10 workers on a 12-core Mac
Describe the issue:
Issue Summary
Echidna appears to show limited performance gains when increasing the number of workers on high-core-count systems. In our tests, running 80 workers on 96-core AMD EPYC hardware resulted in slower execution than running 10 workers on a Mac M2, despite the server offering significantly more cores and comparable single-thread performance.
Performance Results
- Local (Mac M2, 10 workers): 72.35 seconds real time, user time 8m59s
- Remote (AMD EPYC, 80 workers): 104.12 seconds real time, user time 90m18s
These results indicate that scaling is far from linear and may be limited by internal bottlenecks, potentially related to garbage collection, as noted in prior discussions with @gustavo-grieco.
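One rough way to quantify the gap: user time divided by real time approximates the average number of cores kept busy. A minimal sketch using the figures above (the helper name is illustrative; on the EPYC box the user time includes GC and spin overhead, so this is an upper bound on useful work):

```haskell
-- user time / real time ~ average number of busy cores during the run
effectiveParallelism :: Double -> Double -> Double
effectiveParallelism userSec realSec = userSec / realSec

main :: IO ()
main = do
  let mac  = effectiveParallelism 539  72.35   -- Mac M2: user 8m59s = 539 s
      epyc = effectiveParallelism 5418 104.12  -- EPYC: user 90m18s = 5418 s
  putStrLn ("Mac M2,   10 workers: ~" ++ show mac  ++ " busy cores")
  putStrLn ("AMD EPYC, 80 workers: ~" ++ show epyc ++ " busy cores")
```

So the Mac keeps roughly 7.4 of its 10 workers busy, while the EPYC box burns ~52 cores' worth of CPU time yet still finishes slower, which is what suggests contention rather than lack of parallelism.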
Configuration
- testLimit: 10,000,000
- Contract: Simple dummy test contract
- Identical Echidna configuration across both environments, differing only in worker count
Expected Behavior
With 8x more workers and similar single-core performance, we anticipated faster execution on the AMD EPYC server, not a run that is 44% slower.
Actual Behavior
Instead, increasing the worker count on high core-count CPUs appears to negatively impact performance. This may point to parallelization inefficiencies or Haskell runtime limitations when scaling across many cores.
Environment
- Remote: AMD EPYC 9655 (96 cores) - 80 workers, 1.18 TB RAM
- Local: Apple M2 (12 cores) - 10 workers, 16 GB RAM
- Echidna version: 2.2.6
Code example to reproduce the issue:
contract DummyTest {
    uint256 public storedValue;
    address public owner;

    constructor() payable {
        owner = msg.sender;
    }

    function setValues(uint256 _val, address _addr) public {
        storedValue = _val;
        owner = _addr;
    }
}
// Echidna relevant config
testLimit: 10000000
seqLen: 100
shrinkLimit: 1500
coverage: false
format: "text"
Version:
2.2.6
Relevant log output:
Can you experiment with different values for the +RTS commands? You will need to run echidna normally but append +RTS ... at the end, with different options. For instance (+RTS -A64m -n4m).
Some options to try:
- Different values of -A and -n. According to this PR, the default options are -A64m -N.
- -qc
- --numa (in case your hardware supports that)
Check the documentation about these flags here: https://downloads.haskell.org/ghc/latest/docs/users_guide/runtime_control.html#rts-options-to-control-the-garbage-collector
If none of these cause any positive effect, we can investigate other causes.
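As a quick way to confirm the +RTS options are actually taking effect, you can query the RTS from a small standalone Haskell program (this is a minimal probe, not part of Echidna):

```haskell
-- Minimal RTS probe: prints the capability count set by +RTS -N and the
-- logical processor count the RTS sees.
-- Build with:  ghc -threaded -rtsopts Probe.hs
-- Run e.g.:    ./Probe +RTS -N96 -A64m -n4m -RTS
import GHC.Conc (getNumCapabilities, getNumProcessors)

main :: IO ()
main = do
  caps  <- getNumCapabilities  -- the effective value of +RTS -N
  procs <- getNumProcessors    -- logical processors visible to the RTS
  putStrLn ("capabilities (+RTS -N): " ++ show caps)
  putStrLn ("processors reported:    " ++ show procs)
```

If the capability count printed does not match the -N value you passed, the flags are not reaching the RTS (e.g. the binary was built without -rtsopts).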
Will try this and report back here, thanks
Results of the benchmark runs with different +RTS configurations, using trailofbits/echidna:testing-dev-hashmap on the same setup (80 and 128 workers):
80 workers:
- v2.2.6 release: real 1m42.355s user 89m34.071s sys 68m53.060s
- default: real 2m34.985s user 172m46.688s sys 82m45.823s
- +RTS -N96 -A64m: real 2m14.713s user 71m4.811s sys 80m50.053s
- +RTS -N192 -A128m: real 2m10.418s user 113m29.807s sys 86m25.891s
- +RTS -N192 -A256m -n4m: real 1m49.104s user 65m27.358s sys 89m56.903s
- +RTS -N192 -A256m -n4m -c: real 2m18.046s user 18m15.003s sys 93m5.549s
- +RTS -N192 -A256m -n4m -qg: real 2m19.840s user 17m28.714s sys 96m33.390s
- +RTS -N192 -A4g -n128m: real 1m33.985s user 17m47.409s sys 86m7.803s
128 workers:
- +RTS -N128 -A4g -n128m: real 1m34.714s user 19m57.739s sys 146m36.802s
- +RTS -N192 -A2g -n256m -G2 -RTS: real 1m34.225s user 26m39.562s sys 149m56.185s
- +RTS -N192 -A4g -n128m -qn24: real 1m34.670s user 16m33.739s sys 145m55.784s
- +RTS -N192 -A4g -n128m -qn48: real 1m34.846s user 17m16.901s sys 145m13.717s
- +RTS -N192 -A4g -n128m -qn96: real 1m36.991s user 18m42.528s sys 148m22.465s
Output after passing -s for extra GC information +RTS -N128 -A4g -n128m -qn96 -S:
1,843,537,402,088 bytes allocated in the heap
2,095,867,968 bytes copied during GC
103,270,536 bytes maximum residency (4 sample(s))
26,740,600 bytes maximum slop
541309 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 65 colls, 65 par 144.097s 5.375s 0.0827s 0.1001s
Gen 1 4 colls, 3 par 2.689s 0.720s 0.1799s 0.3644s
Parallel GC work balance: 80.18% (serial 0%, perfect 100%)
TASKS: 387 (1 bound, 386 peak workers (386 total), using -N128)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 2.574s ( 2.568s elapsed)
MUT time 10273.174s ( 94.291s elapsed)
GC time 146.786s ( 6.094s elapsed)
EXIT time 0.499s ( 0.015s elapsed)
Total time 10423.034s (102.969s elapsed)
Alloc rate 179,451,581 bytes per MUT second
Productivity 98.6% of total user, 91.6% of total elapsed
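For what it's worth, the derived figures in that summary can be re-checked from the raw numbers; a small sketch with the constants copied from the output above:

```haskell
-- Sanity-check of the derived numbers in the -s summary above.
-- Headline point: GC elapsed time is only ~6 s of ~103 s wall time, so
-- with -A4g the garbage collector is no longer the dominant cost.

allocatedBytes, mutUserSec, totalUserSec, mutElapsedSec, totalElapsedSec :: Double
allocatedBytes  = 1843537402088  -- bytes allocated in the heap
mutUserSec      = 10273.174      -- MUT time (user)
totalUserSec    = 10423.034      -- Total time (user)
mutElapsedSec   = 94.291         -- MUT time (elapsed)
totalElapsedSec = 102.969        -- Total time (elapsed)

allocRate :: Double
allocRate = allocatedBytes / mutUserSec          -- ~179,451,589 bytes/MUT s

productivityUser, productivityElapsed :: Double
productivityUser    = 100 * mutUserSec / totalUserSec        -- ~98.6 %
productivityElapsed = 100 * mutElapsedSec / totalElapsedSec  -- ~91.6 %

main :: IO ()
main = mapM_ print [allocRate, productivityUser, productivityElapsed]
```

Since GC accounts for so little elapsed time here, while the 128-worker benchmark lines show sys time (145m+) far exceeding user time, the remaining bottleneck may lie in kernel/scheduler overhead rather than in the GC itself.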
So far, no big improvement has been observed despite the different configurations.
@vnmrtz have you experimented with multiple --workers values on that box? Around which value do you see that adding more workers stops improving performance (or reduces it)?