Replicas slower than separate jobs on GNI systems
Original issue: https://charm.cs.illinois.edu/redmine/issues/1676
On Blue Waters, a 200-node, 50-replica (4 nodes per replica), non-SMP run of apoa1 is uniformly and significantly slower than a separate 4-node run. No idea why. Topology-aware partitioning seems to be working OK. Testing MPI on Bridges to see if it happens there too.
Original date: 2017-09-14 00:20:20
What commit of charm are you using? We recently merged changes to make broadcasts and reductions topology-aware.
Original date: 2017-09-14 02:33:31
6.8.0 from Sept 5 (v6.8.0-0-ga36028e-namd-charm-6.8.0-build-2017-Sep-05-28093). No observed performance difference on Bridges.
Original date: 2017-09-14 02:59:42
I see the exact same performance for v6.7.0-574-g7d61794-namd-charm-6.8.0-build-2017-Jan-23-80737 and v6.7.0-0-g46f867c-namd-charm-6.7.0-build-2015-Dec-21-45876. Definitely not a recent change.
Original date: 2017-09-14 15:45:19
The bug does not affect the MPI layer on Blue Waters. Still need to test verbs.
Original date: 2017-09-15 14:56:44
The verbs layer does not appear to be affected.
Original date: 2017-09-15 17:50:58
Just to be clear, the only context in which this bug has been observed is GNI on Blue Waters?
MPI there is unaffected, verbs is unaffected. MPI on Bridges is not affected, I think?
We may not hold the 6.8.1 release for this, since it's not any sort of recent issue or regression. We'll obviously try to get it dealt with quickly.
Original date: 2017-09-15 21:33:32
Correct, as far as I know this is a GNI issue. I've only tested on Blue Waters. It may or may not affect Titan, Eos, Edison, Cori, Theta, Piz Daint, etc.
Original date: 2017-10-01 10:10:34
I can confirm that the bug also affects Cori (XC40), so I would assume all XC/XE/XK machines.
Original date: 2017-10-03 16:47:20
Asking about test-case reduction, since there are basically no progress notes on this issue:
- Is a 2 node, 2 replica job markedly slower than a 1 node single job?
- If no to the above, is a 4 node, 2 replica job slower than a 2 node single job?
Original date: 2017-10-03 19:14:22
No and no. I've been using 4 nodes per replica, non-SMP. The effect starts to become visible above the noise at 16 replicas and stands out at 64 replicas.
Original date: 2017-10-03 19:19:49
OK, so the effect grows in magnitude with replica count, and requires at least a few nodes to occur.
What about the other direction: say, a 64- or 128-node job with 2 replicas?
Do you have data to know if all of the replicas are slow, or are they mostly fast, and getting delayed by some interaction with one or a few slow replicas?
Original date: 2017-10-03 19:32:24
All of the replicas are uniformly slow. There is no inter-replica interaction. I haven't looked at large node counts with small replica counts.
Original date: 2017-10-04 22:13:28
From some basic profiling, it appears that the amount of time spent in alloc_mempool_block (but not the number of calls) increases dramatically as the number of nodes and replicas is increased proportionately (from 8 nodes / 2 replicas to 16 nodes / 4 replicas). I'm getting crashes beyond 16 nodes.
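For context, here is a minimal, generic instrumentation sketch (not Charm++ source; the wrapper and all names are hypothetical) of the kind of measurement that separates cumulative time from call count, which is what makes the alloc_mempool_block observation above meaningful:

/* Hypothetical sketch (not Charm++ source): a generic way to confirm that
 * cumulative time in an allocation routine grows while its call count does
 * not. All names here are made up for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned long long g_calls;    /* how many times the allocator ran      */
static double g_total_sec;            /* cumulative wall-clock time inside it  */

/* Stand-in for the routine being measured. */
static void *alloc_block(size_t size) { return malloc(size); }

/* Wrapper that callers use instead of alloc_block(). */
static void *timed_alloc_block(size_t size)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void *p = alloc_block(size);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    g_calls++;
    g_total_sec += (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return p;
}

static void report(void)
{
    printf("calls=%llu total=%.3fs avg=%.6fs\n",
           g_calls, g_total_sec, g_calls ? g_total_sec / g_calls : 0.0);
}

int main(void)
{
    for (int i = 0; i < 1000; i++)
        free(timed_alloc_block(8 << 20));  /* 8 MB blocks, as in the pool init size above */
    report();
    return 0;
}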
Original date: 2017-10-05 23:08:46
Could you try the same test (4 nodes per replica, increasing replica count) with +useDynamicSmsg? I'm kinda suspecting a lot of memory is being set aside for communication among increasing numbers of nodes, even as the actual communication graph has low fixed degree.
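To illustrate the shape of that suspicion, here is a back-of-the-envelope sketch (purely illustrative; PER_PEER_KB is an assumed placeholder, not a value from the Charm++ GNI layer). It contrasts how many peers each PE would pre-provision buffers for if setup were job-wide versus replica-local, at 4 nodes per replica and 31 PEs per node as used in this thread:

/* Back-of-the-envelope sketch of the hypothesis above: with job-wide,
 * per-peer pre-allocation, each PE's setup cost grows with the total PE
 * count, even though it only ever communicates inside its own 4-node
 * replica. PER_PEER_KB is an assumed placeholder, not a Charm++ constant. */
#include <stdio.h>

#define PES_PER_NODE      31    /* non-SMP, +pemap 0-30          */
#define NODES_PER_REPLICA  4
#define PER_PEER_KB       5.0   /* assumed per-peer buffer size  */

int main(void)
{
    for (int replicas = 2; replicas <= 64; replicas *= 2) {
        int total_pes   = replicas * NODES_PER_REPLICA * PES_PER_NODE;
        int replica_pes = NODES_PER_REPLICA * PES_PER_NODE;
        printf("%2d replicas: %5d PEs total; per-PE peers if job-wide: %5d (~%7.1f KB), "
               "if replica-local: %3d (~%6.1f KB)\n",
               replicas, total_pes,
               total_pes - 1,   (total_pes - 1)   * PER_PEER_KB,
               replica_pes - 1, (replica_pes - 1) * PER_PEER_KB);
    }
    return 0;
}

The point is only the scaling: the job-wide figure grows linearly with the replica count, while the replica-local figure stays fixed.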
Original date: 2017-10-05 23:09:18
And while you're at it, could you post your full command line and the runtime's startup output?
Original date: 2017-10-06 02:20:41
For 16 nodes:
aprun -n 496 -r 1 -N 31 -d 1 /u/sciteam/jphillip/NAMD_LATEST_CRAY-XE-ugni-BlueWaters/namd2 +pemap 0-30 +useDynamicSmsg +replicas 4 +stdout output/%d/test22.%d.log /u/sciteam/jphillip/apoa1/apoa1.namd
From a 20-node run:
Charm++> Running on Gemini (GNI) with 620 processes
Charm++> static SMSG
Charm++> SMSG memory: 3061.2KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: numPes 620
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
And the first replica of that run:
Converse/Charm++ Commit ID: v6.8.0-0-ga36028e-namd-charm-6.8.0-build-2017-Sep-05-28093
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 0-30
Charm++> Running on 4 unique compute nodes (32-way SMP).
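For reference on the arithmetic: 496 ranks = 16 nodes × 31 PEs per node, and +replicas 4 splits them into replicas of 124 PEs, i.e. 4 nodes each, which matches the "Running on 4 unique compute nodes" line. The startup output quoted above is from a different, 20-node run: 620 PEs = 20 × 31, which at 4 nodes per replica would be 5 replicas. If the reported static SMSG memory is the per-process mailbox total, it works out to roughly 3061.2 KB / 619 peers ≈ 4.9 KB per peer, which, if correct, would mean mailbox space is provisioned for every PE in the job rather than only for the 123 peers inside a replica.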
Original date: 2017-10-06 02:59:27
Sorry, dynamic and static SMSG have indistinguishable performance at large replica counts, although the final WallClock output is actually longer for dynamic SMSG than for static SMSG.
Has this been replicated on newer Cray architectures?