Replicas slower than separate jobs on GNI systems
Original issue: https://charm.cs.illinois.edu/redmine/issues/1676
On Blue Waters, a 200-node, 50-replica (4 nodes per replica), non-SMP run of apoa1 is uniformly and significantly slower than a separate 4-node run. No idea why. Topology-aware partitioning seems to be working OK. Testing MPI on Bridges to see if it happens there too.
Original date: 2017-09-14 00:20:20
What commit of charm are you using? We recently merged changes to make broadcasts and reductions topology-aware.
Original date: 2017-09-14 02:33:31
6.8.0 from Sept 5 (v6.8.0-0-ga36028e-namd-charm-6.8.0-build-2017-Sep-05-28093). No observed performance difference on Bridges.
Original date: 2017-09-14 02:59:42
I see the exact same performance for v6.7.0-574-g7d61794-namd-charm-6.8.0-build-2017-Jan-23-80737 and v6.7.0-0-g46f867c-namd-charm-6.7.0-build-2015-Dec-21-45876. Definitely not a recent change.
Original date: 2017-09-14 15:45:19
The bug does not affect the MPI layer on Blue Waters. Still need to test verbs.
Original date: 2017-09-15 14:56:44
The verbs layer does not appear to be affected.
Original date: 2017-09-15 17:50:58
Just to be clear, the only context in which this bug has been observed is GNI on Blue Waters?
MPI there is unaffected, verbs is unaffected. MPI on Bridges is not affected, I think?
We may not hold the 6.8.1 release for this, since it's not any sort of recent issue or regression. We'll obviously try to get it dealt with quickly.
Original date: 2017-09-15 21:33:32
Correct, as far as I know this is a GNI issue. I've only tested on Blue Waters. It may or may not affect Titan, Eos, Edison, Cori, Theta, Piz Daint, etc.
Original date: 2017-10-01 10:10:34
I can confirm that the bug also affects Cori (XC40), so I would assume all XC/XE/XK machines.
Original date: 2017-10-03 16:47:20
Asking about test-case reduction, since there are basically no progress notes on this issue:
- Is a 2 node, 2 replica job markedly slower than a 1 node single job?
- If no to the above, is a 4 node, 2 replica job slower than a 2 node single job?
Original date: 2017-10-03 19:14:22
No and no. I've been using 4 nodes per replica, non-SMP. The effect starts to become visible above the noise at 16 replicas and stands out at 64 replicas.
Original date: 2017-10-03 19:19:49
OK, so the effect grows in magnitude with replica count, and requires at least a few nodes to occur.
What about the other direction: say, a 64- or 128-node job with 2 replicas?
Do you have data to know if all of the replicas are slow, or are they mostly fast, and getting delayed by some interaction with one or a few slow replicas?
Original date: 2017-10-03 19:32:24
All of the replicas are uniformly slow. There is no inter-replica interaction. I haven't looked at large node counts with small replica counts.
Original date: 2017-10-04 22:13:28
From some basic profiling, it appears that the amount of time spent in alloc_mempool_block (but not the number of calls) increases dramatically as the number of nodes and replicas is increased proportionately (from 8 nodes / 2 replicas to 16 nodes / 4 replicas). I'm getting crashes beyond 16 nodes.
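For context, here is a minimal, generic instrumentation sketch (not Charm++ source; the wrapper and all names are hypothetical) of the kind of measurement that separates cumulative time from call count, which is what makes the alloc_mempool_block observation above meaningful:

/* Hypothetical sketch (not Charm++ source): a generic way to confirm that
 * cumulative time in an allocation routine grows while its call count does
 * not. All names here are made up for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned long long g_calls;    /* how many times the allocator ran      */
static double g_total_sec;            /* cumulative wall-clock time inside it  */

/* Stand-in for the routine being measured. */
static void *alloc_block(size_t size) { return malloc(size); }

/* Wrapper that callers use instead of alloc_block(). */
static void *timed_alloc_block(size_t size)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void *p = alloc_block(size);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    g_calls++;
    g_total_sec += (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return p;
}

static void report(void)
{
    printf("calls=%llu total=%.3fs avg=%.6fs\n",
           g_calls, g_total_sec, g_calls ? g_total_sec / g_calls : 0.0);
}

int main(void)
{
    for (int i = 0; i < 1000; i++)
        free(timed_alloc_block(8 << 20));  /* 8 MB blocks, as in the pool init size above */
    report();
    return 0;
}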
Original date: 2017-10-05 23:08:46
Could you try the same test (4 nodes per replica, increasing replica count) with +useDynamicSmsg? I'm kinda suspecting a lot of memory is being set aside for communication among increasing numbers of nodes, even as the actual communication graph has low fixed degree.
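To illustrate the shape of that suspicion, here is a back-of-the-envelope sketch (purely illustrative; PER_PEER_KB is an assumed placeholder, not a value from the Charm++ GNI layer). It contrasts how many peers each PE would pre-provision buffers for if setup were job-wide versus replica-local, at 4 nodes per replica and 31 PEs per node as used in this thread:

/* Back-of-the-envelope sketch of the hypothesis above: with job-wide,
 * per-peer pre-allocation, each PE's setup cost grows with the total PE
 * count, even though it only ever communicates inside its own 4-node
 * replica. PER_PEER_KB is an assumed placeholder, not a Charm++ constant. */
#include <stdio.h>

#define PES_PER_NODE      31    /* non-SMP, +pemap 0-30          */
#define NODES_PER_REPLICA  4
#define PER_PEER_KB       5.0   /* assumed per-peer buffer size  */

int main(void)
{
    for (int replicas = 2; replicas <= 64; replicas *= 2) {
        int total_pes   = replicas * NODES_PER_REPLICA * PES_PER_NODE;
        int replica_pes = NODES_PER_REPLICA * PES_PER_NODE;
        printf("%2d replicas: %5d PEs total; per-PE peers if job-wide: %5d (~%7.1f KB), "
               "if replica-local: %3d (~%6.1f KB)\n",
               replicas, total_pes,
               total_pes - 1,   (total_pes - 1)   * PER_PEER_KB,
               replica_pes - 1, (replica_pes - 1) * PER_PEER_KB);
    }
    return 0;
}

The point is only the scaling: the job-wide figure grows linearly with the replica count, while the replica-local figure stays fixed.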
Original date: 2017-10-05 23:09:18
And while you're at it, could you post your full command line and the runtime's startup output?
Original date: 2017-10-06 02:20:41
For 16 nodes:
aprun -n 496 -r 1 -N 31 -d 1 /u/sciteam/jphillip/NAMD_LATEST_CRAY-XE-ugni-BlueWaters/namd2 +pemap 0-30 +useDynamicSmsg +replicas 4 +stdout output/%d/test22.%d.log /u/sciteam/jphillip/apoa1/apoa1.namd
From a 20-node run:
Charm++> Running on Gemini (GNI) with 620 processes
Charm++> static SMSG
Charm++> SMSG memory: 3061.2KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: numPes 620
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
And the first replica of that run:
Converse/Charm++ Commit ID: v6.8.0-0-ga36028e-namd-charm-6.8.0-build-2017-Sep-05-28093
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 0-30
Charm++> Running on 4 unique compute nodes (32-way SMP).
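For reference on the arithmetic: 496 ranks = 16 nodes × 31 PEs per node, and +replicas 4 splits them into replicas of 124 PEs, i.e. 4 nodes each, which matches the "Running on 4 unique compute nodes" line. The startup output quoted above is from a different, 20-node run: 620 PEs = 20 × 31, which at 4 nodes per replica would be 5 replicas. If the reported static SMSG memory is the per-process mailbox total, it works out to roughly 3061.2 KB / 619 peers ≈ 4.9 KB per peer, which, if correct, would mean mailbox space is provisioned for every PE in the job rather than only for the 123 peers inside a replica.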
Original date: 2017-10-06 02:59:27
Sorry, dynamic and static SMSG have indistinguishable performance at large replica counts, although the final WallClock output is actually longer for dynamic SMSG than for static SMSG.
Has this been replicated on newer Cray architectures?