
Replicas slower than separate jobs on GNI systems

Open jcphill opened this issue 7 years ago • 18 comments

Original issue: https://charm.cs.illinois.edu/redmine/issues/1676


On Blue Waters, 200 nodes, 50 replicas (4 nodes per replica) non-smp running apoa1 is uniformly and significantly slower than a separate 4-node run. No idea why. Topology-aware partitioning seems to be working OK. Testing MPI on Bridges to see if it happens there too.
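For reference, a sketch of the two runs being compared, written in the aprun style used later in the thread (non-smp, 31 PEs per node; paths and log names are illustrative):

# 50-replica aggregate job: 200 nodes, 4 nodes (124 PEs) per replica
aprun -n 6200 -r 1 -N 31 -d 1 ./namd2 +pemap 0-30 \
      +replicas 50 +stdout output/%d/apoa1.%d.log apoa1.namd

# Separate 4-node (124 PE) single-replica baseline
aprun -n 124 -r 1 -N 31 -d 1 ./namd2 +pemap 0-30 apoa1.namd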

jcphill avatar Sep 13 '17 21:09 jcphill

Original date: 2017-09-14 00:20:20


What commit of charm are you using? We recently merged changes to make broadcasts and reductions topology-aware.

stwhite91 avatar Apr 25 '19 02:04 stwhite91

Original date: 2017-09-14 02:33:31


6.8.0 from Sept 5 (v6.8.0-0-ga36028e-namd-charm-6.8.0-build-2017-Sep-05-28093). No observed performance difference on Bridges.

jcphill avatar Apr 25 '19 02:04 jcphill

Original date: 2017-09-14 02:59:42


I see the exact same performance for v6.7.0-574-g7d61794-namd-charm-6.8.0-build-2017-Jan-23-80737 and v6.7.0-0-g46f867c-namd-charm-6.7.0-build-2015-Dec-21-45876. Definitely not a recent change.

jcphill avatar Apr 25 '19 02:04 jcphill

Original date: 2017-09-14 15:45:19


The bug does not affect the MPI layer on Blue Waters. Still need to test verbs.

jcphill avatar Apr 25 '19 02:04 jcphill

Original date: 2017-09-15 14:56:44


The verbs layer does not appear to be affected.

jcphill avatar Apr 25 '19 02:04 jcphill

Original date: 2017-09-15 17:50:58


Just to be clear, the only context in which this bug has been observed is GNI on Blue Waters?

MPI there is unaffected, verbs is unaffected. MPI on Bridges is not affected, I think?

We may not hold the 6.8.1 release for this, since it's not any sort of recent issue or regression. We'll obviously try to get it dealt with quickly.

PhilMiller avatar Apr 25 '19 02:04 PhilMiller

Original date: 2017-09-15 21:33:32


Correct, as far as I know this is a GNI issue. I've only tested on Blue Waters. It may or may not affect Titan, Eos, Edison, Cori, Theta, Piz Daint, etc.

jcphill avatar Apr 25 '19 02:04 jcphill

Original date: 2017-10-01 10:10:34


I can confirm that the bug also affects Cori (XC40), so I would assume all XC/XE/XK machines.

jcphill avatar Apr 25 '19 02:04 jcphill

Original date: 2017-10-03 16:47:20


Querying test-case reduction, since there are basically no progress notes on this issue:

  • Is a 2 node, 2 replica job markedly slower than a 1 node single job?
  • If no to the above, is a 4 node, 2 replica job slower than a 2 node single job?

PhilMiller avatar Apr 25 '19 02:04 PhilMiller

Original date: 2017-10-03 19:14:22


No and no. I've been using 4 nodes per replica, non-smp. The effect starts to be visible above noise at 16 replicas, stands out at 64 replicas.
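A sweep along those lines might look like the following (a sketch only, assuming the same apoa1 input, 31 PEs per node, and the +stdout pattern from the command line posted further down):

# Keep 4 nodes (124 PEs) per replica and scale replicas and nodes together;
# the slowdown reportedly becomes visible around 16 replicas.
for R in 2 4 8 16 32 64; do
  NODES=$((4 * R))
  aprun -n $((NODES * 31)) -r 1 -N 31 -d 1 ./namd2 +pemap 0-30 \
        +replicas $R +stdout output/%d/sweep.$R.%d.log apoa1.namd
done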

jcphill avatar Apr 25 '19 02:04 jcphill

Original date: 2017-10-03 19:19:49


Ok, so the effect grows in magnitude with replica count, and requires at least a few nodes to occur.

What about the other direction - say 64 or 128 node job with 2 replicas?

Do you have data to know if all of the replicas are slow, or are they mostly fast, and getting delayed by some interaction with one or a few slow replicas?

PhilMiller avatar Apr 25 '19 02:04 PhilMiller

Original date: 2017-10-03 19:32:24


All of the replicas are uniformly slow. There is no inter-replica interaction. I haven't looked at large node counts with small replica counts.
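One quick way to confirm the "uniformly slow" observation is to compare timings across the per-replica logs (a sketch, assuming the +stdout output/%d/... naming used in the command line below and NAMD's standard "Benchmark time:" output):

# Every replica writes its own log; if they are uniformly slow, all of them
# will report similarly elevated s/step values.
grep -H "Benchmark time:" output/*/test22.*.log | sort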

jcphill avatar Apr 25 '19 02:04 jcphill

Original date: 2017-10-04 22:13:28


From some basic profiling it appears that the amount of time spent in alloc_mempool_block (but not the number of calls) increases dramatically as the number of nodes and replicas is increased proportionately (from 8 nodes 2 replicas to 16 nodes 4 replicas). I'm getting crashes beyond 16 nodes.
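For anyone trying to reproduce that measurement, one possible route on a Cray system is CrayPat function tracing (a sketch only; module names and pat_build options vary by site and perftools version, and this is not necessarily how the profile above was gathered):

# Instrument user-level functions (which should include the GNI machine
# layer's alloc_mempool_block in a statically linked namd2) and report
# per-function time and call counts.
module load perftools-base perftools
pat_build -u namd2                      # produces namd2+pat
aprun -n 496 -r 1 -N 31 -d 1 ./namd2+pat +pemap 0-30 +replicas 4 \
      +stdout output/%d/test22.%d.log apoa1.namd
pat_report namd2+pat+*                  # summarize the experiment data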

jcphill avatar Apr 25 '19 02:04 jcphill

Original date: 2017-10-05 23:08:46


Could you try the same test (4 nodes per replica, increasing replica count) with +useDynamicSmsg? I'm kinda suspecting a lot of memory is being set aside for communication among increasing numbers of nodes, even as the actual communication graph has low fixed degree.

PhilMiller avatar Apr 25 '19 02:04 PhilMiller

Original date: 2017-10-05 23:09:18


And while you're at it, could you post your full command line and the runtime's startup output?

PhilMiller avatar Apr 25 '19 02:04 PhilMiller

Original date: 2017-10-06 02:20:41


For 16 nodes:

aprun -n 496 -r 1 -N 31 -d 1 /u/sciteam/jphillip/NAMD_LATEST_CRAY-XE-ugni-BlueWaters/namd2 +pemap 0-30 +useDynamicSmsg +replicas 4 +stdout output/%d/test22.%d.log /u/sciteam/jphillip/apoa1/apoa1.namd

From a 20-node run:

Charm++> Running on Gemini (GNI) with 620 processes
Charm++> static SMSG
Charm++> SMSG memory: 3061.2KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: numPes 620
Charm++> Using recursive bisection (scheme 3) for topology aware partitions

and from the first replica of that run:

Converse/Charm++ Commit ID: v6.8.0-0-ga36028e-namd-charm-6.8.0-build-2017-Sep-05-28093
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 0-30
Charm++> Running on 4 unique compute nodes (32-way SMP).
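Those startup figures give a rough sense of scale for the earlier memory-scaling hypothesis (a back-of-envelope only, assuming the reported SMSG memory is a per-process total that grows roughly linearly with the number of peer processes):

# ~3 MB of static SMSG mailboxes at 620 processes is about 5 KB per peer; at
# that rate a 200-node non-smp run (6200 processes) would set aside roughly
# 30 MB per process, even though each replica only talks to its own 124 PEs.
awk 'BEGIN {
  per_peer_kb = 3061.2 / (620 - 1)
  printf "per peer: %.1f KB; at 6200 procs: %.0f MB per process\n",
         per_peer_kb, per_peer_kb * (6200 - 1) / 1024
}'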

jcphill avatar Apr 25 '19 02:04 jcphill

Original date: 2017-10-06 02:59:27


Sorry, dynamic and static SMSG have indistinguishable performance at large replica counts, although the final WallClock time reported is actually longer for dynamic SMSG than for static SMSG.

jcphill avatar Apr 25 '19 02:04 jcphill

Has this been replicated on newer Cray architectures?

ericjbohm avatar Apr 09 '20 15:04 ericjbohm