ChaNGa crashes/hangs with UCX machine layer (in SMP mode) on Frontera
The 64-node run with the h148 cosmo dataset caused assertion failures at different places. ChaNGa_6.10_debug_64nodes_h148_run.txt contains the run output.
The most common assertion failure was:
[3536] Stack Traceback:
[1574806880.828857] [c104-124:313563:0] ib_md.c:478 UCX ERROR ibv_reg_mr(address=0x2b4d043eaaf0, length=7472, access=0xf) failed: Cannot allocate memory
[1574806880.829289] [c104-124:313563:0] ucp_mm.c:111 UCX ERROR failed to register address 0x2b4d043eaaf0 length 7472 on md[4]=ib/mlx5_0: Input/output error
[1574806880.829292] [c104-124:313563:0] ucp_request.c:264 UCX ERROR failed to register user buffer datatype 0x8 address 0x2b4d043eaaf0 len 7472: Input/output error
------------- Processor 3535 Exiting: Called CmiAbort ------------
Reason: [3535] Assertion "!(((uintptr_t)(status_ptr)) >= ((uintptr_t)UCS_ERR_LAST))" failed in file machine.C line 570.
Others were:
Reason: Converse zero handler executed-- was a message corrupted?
[c104-122:253786:0:253845] ib_mlx5_log.c:139 Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[c104-122:253786:0:253845] ib_mlx5_log.c:139 DCI QP 0x29e38 wqe[347]: RDMA_READ s-- [rqpn 0x2944c rlid 7737] [rva 0x2b82a5ee8ab0 rkey 0x744099] [va 0x2acf49804ee0 len 3872 lkey 0x14f64ed]
[715] Stack Traceback:
[715:0] ChaNGa.smp 0x9ee9aa CmiAbortHelper(char const*, char const*, char const*, int, int)
[715:1] ChaNGa.smp 0x9eea82 CmiGetNonLocal
[715:2] ChaNGa.smp 0x9f6202
[715:3] ChaNGa.smp 0x9f6c26 CmiHandleMessage
[715:4] ChaNGa.smp 0x9f6feb CsdScheduleForever
[715:5] ChaNGa.smp 0x9f6f31 CsdScheduler
[715:6] ChaNGa.smp 0x9ee6f2
[715:7] ChaNGa.smp 0x9ee2e5 ConverseInit
[715:8] ChaNGa.smp 0x8abbaa charm_main
[715:9] ChaNGa.smp 0x8a24b4 main
[715:10] libc.so.6 0x2acd899df495 __libc_start_main
[715:11] ChaNGa.smp 0x6c3d8f
------------- Processor 3562 Exiting: Called CmiAbort ------------
Reason: [3562] Assertion "status == UCS_OK" failed in file machine.C line 474.
[3562] Stack Traceback:
[3562:0] ChaNGa.smp 0x9ee9aa CmiAbortHelper(char const*, char const*, char const*, int, int)
[3562:1] ChaNGa.smp 0x9eea82 CmiGetNonLocal
[3562:2] ChaNGa.smp 0x9fc899 CmiCopyMsg
[3562:3] ChaNGa.smp 0x9f3585 UcxTxReqCompleted(void*, ucs_status_t)
[3562:4] libucp.so.0 0x2b00f1674f5f ucp_proto_am_zcopy_req_complete
[3562:5] libuct_ib.so.0 0x2b00f3c218bf uct_rc_txqp_purge_outstanding
[3562:6] libuct_ib.so.0 0x2b00f3c3c402
[3562:7] libuct_ib.so.0 0x2b00f3c3bc25
[3562:8] libucp.so.0 0x2b00f1674122 ucp_worker_progress
[3562:9] ChaNGa.smp 0x9f37b6 LrtsAdvanceCommunication(int)
@brminich mentioned that the assertion failure Assertion "!(((uintptr_t)(status_ptr)) >= ((uintptr_t)UCS_ERR_LAST))" failed in file machine.C line 570 happens because of a memory allocation failure (the underlying ibv_reg_mr call cannot allocate memory).
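For context, UCX non-blocking sends return a pointer that either points to a request or encodes an error status, and the failing expression is exactly the condition behind UCX's UCS_PTR_IS_ERR() macro, presumably checking the pointer returned by a non-blocking send in the machine layer. Below is a minimal sketch of that pattern with the error translated into readable text; post_send and the ep/buf/len/tag/send_cb names are placeholders for illustration, not the actual machine.C variables.

#include <stdio.h>
#include <ucp/api/ucp.h>

/* Illustrative only: ep, buf, len, tag and send_cb are placeholder names,
 * not the variables used in Charm++'s UCX machine layer. */
static void post_send(ucp_ep_h ep, void *buf, size_t len, ucp_tag_t tag,
                      ucp_send_callback_t send_cb)
{
    ucs_status_ptr_t status_ptr =
        ucp_tag_send_nb(ep, buf, len, ucp_dt_make_contig(1), tag, send_cb);

    /* Same condition the assertion tests, but reporting the reason
     * (e.g. "Input/output error" after a failed ibv_reg_mr) before aborting. */
    if (UCS_PTR_IS_ERR(status_ptr)) {
        fprintf(stderr, "UCX send failed: %s\n",
                ucs_status_string(UCS_PTR_STATUS(status_ptr)));
        /* the real machine layer would call CmiAbort() here */
    }
}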
With just export UCX_ZCOPY_THRESH=-1, I saw a hang during the initial domain decomposition:
Charm++> Running in SMP mode: 64 processes, 55 worker threads (PEs) + 1 comm threads per process, 3520 PEs total
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.10.0-rc2-20-g55984a468
Warning> Randomization of virtual memory (ASLR) is turned on in the kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 1-55
Charm++> Running on 64 hosts (2 sockets x 28 cores x 1 PUs = 56-way SMP)
Charm++> cpu topology info is gathered in 0.079 seconds.
[0] MultistepLB_notopo created
WARNING: bKDK parameter ignored; KDK is always used.
WARNING: bStandard parameter ignored; Output is always standard.
Not Using CkLoop 0
WARNING: bCannonical parameter ignored; integration is always cannonical
WARNING: bOverwrite parameter ignored.
WARNING: bGeometric parameter ignored.
WARNING: star formation set without enabling SPH
Enabling SPH
ChaNGa version 3.4, commit v3.4-10-gf74cbe8
Running on 3520 processors/ 64 nodes with 262144 TreePieces
yieldPeriod set to 5
Prefetching...ON
Number of chunks for remote tree walk set to 1
Chunk Randomization...ON
cache 1
cacheLineDepth 4
Verbosity level 1
Domain decomposition...SFC Peano-Hilbert
CkLoopLib is used in SMP with simple dynamic scheduling (converse-level notification)
Created 262144 pieces of tree
Loading particles ... trying Tipsy ... took 11.011082 seconds.
N: 546538574
Input file, Time:0.286600 Redshift:0.140216 Expansion factor:0.877027
Simulation to Time:0.287039 Redshift:0.138666 Expansion factor:0.878220
Reading coolontime
Restarting Gas Simulation with array files.
dDeltaStarForm (set): 2.39444e-05, effectively: 8.06294e-07 = 33673.5 yrs, iStarFormRung: 0
SNII feedback: 1.49313e+49 ergs/solar mass
dDeltaSink (set): 0, effectively: 8.06294e-07 = 33673.5 yrs, iSinkRung: 0
Identified 15 sink particles
Initial Domain decomposition ... Sorter: Histograms balanced after 25 iterations
With both export UCX_ZCOPY_THRESH=-1 and +ucx_rndv_thresh=2048, ChaNGa ran 2 big steps (3553 and 3554), but didn't complete within 30 minutes and was therefore killed by the scheduler.
@trquinn was seeing another hang during the load balancing phase but mentioned that setting UCX_IB_RX_MAX_BUFS=32768 helped in getting past that hang on 256 nodes.
@nitbhat, can you please try UCX_IB_RX_MAX_BUFS=32768 without any other settings? If it does not help, a smaller value is worth trying (say 8192).
Okay, I'll try that.
@brminich: I haven't been able to test that setting yet. (Frontera was down for maintenance on Tuesday, and now, for some reason, I'm getting weird errors while launching the MPI job. I'm in conversation with TACC about it.)
I'll test it as soon as I can.
@brminich: I tried different values for UCX_IB_RX_MAX_BUFS from 32k to 2k, and I got the same error.
For the case when I set UCX_IB_RX_MAX_BUFS to 2048, I saw this warning: [1575585217.410997] [c101-081:277019:0] uct_iface.c:139 UCX WARN Memory pool rc_recv_desc is empty
Following the suggestion in #2635, I tried running ChaNGa with the master branch of ucx. With dwf1b running on 2 nodes/4 processes, I get the failure:
[1589667781.056885] [c161-001:28740:0] ib_md.c:329 UCX ERROR ibv_exp_reg_mr(address=0x2ab3f788a0a0, length=18960, access=0xf) failed: Cannot allocate memory
[1589667781.056924] [c161-001:28740:0] ucp_mm.c:131 UCX ERROR failed to register address 0x2ab3f788a0a0 mem_type bit 0x1 length 18948 on md[5]=mlx5_0: Input/output error (md reg_mem_types 0x15)
[1589667781.056927] [c161-001:28740:0] ucp_request.c:275 UCX ERROR failed to register user buffer datatype 0x8 address 0x2ab3f788a0a0 len 18948: Input/output error
[c161-001:28740:0:28805] rndv.c:457 Assertion `status == UCS_OK' failed
/home1/00333/tg456090/src/ucx/src/ucp/tag/rndv.c: [ ucp_rndv_progress_rma_get_zcopy() ]
This run works for ucx releases 1.6.1, 1.7 and 1.8.0. Bisection says the failure starts to happen at ucx git hash 35f6d1189c410aa06a3c8f5fb18805527da91cf7 (although this fails with a seg fault earlier; the change to the registered memory issue happens at 896d76b8762bc5d54f8f74fbc805a25ed404d055). My build script (starting in the ucx directory) is:
./autogen.sh
./contrib/configure-release-mt --prefix=$HOME/ucx/build_master
make clean
make -j16 install
cd ../charm
rm -rf ucx-linux-x86_64-smp
./build ChaNGa ucx-linux-x86_64 smp --with-production --basedir=$HOME/ucx/build_master -j16
cd ../changa
make clean
make -j 16
@trquinn are there any errors in dmesg on the machine which failed to register memory?
@trquinn: How can I get access to the dwf1b benchmark? Is it the same as dwf1.6144 as listed in https://github.com/N-BodyShop/changa/wiki/ChaNGa-Benchmarks?
I tried the h148.cosmo50PLK.6144g3HbwK1BH.param benchmark on 64 nodes with 2 processes/node, with ChaNGa built on a charm that was itself built using ucx master, and I see the same crash as the original one seen back in 2019.
dDeltaStarForm (set): 2.39444e-05, effectively: 8.06294e-07 = 33673.5 yrs, iStarFormRung: 0
SNII feedback: 1.49313e+49 ergs/solar mass
dDeltaSink (set): 0, effectively: 8.06294e-07 = 33673.5 yrs, iSinkRung: 0
Identified 15 sink particles
Initial Domain decomposition ... Sorter: Histograms balanced after 25 iterations.
[1593463314.947122] [c186-092:226510:0] ib_md.c:329 UCX ERROR ibv_exp_reg_mr(address=0x2ac99492d8c0, length=2096, access=0xf) failed: Cannot allocate memory
[1593463314.947795] [c186-092:226510:0] ucp_mm.c:131 UCX ERROR failed to register address 0x2ac99492d8c0 mem_type bit 0x1 length 2096 on md[5]=mlx5_0: Input/output error (md reg_mem_types 0x15)
[1593463314.947799] [c186-092:226510:0] ucp_request.c:275 UCX ERROR failed to register user buffer datatype 0x8 address 0x2ac99492d8c0 len 2096: Input/output error
------------- Processor 3580 Exiting: Called CmiAbort ------------
Reason: [3580] Assertion "!(((uintptr_t)(status_ptr)) >= ((uintptr_t)UCS_ERR_LAST))" failed in file machine.C line 583.

[3580] Stack Traceback:
[3580:0] ChaNGa.smp 0x9ff860 CmiAbortHelper(char const*, char const*, char const*, int, int)
[3580:1] ChaNGa.smp 0x9ff938 CmiGetNonLocal
[3580:2] ChaNGa.smp 0xa0d779 CmiCopyMsg
[3580:3] ChaNGa.smp 0xa045bb
[3580:4] ChaNGa.smp 0xa046cb LrtsAdvanceCommunication(int)
[3580:5] ChaNGa.smp 0x9ff5e0
[3580:6] ChaNGa.smp 0x9ff5f8
[3580:7] ChaNGa.smp 0x9ff663 CommunicationServerThread(int)
[3580:8] ChaNGa.smp 0x9ff56c
[3580:9] ChaNGa.smp 0x9fc39c
[3580:10] libpthread.so.0 0x2ac853994dd5
[3580:11] libc.so.6 0x2ac854b9002d clone
@trquinn: Have you tried running ChaNGa based on ucx master to reproduce the occasional hangs that you saw during load balancing?
I get past the registration error when I run with the nonsmp version. However, the run crashes after step 3553.875.
Step: 3553.875000 Time: 0.286602 Rungs 3 to 4. Gravity Active: 65718, Gas Active: 65703
Domain decomposition ... total 0.49349 seconds.
Skipped DD
Load balancer ... Orb3dLB_notopo: Step 16
numActiveObjects: 12604, numInactiveObjects: 249540
active PROC range: 43 to 3563
[Orb3dLB_notopo] sorting
***************************
Orb3dLB_notopo stats: maxObjLoad 1.318802
Orb3dLB_notopo stats: minWall 0.000735 maxWall 12.320339 avgWall 11.987582 maxWall/avgWall 1.027758
Orb3dLB_notopo stats: minIdle 0.000001 maxIdle 11.885347 avgIdle 10.641879 minIdle/avgIdle 0.000000
Orb3dLB_notopo stats: minPred 0.000000 maxPred 1.318946 avgPred 0.115882 maxPred/avgPred 11.381776
Orb3dLB_notopo stats: minPiece 0.000000 maxPiece 594.000000 avgPiece 73.142857 maxPiece/avgPiece 8.121094
Orb3dLB_notopo stats: minBg 0.000734 maxBg 0.457804 avgBg 0.112647 maxBg/avgBg 4.064063
Orb3dLB_notopo stats: orb migrated 12597 refine migrated 0 objects
took 0.260109 seconds.
Building trees ... took 0.801881 seconds.
Calculating gravity (tree bucket, theta = 0.900000) ... Calculating densities/divv on Actives ... took 0.381756 seconds.
Marking Neighbors ... took 0.529504 seconds.
------------- Processor 257 Exiting: Called CmiAbort ------------
Reason: [257] Assertion "status == UCS_OK" failed in file machine.C line 487.

[257] Stack Traceback:
[257:0] ChaNGa.smp 0x9e5600 CmiAbortHelper(char const*, char const*, char const*, int, int)
[257:1] ChaNGa.smp 0x9e56d8 CmiGetNonLocal
[257:2] ChaNGa.smp 0x9efc2c CmiCopyMsg
[257:3] ChaNGa.smp 0x9ea0b0 UcxTxReqCompleted(void*, ucs_status_t)
[257:4] libucp.so.0 0x2afea291bf82 ucp_proto_am_zcopy_req_complete
[257:5] libuct_ib.so.0 0x2afea2a3bc36 uct_rc_txqp_purge_outstanding
[257:6] libuct_ib.so.0 0x2afea2a5812d uct_dc_mlx5_ep_handle_failure
[257:7] libuct_ib.so.0 0x2afea2a59162
[257:8] libucp.so.0 0x2afea291a63a ucp_worker_progress
[257:9] ChaNGa.smp 0x9ea18e LrtsAdvanceCommunication(int)
[257:10] ChaNGa.smp 0x9e5441
[257:11] ChaNGa.smp 0x9e5811
[257:12] ChaNGa.smp 0x9f0a0a
[257:13] ChaNGa.smp 0x9f174a CcdRaiseCondition
[257:14] ChaNGa.smp 0x9ec55a CsdStillIdle
[257:15] ChaNGa.smp 0x9ec8d9 CsdScheduleForever
[257:16] ChaNGa.smp 0x9ec7e4 CsdScheduler
[257:17] ChaNGa.smp 0x9e5409
[257:18] ChaNGa.smp 0x9e5323 ConverseInit
[257:19] ChaNGa.smp 0x89edd6 charm_main
[257:20] ChaNGa.smp 0x897804 main
[257:21] libc.so.6 0x2afea3e4b495 __libc_start_main
[257:22] ChaNGa.smp 0x6c19af
The assertion failure happens in the send completion callback, and it indicates that one of the sends didn't complete successfully.
483 void UcxTxReqCompleted(void *request, ucs_status_t status)
484 {
485     UcxRequest *req = (UcxRequest*)request;
486
487     CmiEnforce(status == UCS_OK);
488     CmiEnforce(req->msgBuf);
489
490     UCX_LOG(3, "TX req %p completed, free msg %p", req, req->msgBuf);
491     CmiFree(req->msgBuf);
492     UCX_REQUEST_FREE(req);
493 }
I'm guessing that the status object can be queried to understand more about why the send failed.
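For instance, ucs_status_string() converts the status code into readable text. A minimal sketch of a more talkative callback is below; it is not the actual machine.C change, UcxTxReqCompletedVerbose is a made-up name, and the request bookkeeping (freeing msgBuf, releasing the request) from UcxTxReqCompleted is elided.

#include <stdio.h>
#include <stdlib.h>
#include <ucs/type/status.h>

/* Sketch: report why the send completed with an error before aborting,
 * instead of tripping the bare "status == UCS_OK" assertion. */
static void UcxTxReqCompletedVerbose(void *request, ucs_status_t status)
{
    if (status != UCS_OK) {
        /* e.g. prints "Input/output error" when zero-copy registration failed */
        fprintf(stderr, "UCX TX req %p completed with error: %s (%d)\n",
                request, ucs_status_string(status), (int)status);
        abort();  /* the real code would go through CmiAbort()/CmiEnforce() */
    }
    /* ...on success, free req->msgBuf and release the request as above */
}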
@trquinn: How can I get access to the dwf1b benchmark? Is it the same as dwf1.6144 as listed in https://github.com/N-BodyShop/changa/wiki/ChaNGa-Benchmarks?
Correct: that benchmark can be downloaded from google drive.
@trquinn: Have you tried running ChaNGa based on ucx master to reproduce the occasional hangs that you saw during load balancing?
Yes, and I got similar errors as you.
@trquinn I was able to reproduce the memory registration error that you were seeing on 2 nodes/4 processes while running the dwf1b benchmark.
[Orb3dLB_notopo] sorting
***************************
Orb3dLB_notopo stats: maxObjLoad 0.000342
Orb3dLB_notopo stats: minWall 0.002322 maxWall 0.004742 avgWall 0.002872 maxWall/avgWall 1.650815
Orb3dLB_notopo stats: minIdle 0.000000 maxIdle 0.000440 avgIdle 0.000033 minIdle/avgIdle 0.000000
Orb3dLB_notopo stats: minPred 0.000737 maxPred 0.001203 avgPred 0.000920 maxPred/avgPred 1.308291
Orb3dLB_notopo stats: minPiece 7.000000 maxPiece 9.000000 avgPiece 8.000000 maxPiece/avgPiece 1.125000
Orb3dLB_notopo stats: minBg 0.001020 maxBg 0.004558 avgBg 0.001920 maxBg/avgBg 2.374152
Orb3dLB_notopo stats: orb migrated 862 refine migrated 0 objects
Building trees ... took 0.302196 seconds.
Calculating gravity (tree bucket, theta = 0.700000) ... [1593618510.434801] [c191-041:155490:0] ib_md.c:329 UCX ERROR ibv_exp_reg_mr(address=0x2ad52577de20, length=9536, access=0xf) failed: Cannot allocate memory
[1593618510.434888] [c191-041:155490:0] ucp_mm.c:131 UCX ERROR failed to register address 0x2ad52577de20 mem_type bit 0x1 length 9524 on md[5]=mlx5_0: Input/output error (md reg_mem_types 0x15)
[1593618510.434900] [c191-041:155490:0] ucp_request.c:275 UCX ERROR failed to register user buffer datatype 0x8 address 0x2ad52577de20 len 9524: Input/output error
[c191-041:155490:0:155554] rndv.c:523 Assertion `status == UCS_OK' failed
/scratch1/03808/nbhat4/ucx/src/ucp/tag/rndv.c: [ ucp_rndv_progress_rma_get_zcopy() ]
...
520 }
521 return UCS_OK;
522 } else if (!UCS_STATUS_IS_ERR(status)) {
==> 523 /* in case if not all chunks are transmitted - return in_progress
524 * status */
525 return UCS_INPROGRESS;
526 } else {
==== backtrace (tid: 155554) ====
0 0x0000000000052563 ucs_debug_print_backtrace() /scratch1/03808/nbhat4/ucx/src/ucs/debug/debug.c:656
1 0x0000000000035f81 ucp_rndv_progress_rma_get_zcopy() /scratch1/03808/nbhat4/ucx/src/ucp/tag/rndv.c:523
2 0x0000000000036420 ucp_request_try_send() /scratch1/03808/nbhat4/ucx/src/ucp/core/ucp_request.inl:213
3 0x0000000000036420 ucp_request_send() /scratch1/03808/nbhat4/ucx/src/ucp/core/ucp_request.inl:248
4 0x0000000000036420 ucp_rndv_req_send_rma_get() /scratch1/03808/nbhat4/ucx/src/ucp/tag/rndv.c:652
5 0x0000000000037d5e ucp_rndv_matched() /scratch1/03808/nbhat4/ucx/src/ucp/tag/rndv.c:1131
6 0x0000000000038155 ucp_rndv_process_rts() /scratch1/03808/nbhat4/ucx/src/ucp/tag/rndv.c:1185
7 0x0000000000038155 ucp_rndv_process_rts() /scratch1/03808/nbhat4/ucx/src/ucp/tag/rndv.c:1189
8 0x0000000000014715 uct_iface_invoke_am() /scratch1/03808/nbhat4/ucx/src/uct/base/uct_iface.h:635
9 0x0000000000014715 uct_mm_iface_process_recv() /scratch1/03808/nbhat4/ucx/src/uct/sm/mm/base/mm_iface.c:232
10 0x0000000000014715 uct_mm_iface_poll_fifo() /scratch1/03808/nbhat4/ucx/src/uct/sm/mm/base/mm_iface.c:280
11 0x0000000000014715 uct_mm_iface_progress() /scratch1/03808/nbhat4/ucx/src/uct/sm/mm/base/mm_iface.c:333
12 0x000000000002663a ucs_callbackq_dispatch() /scratch1/03808/nbhat4/ucx/src/ucs/datastruct/callbackq.h:211
13 0x000000000002663a uct_worker_progress() /scratch1/03808/nbhat4/ucx/src/uct/api/uct.h:2342
14 0x000000000002663a ucp_worker_progress() /scratch1/03808/nbhat4/ucx/src/ucp/core/ucp_worker.c:2037
15 0x00000000009e571e LrtsAdvanceCommunication() ???:0
16 0x00000000009e0680 AdvanceCommunication() machine.C:0
17 0x00000000009e0698 CommunicationServer() machine.C:0
18 0x00000000009e0703 CommunicationServerThread() ???:0
19 0x00000000009e060c ConverseRunPE() machine.C:0
20 0x00000000009dd43c call_startfn() machine.C:0
21 0x0000000000007dd5 start_thread() pthread_create.c:0
22 0x00000000000fe02d __clone() ???:0
=================================
@yosefe: Looking at the dmesg output, I don't see any errors related to memory registration. I'm attaching the dmesg output from both nodes after the crash occurred.
dmesg_output_1009912_c191-034.txt dmesg_output_1009912_c191-041.txt
@nitbhat, are you running on Frontera? Can you please share the details of running the benchmark on 2 nodes? I'll try to reproduce locally, since I do not have access to Frontera.
@brminich: Yes, I was running that on Frontera.
Sure.
- Build charm (ChaNGa target) using ./build ChaNGa ucx-linux-x86_64 smp --enable-error-checking --suffix=debug --basedir=<path-to-ucx> -j24 -g -O0
- Download ChaNGa from https://github.com/N-BodyShop/changa
- Download utility from https://github.com/N-BodyShop/utility and clone it in the ChaNGa repo's parent directory.
- Export CHARM_DIR to point to your charm build arch (export CHARM_DIR=/work/03808/nbhat4/frontera/charm_3/ucx-linux-x86_64-smp-debug)
- Inside the ChaNGa directory, run ./configure (make sure that the Charm path printed at the end of configure points to the correct charm directory).
- Build by running make -j<num procs>. This should create ChaNGa.smp.
- Get the benchmarking files (dwf1.6144.param and dwf1.6144.01472) from https://github.com/N-BodyShop/changa/wiki/ChaNGa-Benchmarks.
- On Frontera, the bug shows up when we run on 2 nodes (with 2 processes on each node). Each Frontera node has 56 cores, so you can try something matching that configuration on your machine. The run script used on Frontera is as follows:
#!/bin/bash
#SBATCH -J changa_2nodes_4procs
#SBATCH -p normal
#SBATCH -t 00:30:00
#SBATCH -A ASC20007
#SBATCH -N 2
#SBATCH -n 4
#SBATCH --ntasks-per-node=2
#SBATCH -o /scratch1/03808/nbhat4/changa/results/changa-output-2nodes-ucx-prod-2procs_6.10.1-master-%A.out
cd /scratch1/03808/nbhat4/changa
ibrun ./ChaNGa.smp +ppn 27 +setcpuaffinity +commap 0,1 +pemap 2-54:2,3-55:2 dwf1.6144.param
Let me know if you have any questions. (and if you are/aren't able to reproduce the crash).
@nitbhat, thanks for the instructions. Is it 100% reproducible on Frontera? How long does it typically take to fail? I managed to run it on a local system, with 28 threads per node, but it seems to run for quite a long time (until my reservation ended). Is it reproducible in non-SMP mode?
@brminich
Yes, it crashes every time I run on Frontera. It takes about 14 mins to crash.
How many nodes did you run it on? 4 nodes?
On trying with non-SMP, it seems like there is an issue with memory, since I see this error:
Initial Domain decomposition ... total 0.685047 seconds.
Initial load balancing ... Orb3dLB_notopo: Step 0
numActiveObjects: 896, numInactiveObjects: 0
active PROC range: 0 to 111
Migrating all: numActiveObjects: 896, numInactiveObjects: 0
[Orb3dLB_notopo] sorting
***************************
Orb3dLB_notopo stats: maxObjLoad 0.000079
Orb3dLB_notopo stats: minWall 0.000279 maxWall 0.003832 avgWall 0.000472 maxWall/avgWall 8.112413
Orb3dLB_notopo stats: minIdle 0.000000 maxIdle 0.000242 avgIdle 0.000032 minIdle/avgIdle 0.000000
Orb3dLB_notopo stats: minPred 0.000017 maxPred 0.000128 avgPred 0.000051 maxPred/avgPred 2.523048
Orb3dLB_notopo stats: minPiece 7.000000 maxPiece 9.000000 avgPiece 8.000000 maxPiece/avgPiece 1.125000
Orb3dLB_notopo stats: minBg 0.000237 maxBg 0.003799 avgBg 0.000389 maxBg/avgBg 9.754210
Orb3dLB_notopo stats: orb migrated 893 refine migrated 0 objects
Building trees ... took 0.12406 seconds.
Calculating gravity (tree bucket, theta = 0.700000) ... ------------- Processor 70 Exiting: Called CmiAbort ------------
Reason: Unhandled C++ exception in user code.

------------- Processor 87 Exiting: Called CmiAbort ------------
Reason: Unhandled C++ exception in user code.

[87] Stack Traceback:
[87:0] ChaNGa_ucx_nonsmp 0x7f64ae _Z14CmiAbortHelperPKcS0_S0_ii
[87:1] ChaNGa_ucx_nonsmp 0x7f65c6
[87:2] ChaNGa_ucx_nonsmp 0x701c67
[87:3] libstdc++.so.6 0x2b39bdf7f106
[87:4] libstdc++.so.6 0x2b39bdf7f151
[87:5] libstdc++.so.6 0x2b39bdf7f385
[87:6] libstdc++.so.6 0x2b39bdf73301
------------- Processor 78 Exiting: Called CmiAbort ------------
Reason: Could not malloc()--are we out of memory? (used: 2074.969MB)
[78] Stack Traceback:
[78:0] ChaNGa_ucx_nonsmp 0x7f64ae CmiAbortHelper(char const*, char const*, char const*, int, int)
[78:1] ChaNGa_ucx_nonsmp 0x7f65c6
I was running on 2 nodes with 4 processes. Moving to thor; maybe I can catch it there. Can it be that with SMP, lack of memory is also an issue? Is there any memory consumption estimate for that example?
Okay, I think you can run it on 4 nodes (with 28 cores each) to better suit the 2 Frontera nodes (with 56 cores each).
Yes, in some runs I saw similar "Could not malloc()--are we out of memory?" errors from a 2-node SMP run as well. In the non-SMP case, it's always that error; in the SMP run, I sometimes see that error and other times the error related to memory registration.
However, running it on 2 nodes with the MPI layer (both smp and nonsmp) doesn't crash, but takes longer to complete.
Interestingly, when I try increasing the number of nodes (to 4 and 8), I still see "out of memory" errors for UCX. (And MPI runs successfully for those cases as well). I'll try to determine the exact memory usage for UCX runs on 2/4/8 nodes.
I checked on expected memory use: when running on a single SMP process, this benchmark uses 16.3GB. Using netlrts with 4 SMP processes, the benchmark uses 5.3GB/process (i.e., ~22GB total).
@nitbhat, maybe we can have joint debug session on Frontera?
They just upgraded the OFED libraries (and the system-installed UCX) on Frontera. We should see if that makes a difference first. I'm in meetings until 14:30 PDT all next week.
@brminich Yes, let's schedule a debugging session sometime next week if that works for you?
@trquinn Okay, I can check if that is making a difference.
@nitbhat, next week is ok
Note that a similar issue is reported in the UCX repository: https://github.com/openucx/ucx/issues/5291
I've done a little more investigation on frontera, using the master branches of ucx and charm (as of Aug. 18), and the dwf1b benchmark, running 8 processors on 4 nodes.
I instrumented uct_ib_reg_mr() to see how much memory was being registered, with the following patch:
diff --git a/src/uct/ib/base/ib_md.c b/src/uct/ib/base/ib_md.c
index 08443d0b0..592fe839f 100644
--- a/src/uct/ib/base/ib_md.c
+++ b/src/uct/ib/base/ib_md.c
@@ -516,6 +516,9 @@ static ucs_status_t uct_ib_md_reg_mr(uct_ib_md_t *md, void *address,
silent);
}
+static size_t ib_total_reg = 0;
+static size_t ib_total_reg_segs = 0;
+
ucs_status_t uct_ib_reg_mr(struct ibv_pd *pd, void *addr, size_t length,
uint64_t access_flags, struct ibv_mr **mr_p,
int silent)
@@ -532,7 +535,11 @@ ucs_status_t uct_ib_reg_mr(struct ibv_pd *pd, void *addr, size_t length,
#else
mr = UCS_PROFILE_CALL(ibv_reg_mr, pd, addr, length, access_flags);
#endif
+ ib_total_reg += length;
+ ib_total_reg_segs++;
+ fprintf(stderr, "ibv_reg %ld in %ld\n", ib_total_reg, ib_total_reg_segs);
if (mr == NULL) {
+ fprintf(stderr, "ibv_reg failed errno %d\n", errno);
uct_ib_md_print_mem_reg_err_msg(addr, length, access_flags,
errno, silent);
return UCS_ERR_IO_ERROR;
@@ -550,6 +557,9 @@ ucs_status_t uct_ib_dereg_mr(struct ibv_mr *mr)
return UCS_OK;
}
+ ib_total_reg -= mr->length;
+ ib_total_reg_segs--;
+ fprintf(stderr, "ibv_dereg %ld in %ld\n", ib_total_reg, ib_total_reg_segs);
ret = UCS_PROFILE_CALL(ibv_dereg_mr, mr);
if (ret != 0) {
ucs_error("ibv_dereg_mr() failed: %m");
and the output at the time of crash typically looks like this:
ibv_reg 4850063904 in 41809
ibv_reg 3614325824 in 43902
ibv_reg 4810457792 in 56123
ibv_reg 4850074352 in 41810
ibv_reg 4059565248 in 43322
ibv_dereg 5032840000 in 45050
ibv_reg 4810486480 in 56124
ibv_reg 4850089664 in 41811
ibv_reg 5032978160 in 45051
ibv_reg failed errno 12
So: ucx is registering a large number of memory segments. The actual amount of memory is large but (I think) not excessive: the total memory used by each process is about 32GB. (Again, 2 procs/node.) But I think the total number of memory segments seems very large: each node has of order 100,000 memory segments registered with the IB interface. I'm wondering if there is some fragmentation in the ucx memory pool.
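One way to test the fragmentation hypothesis would be to extend the instrumentation above with a histogram of registration sizes, to see whether those ~100,000 segments are mostly small buffers. A sketch of such a helper is below; reg_histo_add/reg_histo_dump are my own illustration (not part of UCX) and would be called next to the existing ib_total_reg accounting, with the dump triggered when ibv_reg_mr fails.

#include <stdio.h>
#include <stddef.h>

#define REG_HISTO_BUCKETS 40  /* power-of-two size buckets, up to ~2^39 bytes */
static size_t reg_histo[REG_HISTO_BUCKETS];

/* Record one registration of `length` bytes in bucket floor(log2(length)). */
static void reg_histo_add(size_t length)
{
    int b = 0;
    while (length > 1 && b < REG_HISTO_BUCKETS - 1) {
        length >>= 1;
        b++;
    }
    reg_histo[b]++;
}

/* Print the non-empty buckets, e.g. when registration fails with ENOMEM. */
static void reg_histo_dump(void)
{
    for (int b = 0; b < REG_HISTO_BUCKETS; b++) {
        if (reg_histo[b] != 0) {
            fprintf(stderr, "reg size ~2^%d bytes: %zu segments\n", b, reg_histo[b]);
        }
    }
}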
Trying to reduce the number of receive buffers with export UCX_IB_RX_MAX_BUFS=25000 causes a hang with:
uct_iface.c:152 UCX WARN Memory pool rc_recv_desc is empty
Any chance this will be fixed in 6.11?
@trquinn Does the issue still occur with UCX 1.9.0?
I just tried with UCX v1.9.0 and Charm v6.11.0-beta. The issue still occurs.
@brminich: Do you have any insights as to what might be happening here? (Or the linked issue on the UCX repo openucx/ucx#5291)