
Numerous run time issues on Betzy login3

Open mvertens opened this issue 1 year ago • 0 comments

This is an initial placeholder for the numerous issues that have occurred on Betzy as part of the OS upgrade. Since only login3 is currently available, all of them have occurred there.

From @mvertens: All of this uses the noresm2_5_alpha06 code base that was created just last week. I encountered two separate errors, both of which I reported to Sigma2.

  1. the UCX error that led to a timeout.

==== backtrace (tid: 40171) ====
0 0x000000000005e810 uct_ud_ep_deferred_timeout_handler() /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/ib/ud/base/ud_ep.c:278
1 0x000000000004fd37 ucs_callbackq_slow_proxy() /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucs/datastruct/callbackq.c:404
2 0x000000000004881a ucs_callbackq_dispatch() /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc

There seems to be an outstanding upstream issue on this (https://github.com/openucx/ucx/issues/5159). Sigma2 suggests moving to the intel/2023a tool chain (@mvdebolskiy is currently working on this), but it may be that Sigma2 needs to upgrade their openucx.

  2. In a totally different experimental configuration, I obtained the following error:

[LOG_CAT_MLB] Registration of 0 network context failed. Don't use HCOLL
[LOG_CAT_MLB] Failed to grow mlb dynamic manager
[LOG_CAT_MLB] Payload allocation failed
[LOG_CAT_BASESMUMA] Failed to shmget with IPC_PRIVATE, size 20971520, IPC_CREAT; errno 28:No space left on device
[LOG_CAT_MLB] Registration of 0 network context failed. Don't use HCOLL
[LOG_CAT_MLB] Failed to grow mlb dynamic manager

In this case the solution was to set the environment variable OMPI_MCA_coll_hcoll_enable to 0. Sigma2 also has a fix for (2); it requires the 2023a tool chain, which Matvey is working on.
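For anyone launching jobs from a script, the workaround for (2) could be applied roughly as follows. This is only a sketch: the helper name `env_without_hcoll` is made up for illustration, and it assumes the MPI in use is Open MPI, which reads MCA parameters from environment variables prefixed with `OMPI_MCA_`. How the variable is best set in the actual NorESM machine configuration is not covered here.

```python
import os


def env_without_hcoll(base_env=None):
    """Return a copy of an environment with Open MPI's hcoll collectives disabled.

    `base_env` defaults to the current process environment. The input mapping
    is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Open MPI picks up MCA parameters from OMPI_MCA_<name> variables;
    # "0" turns the hcoll collective component off, as described in the issue.
    env["OMPI_MCA_coll_hcoll_enable"] = "0"
    return env


if __name__ == "__main__":
    env = env_without_hcoll()
    print("OMPI_MCA_coll_hcoll_enable =", env["OMPI_MCA_coll_hcoll_enable"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the job submission or launch command, so the setting only affects that child process.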

I am not sure that updating to the 2023a tool chain will fix (1). I think we should try the new tool chain once @mvdebolskiy is ready with the update and see if (1) occurs again.
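To check whether either error shows up again after the tool chain update, a run log could be scanned for the two signatures quoted above. The following is a minimal sketch, not an existing NorESM or CIME tool: the labels `ucx_ud_timeout` and `hcoll_failure` and the function name `classify_log` are hypothetical, and the match strings are simply taken from the output pasted in this issue.

```python
# Substrings taken from the error output quoted in this issue.
# The labels on the left are illustrative names, not official error codes.
SIGNATURES = {
    "ucx_ud_timeout": "uct_ud_ep_deferred_timeout_handler",
    "hcoll_failure": "[LOG_CAT_MLB]",
}


def classify_log(text):
    """Return a sorted list of labels whose signature appears in the log text."""
    found = set()
    for line in text.splitlines():
        for label, needle in SIGNATURES.items():
            if needle in line:
                found.add(label)
    return sorted(found)


if __name__ == "__main__":
    sample = (
        "0 0x000000000005e810 uct_ud_ep_deferred_timeout_handler()\n"
        "[LOG_CAT_MLB] Failed to grow mlb dynamic manager\n"
    )
    print(classify_log(sample))
```

An empty result only means that neither of these two strings was found; it does not rule out other failures.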

mvertens, Oct 09 '24 19:10