
Betzy problem when running on 2 nodes with noresm2.5

Open mvertens opened this issue 1 year ago • 7 comments

I have now encountered the same issue when running I compsets and F compsets on 2 nodes. Errors like the following appear in the cesm.log file

134: [b3296.betzy.sigma2.no:1104993] pml_ucx.c:911 Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
134: [b3296:1104993] *** An error occurred in MPI_Send
134: [b3296:1104993] *** reported by process [23299297509376,134]
134: [b3296:1104993] *** on communicator MPI COMMUNICATOR 21 CREATE FROM 20
134: [b3296:1104993] *** MPI_ERR_OTHER: known error not in list
134: [b3296:1104993] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
134: [b3296:1104993] ***    and potentially your MPI job)

The solution seems to be to increase the number of nodes to 4 - and then everything works. I am writing to sigma2 to raise this issue as well.

mvertens avatar Nov 21 '24 18:11 mvertens

@mvertens - Hi, Betzy has a minimum job size of 4 nodes in the "normal" queue. "devel" jobs can run on 1-4 nodes for a short time (up to 60 minutes). It seems this requirement is there to encourage moving smaller jobs to Fram, so that they do not fill up the queue on Betzy.

See job types description here: https://documentation.sigma2.no/jobs/job_types/betzy_job_types.html
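
For reference, a small devel job on Betzy would be requested with a Slurm header roughly like the sketch below. This is only a minimal illustration: the account name, task count, and walltime are placeholders, and the exact directives should be checked against the Sigma2 documentation linked above.

```bash
#!/bin/bash
# Minimal sketch of a 2-node devel-queue job on Betzy (illustrative values only).
#SBATCH --account=nnXXXXk       # project account (placeholder)
#SBATCH --qos=devel             # devel jobs: 1-4 nodes, short walltime
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128   # Betzy compute nodes have 128 cores
#SBATCH --time=00:30:00         # devel walltime limit is 60 minutes

srun ./cesm.exe                 # launch the model executable (illustrative)
```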

TomasTorsvik avatar Nov 22 '24 06:11 TomasTorsvik

The NorESM configuration for Betzy currently only sends jobs with 4 or more nodes to the normal queue. The devel queue is marked with a minimum of 1 and a maximum of 4 nodes and the preproc queue has no restrictions. See <ccs_config>/machines/betzy/config_batch.xml.

@mvertens, which queue was used for your job? Even if it was devel (which I think should have worked), I think we should restrict preproc to 1 node as that is the Betzy limit.
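
For illustration, the queue entries in <ccs_config>/machines/betzy/config_batch.xml follow roughly the pattern sketched below, where nodemin/nodemax bound the job size per queue. The attribute names follow what I believe is the usual CIME config_batch.xml schema, and the walltime and node-count values here are placeholders rather than the actual Betzy settings; the preproc line shows the proposed 1-node restriction.

```xml
<!-- Sketch only: values are placeholders, not the real Betzy configuration. -->
<queue walltimemax="04:00:00" nodemin="4" nodemax="1344" default="true">normal</queue>
<queue walltimemax="01:00:00" nodemin="1" nodemax="4">devel</queue>
<queue walltimemax="01:00:00" nodemin="1" nodemax="1">preproc</queue>
```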

gold2718 avatar Nov 22 '24 08:11 gold2718

Thanks Mariana. I have experienced the same error with the MakingWave code during the last week. Wave-ocean-ice with data atmosphere was working fine on 2 nodes (devel queue) from the start of November. But suddenly, around November 13-15, something changed so that none of these compsets (or perturbations thereof) run with the same PE layout. I will try increasing the number of nodes, although it is more costly in the debug phase.

JensBDebernard avatar Nov 22 '24 08:11 JensBDebernard

@TomasTorsvik @gold2718 @JensBDebernard - I have double checked and the queue is devel. This also worked for me up until around the 15th and suddenly stopped working. I have raised an issue with sigma2.

mvertens avatar Nov 22 '24 08:11 mvertens

Should we take the hint and set up a test suite of smaller tests on Fram? It would just mean firing off and then checking two test runs instead of one.

gold2718 avatar Nov 22 '24 08:11 gold2718

I got a response from sigma2 that they have escalated this ticket to their second line support, and they'll follow up shortly.

mvertens avatar Nov 22 '24 08:11 mvertens

Our quota on Fram is quite limited, only 150K CPU hours on nn2345k. We could ask for an increased quota, but it would probably be "non-prioritized" for the current allocation period.

TomasTorsvik avatar Nov 22 '24 09:11 TomasTorsvik