
Resource temporarily unavailable

Open zhuwqvip opened this issue 3 years ago • 2 comments

When I run the HPL test program using Slurm + Flux, an error is reported when the job exits after two hours. I'm sure the resources are sufficient, because I submitted HPL directly with Slurm and it worked.

2022-03-10T07:52:51.742839Z broker.err[0]: mcast error to child rank 1: Resource temporarily unavailable

This is my Slurm script

#!/bin/bash

#SBATCH -J fluxjob
#SBATCH -p flux
#SBATCH -N 17
#SBATCH --tasks-per-node=32
#SBATCH --mem=126536
#SBATCH -o logo.out
#SBATCH -e loge.out

srun --mpi=pmi2 flux start /public/home/slurmtest/zhuwq/software/hpl-2.3/bin/Linux_PII_CBLAS/flux_batch.sh

The flux_batch.sh script is as follows

#!/bin/sh

flux mini run -N 544 -n 544 /public/home/slurmtest/zhuwq/software/hpl-2.3/bin/Linux_PII_CBLAS/xhpl

zhuwqvip, Mar 10 '22 10:03

Is Slurm enforcing a 2 hour time limit? Perhaps we are handling a signal poorly?

The mcast error is not fatal, but it may indicate that rank 0 lost contact with rank 1 (although the expected error is EHOSTUNREACH which would not have been logged). Any other errors in the logs that would give an idea of why flux is exiting?

You could try setting a lower time limit in slurm and see if the problem can be reproduced.

Also to capture more flux logs you could add -o,-Slog-stderr-level=7 to the flux start command.
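As a rough sketch, the two suggestions above could be applied to the original Slurm script like this. The 10 minute value passed to `-t` is only an illustrative assumption; any limit well under two hours would serve the same purpose. Everything else is taken from the script as posted.

```shell
#!/bin/bash

#SBATCH -J fluxjob
#SBATCH -p flux
#SBATCH -N 17
#SBATCH --tasks-per-node=32
#SBATCH --mem=126536
#SBATCH -o logo.out
#SBATCH -e loge.out
# Assumed value: a short wall-clock limit (hours:minutes:seconds) to check
# whether the failure follows the Slurm time limit.
#SBATCH -t 00:10:00

# -o,-Slog-stderr-level=7 passes a broker option through flux start so that
# more Flux log messages are written to stderr (loge.out here).
srun --mpi=pmi2 flux start -o,-Slog-stderr-level=7 \
    /public/home/slurmtest/zhuwq/software/hpl-2.3/bin/Linux_PII_CBLAS/flux_batch.sh
```

If the job now fails at the shorter limit with the same mcast error, that would point at the time limit and signal handling rather than at resources.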

garlick, Mar 10 '22 13:03

Now I'm wondering if that "resource temporarily unavailable" is from an older libzmq. Please post the output of flux version when you have a chance.

garlick, Mar 10 '22 14:03

Closing old issue. Please reopen with requested information if this is still a problem!

grondo, Dec 10 '22 17:12