flux-core
Resource temporarily unavailable
When I run the HPL test program using Slurm + Flux, the job exits after two hours and reports the error below. I'm sure the resources are sufficient, because the same HPL run works when I submit it directly with Slurm.
2022-03-10T07:52:51.742839Z broker.err[0]: mcast error to child rank 1: Resource temporarily unavailable
This is my Slurm script:
#!/bin/bash
#SBATCH -J fluxjob
#SBATCH -p flux
#SBATCH -N 17
#SBATCH --tasks-per-node=32
#SBATCH --mem=126536
#SBATCH -o logo.out
#SBATCH -e loge.out
srun --mpi=pmi2 flux start /public/home/slurmtest/zhuwq/software/hpl-2.3/bin/Linux_PII_CBLAS/flux_batch.sh
The flux_batch.sh script is as follows:
#!/bin/sh
flux mini run -N 544 -n 544 /public/home/slurmtest/zhuwq/software/hpl-2.3/bin/Linux_PII_CBLAS/xhpl
Is Slurm enforcing a 2-hour time limit? Perhaps we are handling a signal poorly.
The mcast error is not fatal, but it may indicate that rank 0 lost contact with rank 1 (although the expected error in that case is EHOSTUNREACH, which would not have been logged). Are there any other errors in the logs that would give an idea of why Flux is exiting?
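As a rough aid for scanning the job's stderr file for other errors, a small filter along these lines might help. It is only a sketch: the function name flux_errors is made up for illustration, and the pattern assumes log lines shaped like the one quoted above (a component name, a syslog-style severity, and a rank in brackets).

```shell
# Hypothetical helper: print log lines at severity err or worse.
# Assumes lines shaped like "<timestamp> broker.err[0]: message".
# Reads the files given as arguments, or stdin if none are given.
flux_errors() {
    grep -E '[A-Za-z0-9_-]+\.(emerg|alert|crit|err)\[[0-9]+\]:' "$@"
}
```

With the script above, it would be run as flux_errors loge.out, since loge.out is the file named by #SBATCH -e.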
You could try setting a lower time limit in Slurm and see whether the problem can be reproduced.
Also, to capture more Flux logs, you could add -o,-Slog-stderr-level=7 to the flux start command.
Now I'm wondering if that "Resource temporarily unavailable" is coming from an older libzmq. Please post the output of flux version when you have a chance.
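One way to try this, sketched with sbatch's -t/--time option. The 10-minute value is an arbitrary assumption chosen only to make the test quick, and the partition name is taken from the script above.

```shell
# Added to the other #SBATCH directives in the batch script:
# request a 10-minute wall-clock limit (assumed value; pick any short limit).
#SBATCH -t 00:10:00

# Run from a login node: print the partition name and its
# configured maximum time limit.
sinfo -p flux -o "%P %l"
```

If the job then exits after about ten minutes with the same mcast error, the time limit is a likely trigger.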
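Applied to the srun line from the Slurm script above, the suggestion would look roughly like this. The path is copied from the original script; nothing else is changed.

```shell
# -o passes options through to the Flux broker; -Slog-stderr-level=7
# sets the log-stderr-level broker attribute to 7 (debug), so debug-level
# messages are also written to stderr (loge.out in this job).
srun --mpi=pmi2 flux start -o,-Slog-stderr-level=7 /public/home/slurmtest/zhuwq/software/hpl-2.3/bin/Linux_PII_CBLAS/flux_batch.sh
```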
Closing old issue. Please reopen with requested information if this is still a problem!