
Job sometimes crashes with Open-MPI-related message

Open calebwin opened this issue 2 years ago • 7 comments

Here is the output (when the job is run with return_logs=true):

slurmstepd: error: *** JOB 3737 ON compute-dy-t3large-2 CANCELLED AT 2021-08-03T16:28:28 ***
slurmstepd: error: *** STEP 3737.0 ON compute-dy-t3large-2 CANCELLED AT 2021-08-03T16:28:28 ***

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
epoll_wait at /lib64/libc.so.6 (unknown line)

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
mca_btl_vader_fbox_read_header at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_fbox.h:72 [inlined]
mca_btl_vader_check_fboxes at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_fbox.h:195 [inlined]
mca_btl_vader_component_progress at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_component.c:765

@cailinw has also experienced this. The job hangs for about 10 minutes after executing some, but not all, of its code, and then prints the message above.
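For triage, it may help to pull the Slurm cancellation lines and the signal lines out of the returned logs. Seeing them side by side makes it easier to tell whether the Open MPI frames belong to a crash or are just where the process was when it received signal 15. The following is a minimal Python sketch, not part of banyan-julia. The function name `summarize_log` and the regular expressions are assumptions, written against the exact line shapes shown in the output above.

```python
import re

# Hypothetical helper, not part of banyan-julia. The patterns below are
# assumptions based on the line formats in the log output quoted above.
CANCEL_RE = re.compile(
    r"slurmstepd: error: \*\*\* (JOB|STEP) (\S+) ON (\S+) CANCELLED AT (\S+) \*\*\*"
)
SIGNAL_RE = re.compile(r"signal \((\d+)\): (\w+)")

SAMPLE_LOG = """\
slurmstepd: error: *** JOB 3737 ON compute-dy-t3large-2 CANCELLED AT 2021-08-03T16:28:28 ***
slurmstepd: error: *** STEP 3737.0 ON compute-dy-t3large-2 CANCELLED AT 2021-08-03T16:28:28 ***

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
epoll_wait at /lib64/libc.so.6 (unknown line)

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
"""


def summarize_log(text):
    """Collect Slurm cancellation records and signal reports from a log."""
    # Each cancellation is a tuple: (kind, id, node, timestamp).
    cancellations = [m.groups() for m in CANCEL_RE.finditer(text)]
    # Each signal is a tuple: (number, name).
    signals = [(int(num), name) for num, name in SIGNAL_RE.findall(text)]
    return {
        "cancellations": cancellations,
        "signals": signals,
        # True when the log has at least one Slurm cancellation and at
        # least one SIGTERM (15) report.
        "sigterm_with_cancel": bool(cancellations)
        and any(num == 15 for num, _ in signals),
    }


if __name__ == "__main__":
    summary = summarize_log(SAMPLE_LOG)
    for kind, ident, node, when in summary["cancellations"]:
        print(f"{kind} {ident} cancelled on {node} at {when}")
    print("signals:", summary["signals"])
    print("SIGTERM alongside cancellation:", summary["sigterm_with_cancel"])
```

If `sigterm_with_cancel` is true, one possible reading is that the processes were terminated when the job was cancelled and did not fail inside Open MPI itself. This is only a hypothesis, and it does not explain the hang that precedes the cancellation.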

calebwin, Aug 04 '21 02:08