banyan-julia

Black Scholes test fails at large sizes

Open calebwin opened this issue 2 years ago • 0 comments

This could be happening for several reasons:

  • Running out of memory because GC.gc() calls are not placed strategically (not really an issue any more)
  • Running out of memory because there is not enough free memory initially (may not be an issue)
  • Running out of disk space because of EBS limitations or because of some unknown extra usage (the most common issue)
  • Job occasionally failing, possibly because of printing (almost definitely not an issue)
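
To help tell the memory-related causes above apart, one option is to log free memory around each GC.gc() call. The following is only a minimal sketch: `free_gib` and `gc_and_report` are hypothetical helper names, not part of Banyan, and the label string is a placeholder.

```julia
# Hypothetical helper (not part of Banyan): free system memory in GiB.
free_gib() = Sys.free_memory() / 2^30

# Hypothetical helper: run a collection and report free memory before and after.
function gc_and_report(label::AbstractString)
    before = free_gib()
    GC.gc()
    after = free_gib()
    println(label, ": free memory ",
            round(before; digits=2), " GiB -> ",
            round(after; digits=2), " GiB")
    return (before, after)
end

# Example call; in a real job this would go between compute stages.
before, after = gc_and_report("after compute stage")
```

If free memory is already low before the first stage, that points toward the second cause (not enough free memory initially) rather than the placement of GC.gc() calls.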

(The two issues below might be caused by https://github.com/open-mpi/ompi/issues/6014, so we may need a newer version of Open MPI.)

  • Job occasionally failing because of:
slurmstepd: error: *** JOB 3737 ON compute-dy-t3large-2 CANCELLED AT 2021-08-03T16:28:28 ***
slurmstepd: error: *** STEP 3737.0 ON compute-dy-t3large-2 CANCELLED AT 2021-08-03T16:28:28 ***

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
epoll_wait at /lib64/libc.so.6 (unknown line)

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
mca_btl_vader_fbox_read_header at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_fbox.h:72 [inlined]
mca_btl_vader_check_fboxes at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_fbox.h:195 [inlined]
mca_btl_vader_component_progress at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_component.c:765
  • Job failing because of:
srun: error: compute-dy-t32xlarge-1: task 0: Killed
slurmstepd: error: compute-dy-t32xlarge-1 [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
slurmstepd: error: *** STEP 112.0 ON compute-dy-t32xlarge-1 CANCELLED AT 2021-06-25T13:33:56 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: compute-dy-t32xlarge-1: task 1: Killed
srun: error: compute-dy-t32xlarge-1: tasks 2-7: Killed
  • The scheduler tries to materialize the result of the Black Scholes model to disk unnecessarily. This is likely because the finalizers of these values are not getting garbage collected on the gc() call in the final compute. We may need to set these variables to nothing first or call gc(true).
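
The suggestion in the last bullet (set the variables to nothing, then collect) can be illustrated with a small standalone sketch. `Handle`, `make_handle`, and `finalized` are hypothetical stand-ins for a Banyan value and its finalizer, not Banyan's actual types. Julia does not guarantee when a finalizer runs, so the second collection is expected, but not certain, to trigger it.

```julia
# Hypothetical stand-in for a value whose finalizer tells the scheduler
# that the value is no longer needed.
mutable struct Handle
    id::Int
end

# Records the ids of handles whose finalizers have run.
const finalized = Int[]

function make_handle(id::Int)
    h = Handle(id)
    finalizer(x -> push!(finalized, x.id), h)
    return h
end

result = make_handle(1)   # the global binding keeps the value reachable

GC.gc(true)
n_before = length(finalized)   # still reachable, so the finalizer has not run

result = nothing          # drop the last reference first
GC.gc(true)               # now the value is eligible for collection

println("finalized before clearing: ", n_before)
println("finalized after clearing:  ", length(finalized))
```

As long as a variable still refers to the value, no gc() call can run its finalizer, which is consistent with the behavior described above.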

calebwin, Aug 04 '21 02:08