Caleb Winston
> ```
> Start write
> In Write on worker 1 on batch 1
> slurmstepd: error: *** JOB 862 ON compute-dy-t3large-1 CANCELLED AT 2021-08-28T22:12:51 ***
> slurmstepd: error: ***...
> ```
Related issue:
```
Going to write to efs/job_2021-11-08-015625cb28a658f39f12ed0de8bedbfc341a65_val_19/part2_nrows=5453429.arrow
Going to write to efs/job_2021-11-08-015625cb28a658f39f12ed0de8bedbfc341a65_val_19/part1_nrows=5453429.arrow
srun: error: compute-dy-t3large-2: task 1: Exited with exit code 1
slurmstepd: error: compute-dy-t3large-2 [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix:...
```
This is typically because the job ran out of memory.
Yes but how do you print stuff out? Technically you could write a partition annotation but that would be a little verbose. It would be nicer if we had `Banyan.print`,...
This happens rarely and has been hard to reproduce.
This could be resolved by retrying calls to idempotent backend functionality.
In another instance, this occurred because the job was somehow being destroyed at the same time as a call to evaluate:
```
Basic data analytics on a small dataset: Error...
```
This has only come up when testing, so maybe it happens when an error occurs but the test then proceeds past the failed assertion and even though the job has...
Both job destruction (in https://github.com/banyan-team/banyan-julia/issues/11) and calls to `compute` are now idempotent, so we should be able to simply implement retry logic to fix this.
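For reference, here is a minimal sketch of what that retry logic might look like using Julia's built-in `retry` and `ExponentialBackOff`. The function `flaky_backend_call` is a hypothetical stand-in for an idempotent backend call, not part of Banyan's API, and the delays and the choice to retry only on `Base.IOError` are assumptions for illustration.

```julia
# Counts how many times the stand-in backend call has been attempted.
attempts = Ref(0)

# Hypothetical stand-in for an idempotent backend call (NOT Banyan's API).
# It throws an IOError on the first two attempts and succeeds on the third.
function flaky_backend_call(job_id)
    attempts[] += 1
    if attempts[] < 3
        throw(Base.IOError("read: connection timed out", -110))
    end
    return "ok: $job_id"
end

# Wrap the call so that it is retried with exponential backoff.
# `n = 4` allows up to 4 retries after the first attempt; the short delays
# are only to keep this example fast.
# `check` restricts retries to IOError; any other exception is rethrown.
call_with_retry = retry(
    flaky_backend_call;
    delays = ExponentialBackOff(n = 4, first_delay = 0.01, max_delay = 0.1),
    check = (state, err) -> err isa Base.IOError,
)

result = call_with_retry("job-1")
println(result, " after ", attempts[], " attempts")
```

Retrying like this is only safe because the wrapped call is idempotent: repeating a request that actually succeeded on the server side should not change the outcome.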
Another time this failed with the following:
```
Basic data analytics on a small dataset: Error During Test at /home/calebwin/Projects/banyan-julia/BanyanDataFrames/test/test_small_dataset.jl:63
  Test threw exception
  Expression: nrow(iris_filtered) == 306
  IOError(Base.IOError("read: connection timed...
```