
Don't destroy the job unless the executor crashes

Open calebwin opened this issue 3 years ago • 0 comments

We should really only end a running job if the program crashes on the executor or the user explicitly calls destroy_job.

When scheduling fails

On a call to a writing function or to collect, the recorded lazy computation is scheduled and executed. If scheduling fails, we currently destroy the job. If you're using Banyan Julia from a notebook, this is undesirable because you then have to restart the job (which can take 1-2 minutes) just because a single cell failed. Instead, a call to a writing function or to collect should commit its changes to global state only on success and roll them back on failure.
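
A minimal sketch of the rollback idea, not Banyan's actual implementation. `ClientState`, `schedule`, `with_rollback`, and `collect_results!` are all hypothetical names invented for illustration; the only point is that the state mutated during a failed call is restored from a snapshot before the error reaches the caller, and the job is left running.

```julia
# Hypothetical stand-in for the client-side global state that records
# lazy computation. Not part of Banyan's API.
mutable struct ClientState
    pending_tasks::Vector{String}  # recorded but not yet executed
    job_running::Bool
end

# Hypothetical scheduler: fails if any task is named "bad",
# otherwise returns the number of tasks it scheduled.
function schedule(tasks::Vector{String})
    any(==("bad"), tasks) && error("scheduling failed")
    return length(tasks)
end

# Run `f` against `state`. If `f` throws, restore the snapshot taken
# beforehand and rethrow, so the caller sees the error but the state
# looks as if the call never happened.
function with_rollback(f, state::ClientState)
    snapshot = deepcopy(state)
    try
        return f(state)
    catch
        state.pending_tasks = snapshot.pending_tasks
        state.job_running = snapshot.job_running
        rethrow()
    end
end

# Hypothetical analogue of a call to collect: it mutates the state
# (records one more task), then schedules everything recorded so far.
function collect_results!(state::ClientState)
    with_rollback(state) do s
        push!(s.pending_tasks, "collect")
        n = schedule(s.pending_tasks)
        empty!(s.pending_tasks)  # consumed only after scheduling succeeded
        n
    end
end
```

With this shape, a failing notebook cell throws an error but leaves `job_running` and the previously recorded tasks untouched, so the next cell can retry without restarting the job.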

When an exception occurs on the cluster

If the job crashes in the backend, we have little choice but to destroy the job. But if the code merely throws an exception, we should ideally propagate it back to the client side and roll back in the same way that we would for a scheduling failure.
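
One way to express the distinction between a crash and an ordinary exception, again as a sketch under assumptions rather than Banyan's real design. `RemoteException`, `ExecutorCrash`, `Job`, and `run_on_job` are hypothetical; setting `job.alive = false` stands in for whatever destroy_job actually does.

```julia
# Hypothetical error types. How the backend would really report these
# two cases is an assumption made for this sketch.
struct RemoteException <: Exception
    msg::String  # an ordinary exception raised by user code on the cluster
end

struct ExecutorCrash <: Exception
    msg::String  # the executor process itself died
end

# Hypothetical handle for a running job.
mutable struct Job
    alive::Bool
end

# Run `work` on `job`. Every error is propagated to the caller, but
# only an executor crash tears the job down.
function run_on_job(work, job::Job)
    try
        return work()
    catch e
        if e isa ExecutorCrash
            job.alive = false  # stand-in for destroying the job
        end
        rethrow()
    end
end
```

A caller would wrap each scheduled computation, for example `run_on_job(job) do ... end`, and could combine this with a rollback of client-side state so that a `RemoteException` leaves the job usable.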

calebwin • Oct 15 '21 00:10