banyan-julia
Don't destroy the job unless the executor crashes
We should really only end a running job if the program crashes on the executor or the user explicitly calls destroy_job.
When scheduling fails
On a call to a writing function or to collect, recorded lazy computation is scheduled and executed. If the scheduling fails, we currently destroy the job. If you're using Banyan Julia from a notebook, this is undesirable because you then have to restart the job (which can take 1-2 minutes) just because a single cell failed. Instead, a call to a writing function or to collect should leave global state untouched when it fails: any changes it made are rolled back in the case of a failure.
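The rollback behavior described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ClientState`, its fields, and `with_rollback` are hypothetical names invented here and are not part of Banyan's actual code.

```julia
# Hypothetical stand-in for the client-side global state (not Banyan's real types).
mutable struct ClientState
    pending_tasks::Vector{String}    # recorded lazy computation
    job_id::Union{String,Nothing}    # currently running job, if any
end

# Run `f(state)`. If `f` throws, restore `state` to the contents it had
# before the call and rethrow, so the caller sees the error but the job
# and the recorded computation survive.
function with_rollback(f, state::ClientState)
    snapshot = deepcopy(state)
    try
        return f(state)
    catch
        state.pending_tasks = snapshot.pending_tasks
        state.job_id = snapshot.job_id
        rethrow()
    end
end

# Example: a simulated scheduling failure after the state was partly modified.
state = ClientState(["read", "map"], "job-1")
try
    with_rollback(state) do s
        empty!(s.pending_tasks)      # scheduling consumes the recorded tasks
        error("scheduling failed")   # simulated failure
    end
catch err
    println("cell failed, job kept alive: ", state.job_id)
end
```

Taking a snapshot with `deepcopy` is the simplest option but copies the whole state on every call; a real implementation might instead record only the fields that scheduling is allowed to touch.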
When an exception occurs on the cluster
If the job crashes in the backend, we have little choice but to destroy the job. But if only an exception occurs, we should ideally propagate it back to the client side and roll back in the same way that we would in the case of a scheduling failure.
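One way to picture the distinction between a crash and an ordinary exception is the sketch below. It assumes the backend reports a status with each result; `ExecutionResult`, its status values, and `handle_result` are hypothetical and do not describe Banyan's actual protocol.

```julia
# Hypothetical result reported by the backend (assumed shape, not Banyan's protocol).
struct ExecutionResult
    status::Symbol     # assumed values: :ok, :exception, :crashed
    message::String
end

# Decide what the client does with a result. `rollback` and `destroy` are
# callbacks supplied by the caller. In both failure cases the error is
# raised on the client side so the user sees it.
function handle_result(result::ExecutionResult; rollback::Function, destroy::Function)
    if result.status == :crashed
        destroy()      # the job cannot be reused after a crash
        error("job crashed: " * result.message)
    elseif result.status == :exception
        rollback()     # keep the job alive and undo partial changes
        error("remote exception: " * result.message)
    end
    return nothing
end
```

With this split, only the `:crashed` branch ever ends the job; an exception costs the user one failed cell rather than a job restart.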