
Robust handling of worker and subworker crashes

Open gavento opened this issue 7 years ago • 1 comments

Currently, a subworker crash may bring down its worker, and a worker crash may bring down the server. We need to improve this. However, we are not aiming for infrastructure resiliency now: a subworker crash may still fail the task (and therefore the session), and a worker crash may still lose all the objects and fail all involved sessions. The main goal is to keep the server running and deliver a graceful error.

Robust failure handling will pave the way for retrying tasks (possibly on different workers) and, later, for worker crash resiliency.
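To illustrate the goal of containing a crash and reporting it as a task failure, here is a minimal sketch in Python. It is not Rain's actual implementation: `TaskResult` and `run_task_in_subworker` are hypothetical names, and the "subworker" is assumed to be a plain child Python process. The point is only that the parent inspects the child's exit status and turns a crash into an error value (with the child's stderr as a log) instead of crashing itself.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical result type: either an output or a graceful error."""
    ok: bool
    output: str = ""
    error: str = ""
    log: str = ""


def run_task_in_subworker(code: str, timeout: float = 10.0) -> TaskResult:
    """Run `code` in a child Python process (the assumed "subworker").

    A crash or hang of the child is reported as a failed task;
    the calling process keeps running.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return TaskResult(ok=False, error="subworker timed out")

    if proc.returncode != 0:
        # A non-zero exit status (negative on POSIX if killed by a signal)
        # is treated as a crash: fail the task and keep the logs.
        return TaskResult(
            ok=False,
            error=f"subworker exited with status {proc.returncode}",
            log=proc.stderr,
        )
    return TaskResult(ok=True, output=proc.stdout)


if __name__ == "__main__":
    print(run_task_in_subworker("print('hello')"))
    print(run_task_in_subworker("import os; os._exit(3)"))
```

Because the failure is returned as a value, the same call site could later decide to retry the task, possibly elsewhere, without changing how crashes are detected.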

gavento, Apr 13 '18 15:04

An executor (=subworker) crash is now handled with a graceful error and logs. A governor (=worker) crash still results in an overall panic.

spirali, Jul 02 '18 16:07