Add support to detect and handle a process that hangs in runtime

Open tim-shea opened this issue 3 years ago • 3 comments

Objective of issue: Add support in the runtime to detect when a process has hung, i.e. the process stops responding indefinitely and no exception is raised.

Lava version:

  • [x] 0.3.0 (feature release)

I'm submitting a ...

  • [ ] bug report
  • [x] feature request
  • [ ] documentation request

Current behavior: If a process model hangs during execution without raising an exception, the entire process graph can become unresponsive without informing the user of the source of the error. If the user is then forced to issue a keyboard interrupt, the resulting error may not be useful for debugging and may not identify the process model responsible for the hang.

Expected behavior: The runtime (service?) should apply a global timeout or other message-based "keep alive" signal to detect when a process model has hung. When any process model hangs, the runtime should pause or terminate process models as needed to unblock execution, or terminate completely if execution cannot be unblocked. The user should be reasonably informed of the source (which process model) and nature of the hang (e.g. timeout and phase).

When terminating processes due to a hang, the hung process should be cleaned up and resources released if at all possible, which might require some exceptional intervention since the process is unresponsive to messages.
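To make the "keep alive" idea concrete, here is a minimal sketch of a heartbeat watchdog. It is not Lava's runtime API: `run_with_watchdog`, `ProcessHangError`, and the phase labels are hypothetical names, and a plain `subprocess` child printing one line per phase stands in for a process model. The parent treats every line as a heartbeat, and if none arrives within the timeout it force-kills the child and reports which process hung and after which phase.

```python
import queue
import subprocess
import sys
import threading


class ProcessHangError(RuntimeError):
    """Raised when a monitored process stops sending heartbeats (hypothetical)."""


def _pump_lines(stream, out_queue):
    # Forward each line written by the child; None marks end of stream.
    for line in stream:
        out_queue.put(line.strip())
    out_queue.put(None)


def run_with_watchdog(name, child_code, heartbeat_timeout):
    """Run child_code in a subprocess and fail if it goes silent for too long."""
    proc = subprocess.Popen(
        [sys.executable, "-u", "-c", child_code],  # -u: unbuffered stdout
        stdout=subprocess.PIPE,
        text=True,
    )
    messages = queue.Queue()
    reader = threading.Thread(
        target=_pump_lines, args=(proc.stdout, messages), daemon=True
    )
    reader.start()
    last_phase = "startup"
    try:
        while True:
            try:
                msg = messages.get(timeout=heartbeat_timeout)
            except queue.Empty:
                # No heartbeat in time: report the source and nature of the hang.
                raise ProcessHangError(
                    f"process '{name}' hung after phase '{last_phase}' "
                    f"(no heartbeat for {heartbeat_timeout}s)"
                )
            if msg is None:  # child closed stdout, i.e. it exited
                return proc.wait()
            last_phase = msg
    finally:
        # A hung child ignores messages, so clean up with a forced kill.
        if proc.poll() is None:
            proc.kill()
        proc.wait()
        reader.join(timeout=1.0)
        proc.stdout.close()


# Stand-ins for process models: each printed line is a phase heartbeat.
HEALTHY = "print('spike'); print('learn')"
HUNG = "import time; print('spike'); time.sleep(60)"

exit_code = run_with_watchdog("healthy_proc", HEALTHY, heartbeat_timeout=10.0)
try:
    run_with_watchdog("hung_proc", HUNG, heartbeat_timeout=2.0)
    hang_report = None
except ProcessHangError as err:
    hang_report = str(err)
print(hang_report)
```

The forced `kill()` in the `finally` block is the "exceptional intervention" mentioned above: it also runs on a keyboard interrupt, so the child is released even when the user aborts manually. A real runtime would likely send heartbeats over its own messaging channels rather than stdout.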

Steps to reproduce: Will follow up with a mock example if possible.

Related code:

Other information:

tim-shea, Feb 11 '22 18:02

@bamsumit this topic arose earlier in the meeting on Lava runtime refactoring, and I thought it might be relevant to the issue we saw the other night with hanging processes that didn't clean up properly. Please feel free to add your thoughts or insights on the best way to handle it.

tim-shea, Feb 11 '22 18:02

Thanks @tim-shea for filing this as an issue. Yeah, we need a better error propagation mechanism from process models. Also, the ability to turn off multiprocessing for debugging would be a neat feature. @awintel what do you think?

bamsumit, Feb 12 '22 08:02

Yes, agreed. That would be useful. Please file an issue for that, but I don't believe this is something we can squeeze into Q1.

awintel, Feb 14 '22 22:02