Properly handle out-of-memory errors
Sometimes when a dstack job runs and then dies, it shows no sign of why. I suspect this happens when the process is terminated externally, e.g. due to running out of memory. Show the exit code (e.g. 1 or 0), similar to Kubernetes events when a container/job is terminated. This would help debugging.
Currently, exit codes do get into the logs. One issue I'm certainly aware of is that we don't correctly handle the case where the runner itself crashes because of an out-of-memory condition.
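As background on exit-code semantics (a hypothetical helper sketch, not dstack's actual code): on Linux, a process killed by a signal is conventionally reported with exit code 128 + signal number, and the kernel's OOM killer sends SIGKILL, so code 137 is a strong hint of an out-of-memory kill. A runner could surface that in the logs roughly like this:

```python
import signal

def describe_exit(code: int) -> str:
    """Map a container/process exit code to a human-readable reason.

    Codes above 128 conventionally mean the process was killed by
    signal (code - 128); SIGKILL (137) is what the Linux OOM killer sends.
    """
    if code == 0:
        return "Success"
    if code > 128:
        sig = signal.Signals(code - 128)
        if sig == signal.SIGKILL:
            return f"Killed by {sig.name} (possible out-of-memory)"
        return f"Killed by {sig.name}"
    return f"Failed with exit code {code}"

print(describe_exit(137))
```

This is the same convention Kubernetes relies on when it marks a container `OOMKilled` after seeing exit code 137.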
Steps to reproduce:
- Run a workflow that quickly consumes the entire machine's memory

Expectation:
- The run is marked as Failed
- The cloud instance is destroyed
- There is an error in the log

Actual:
- The cloud instance is not destroyed (perhaps because the runner crashed)
- There is no error in the log (perhaps the runner crashed)
I renamed the issue accordingly.
If there are other problems, let's submit separate issues with detailed steps to reproduce.
Container exit codes can now be viewed with `dstack ps -v` (#337).