Properly handle out-of-memory errors
Sometimes when a dstack job runs and then dies, it shows no sign of why. I suspect this happens when the process is terminated externally, e.g. due to running out of memory. Show the exit code (e.g. 1 or 0), similar to Kubernetes events when a container/job is terminated. This would help debugging.
Currently, exit codes do get into the logs. One issue I'm certainly aware of is that we don't correctly handle the case where the runner itself crashes because of an out-of-memory condition.
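As background on exit-code semantics (a hypothetical helper sketch, not dstack's actual code): on Linux, a process killed by a signal is conventionally reported with exit code 128 + signal number, and the kernel's OOM killer sends SIGKILL, so code 137 is a strong hint of an out-of-memory kill. A runner could surface that in the logs roughly like this:

```python
import signal

def describe_exit(code: int) -> str:
    """Map a container/process exit code to a human-readable reason.

    Codes above 128 conventionally mean the process was killed by
    signal (code - 128); SIGKILL (137) is what the Linux OOM killer sends.
    """
    if code == 0:
        return "Success"
    if code > 128:
        sig = signal.Signals(code - 128)
        if sig == signal.SIGKILL:
            return f"Killed by {sig.name} (possible out-of-memory)"
        return f"Killed by {sig.name}"
    return f"Failed with exit code {code}"

print(describe_exit(137))
```

This is the same convention Kubernetes relies on when it marks a container `OOMKilled` after seeing exit code 137.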
Steps to reproduce:
- Run a workflow that quickly consumes the entire machine's memory

Expectation:
- The run is marked as Failed
- The cloud instance is destroyed
- There is an error in the log

Actual:
- The cloud instance is not destroyed (perhaps because the runner crashed)
- There is no error in the log (perhaps the runner crashed)
I renamed the issue accordingly.
If there are other problems, let's submit separate issues with detailed steps to reproduce.
Container exit codes can now be viewed with `dstack ps -v` (#337).