
Handle out-of-memory errors properly

Open kengz opened this issue 3 years ago • 1 comments

Sometimes a dstack job runs and then dies without any indication of why. I suspect this happens when the process is terminated externally, e.g. because it ran out of memory. Please show the exit code (1 or 0), similar to Kubernetes events when a container/job is terminated. This would help with debugging.

kengz · Jul 08 '22 12:07

Currently, exit codes do get into the logs. One issue I'm aware of is that we don't correctly handle the case where the runner itself crashes because the machine runs out of memory.

Steps to reproduce:

  1. Run a workflow that quickly consumes the entire machine's memory

Expectation:

  1. The run is marked as Failed
  2. The cloud instance is destroyed
  3. There is an error in the log

Actual:

  1. The cloud instance is not destroyed (perhaps because the runner crashed)
  2. There is no error in the log (perhaps the runner crashed)
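For the reproduction step, a minimal sketch of a script that a workflow could run to exhaust memory quickly. The function name consume_memory and its parameters are hypothetical, not part of dstack; the only assumption is that the workflow can run a plain Python script.

```python
from typing import Optional


def consume_memory(chunk_mb: int = 256, limit_mb: Optional[int] = None) -> int:
    """Allocate memory in chunks and return the number of MiB allocated.

    With limit_mb=None the loop never stops on its own: it runs until the
    process raises MemoryError or is killed from outside (e.g. by the OOM killer).
    """
    chunks = []  # keep references so nothing is garbage-collected
    allocated = 0
    while limit_mb is None or allocated < limit_mb:
        # Non-zero fill so the pages are actually written, not just reserved.
        chunks.append(b"\xff" * (chunk_mb * 1024 * 1024))
        allocated += chunk_mb
    return allocated


if __name__ == "__main__":
    # Hypothetical reproduction: no limit, so memory use grows until failure.
    consume_memory()
```

Passing a small limit_mb makes it possible to check the script locally without taking the machine down.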

I renamed the issue accordingly.

If there are other problems, let's submit separate issues with detailed steps to reproduce.

peterschmidt85 · Sep 23 '22 10:09

Container exit codes can now be viewed with dstack ps -v (#337).
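When reading those exit codes, it may help to know that a process killed by a signal is usually reported as 128 plus the signal number, so a container killed by the Linux OOM killer (which sends SIGKILL, signal 9) typically shows 137 rather than 1. A rough sketch of how such a code could be interpreted follows. The helper describe_exit_code is hypothetical and is not how dstack itself reports codes, and the 128+N rule is a shell/container convention, not a guarantee.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Give a human-readable guess at what an exit code means."""
    if code == 0:
        return "exited normally"
    if code < 0:
        # Python's subprocess reports "killed by signal N" as -N.
        signum = -code
    elif code > 128:
        # Shells and container runtimes commonly report it as 128 + N.
        signum = code - 128
    else:
        return f"exited with error code {code}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = f"signal {signum}"
    # SIGKILL (9) is what the Linux OOM killer sends, but other things send it too.
    hint = " (possibly the OOM killer)" if signum == 9 else ""
    return f"killed by {name}{hint}"
```

For example, describe_exit_code(137) points at SIGKILL, which is a hint (not proof) that the job ran out of memory.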

r4victor · Apr 28 '23 06:04