kaos
kaos copied to clipboard
Resource exhaustion is not caught and prompted by the system to the user
What is the current behaviour?
When training a big model which exceeds container memory, the training pipeline never fails and keeps running.
What is the expected behaviour?
Handle the failure and alert the user.