elasticdl
elasticdl copied to clipboard
Kill job when controller terminates
@workingloong reminded a user request -- kill the job if the controller terminates.
This request comes from a deployment of ElasticDL where users run the command-line tool elasticdl
from within a container. The request is when the container terminates, the job, if not yet completes, should be killed.
To what I understand, currently, the elasticdl
command submits the job and returns a job ID. So, the user could loop and check the status of the ID, like the following:
id=$(elasticdl train ...)
for true; do
if kubectl get status $Id != "Completed"; then
sleep 1
else
exit 0
done
How about we add a synchronous mode to elasticdl, so that the Go binary does the above waiting loop and the above shell script could be simplified into
elasticdl --wait train ...
In this way, we can add a signal processor to the elasticdl binary to make it file a Kubernetes API call to kill the job when it receives signal 9 (KILL).