Kill job when controller terminates

Open wangkuiyi opened this issue 4 years ago • 0 comments

@workingloong reminded me of a user request: kill the job if the controller terminates.

This request comes from a deployment of ElasticDL where users run the command-line tool elasticdl from within a container. The request is that when the container terminates, the job should be killed if it has not yet completed.

As far as I understand, the elasticdl command currently submits the job and returns a job ID. So, the user could loop and check the status of that ID, like the following:

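id=$(elasticdl train ...)
while true; do
   # Schematic status query: the exact kubectl invocation depends on
   # which Kubernetes resource the job ID maps to.
   if [ "$(kubectl get status "$id")" != "Completed" ]; then
      sleep 1
   else
      exit 0
   fi
done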

How about we add a synchronous mode to elasticdl, so that the Go binary runs the above waiting loop itself and the shell script can be simplified to

elasticdl --wait train ...

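This way, we can add a signal handler to the elasticdl binary that files a Kubernetes API call to kill the job when it receives a termination signal such as SIGTERM (15) or SIGINT (2). Signal 9 (KILL) itself cannot be caught by a process, so the handler has to rely on the SIGTERM that container runtimes typically send before resorting to SIGKILL.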

wangkuiyi, Apr 26 '20 05:04