Create an optional mechanism to avoid duplicate jobs
We create Kubernetes pods that run Spydra, which submits a job to Dataproc. Sometimes our pods are removed and we automatically recreate the pod (Spydra), and it submits the same job again. As a result, duplicate jobs end up running in Dataproc. Those jobs may take hours, which costs a lot.
I think we could add an optional mechanism to avoid this situation by labeling jobs: before submitting a job, check whether there is any job with that label whose status is DONE; if so, do not submit the job and throw an exception instead.
Dataproc already has a mechanism for this -- the job id. You cannot have Dataproc jobs with duplicate ids. As long as you don't delete jobs after they finish, this can be used to avoid submitting the same job multiple times.
```
$ gcloud beta dataproc jobs submit spark-sql --id=bar -e "select 1" --cluster my-cluster
<lots-of-output>
$ gcloud beta dataproc jobs submit spark-sql --id=bar -e "select 1" --cluster my-cluster
ERROR: (gcloud.beta.dataproc.jobs.submit.spark-sql) ALREADY_EXISTS: Already exists: Failed to submit job: Job projects/myproject/regions/us-central1/jobs/bar
```
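For reference, a rough Java sketch of the same guard using the `google-cloud-dataproc` v1 client (project, region, cluster, and job id below are placeholders): submit with a deterministic job id derived from the task, and treat ALREADY_EXISTS as "this job was already submitted".

```java
import com.google.api.gax.rpc.AlreadyExistsException;
import com.google.cloud.dataproc.v1.Job;
import com.google.cloud.dataproc.v1.JobControllerClient;
import com.google.cloud.dataproc.v1.JobControllerSettings;
import com.google.cloud.dataproc.v1.JobPlacement;
import com.google.cloud.dataproc.v1.JobReference;
import com.google.cloud.dataproc.v1.QueryList;
import com.google.cloud.dataproc.v1.SparkSqlJob;

public class SubmitWithFixedJobId {
  public static void main(String[] args) throws Exception {
    String projectId = "myproject";    // placeholder
    String region = "us-central1";     // placeholder
    String clusterName = "my-cluster"; // placeholder
    String jobId = "bar";              // deterministic id derived from the task

    JobControllerSettings settings = JobControllerSettings.newBuilder()
        .setEndpoint(region + "-dataproc.googleapis.com:443")
        .build();

    try (JobControllerClient jobs = JobControllerClient.create(settings)) {
      Job job = Job.newBuilder()
          .setReference(JobReference.newBuilder().setJobId(jobId))
          .setPlacement(JobPlacement.newBuilder().setClusterName(clusterName))
          .setSparkSqlJob(SparkSqlJob.newBuilder()
              .setQueryList(QueryList.newBuilder().addQueries("select 1")))
          .build();
      try {
        jobs.submitJob(projectId, region, job);
      } catch (AlreadyExistsException e) {
        // A job with this id already exists (e.g. submitted by a previous pod);
        // skip the duplicate submission.
      }
    }
  }
}
```

A recreated pod that derives the same job id ends up in the `AlreadyExistsException` branch instead of starting a second run.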
@karth295 Thanks for your answer, but there is another case: what if the job was submitted before and failed for some reason? We want to submit it again to retry. With the job-id approach, it cannot be resubmitted.
My scenario is:
- Submit the job for the first time: Job(id=1, labels=["task_id" -> "task-x"])
- Check whether there is a job with that label; if it is
  - running, throw an exception
  - done, throw an exception
  - otherwise, submit
- Result: it will be submitted.
- After some time, the job fails.
- The job is resubmitted as Job(id=2, labels=["task_id" -> "task-x"])
- Check whether there is a job with that label; if it is
  - running, throw an exception
  - done, throw an exception
  - otherwise, submit
- Result: it will be submitted, because even though it was submitted before, it failed and we want to rerun it.
Other scenario:
- Submit the job for the first time: Job(id=1, labels=["task_id" -> "task-x"])
- Check result: submit.
- After some time, the job finishes successfully.
- The job is resubmitted as Job(id=2, labels=["task_id" -> "task-x"])
- Check result: it will not be resubmitted, because there is already a job with that label that is done (a rough sketch of this check follows below).
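A minimal sketch of that check with the Dataproc v1 Java client, assuming the jobs carry a `task_id` label; the parameter names and the use of `IllegalStateException` are illustrative only:

```java
import com.google.cloud.dataproc.v1.Job;
import com.google.cloud.dataproc.v1.JobControllerClient;
import com.google.cloud.dataproc.v1.JobStatus;
import com.google.cloud.dataproc.v1.ListJobsRequest;

public class LabelDedupCheck {

  /**
   * Throws if a job carrying the given task_id label is still active or has
   * already finished successfully; otherwise the caller may submit a new attempt.
   */
  static void checkNoDuplicate(
      JobControllerClient jobs, String projectId, String region, String taskId) {
    ListJobsRequest request = ListJobsRequest.newBuilder()
        .setProjectId(projectId)
        .setRegion(region)
        // Dataproc job-list filters support labels.<key> (and status.state).
        .setFilter("labels.task_id = " + taskId)
        .build();

    for (Job job : jobs.listJobs(request).iterateAll()) {
      JobStatus.State state = job.getStatus().getState();
      if (state == JobStatus.State.DONE) {
        throw new IllegalStateException(
            "Task " + taskId + " already completed as job "
                + job.getReference().getJobId());
      }
      if (state == JobStatus.State.PENDING
          || state == JobStatus.State.SETUP_DONE
          || state == JobStatus.State.RUNNING) {
        throw new IllegalStateException(
            "Task " + taskId + " is still running as job "
                + job.getReference().getJobId());
      }
      // ERROR / CANCELLED attempts fall through, so a failed run can be retried.
    }
  }
}
```

The submitter would call `checkNoDuplicate(...)` right before building and submitting the new job with the same label.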
Ah, fair enough.
Another solution to consider is using restartable jobs and letting Dataproc re-run jobs on failure. You can specify a request_id (docs) so that when your pod is recreated, it adopts the same job.
That may or may not work for you, depending on what else your pod needs to do when it's recreated.
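For illustration, a rough sketch of that combination with the Dataproc v1 Java client: `JobScheduling.max_failures_per_hour` makes the job restartable so Dataproc re-runs it on failure, and the `request_id` on `SubmitJobRequest` makes the submission idempotent, so a recreated pod re-sending the same request gets back the already-submitted job. Project, region, cluster, query, and the request id value are placeholders.

```java
import com.google.cloud.dataproc.v1.Job;
import com.google.cloud.dataproc.v1.JobControllerClient;
import com.google.cloud.dataproc.v1.JobPlacement;
import com.google.cloud.dataproc.v1.JobScheduling;
import com.google.cloud.dataproc.v1.QueryList;
import com.google.cloud.dataproc.v1.SparkSqlJob;
import com.google.cloud.dataproc.v1.SubmitJobRequest;

public class RestartableSubmit {

  static Job submitRestartable(
      JobControllerClient jobs, String projectId, String region, String clusterName) {
    Job job = Job.newBuilder()
        .setPlacement(JobPlacement.newBuilder().setClusterName(clusterName))
        .setSparkSqlJob(SparkSqlJob.newBuilder()
            .setQueryList(QueryList.newBuilder().addQueries("select 1")))
        // Restartable job: Dataproc re-runs the driver on failure, up to 5 times per hour.
        .setScheduling(JobScheduling.newBuilder().setMaxFailuresPerHour(5))
        .build();

    SubmitJobRequest request = SubmitJobRequest.newBuilder()
        .setProjectId(projectId)
        .setRegion(region)
        .setJob(job)
        // Idempotency token: if the same request_id is sent twice, Dataproc ignores
        // the second submission and returns the job created by the first one.
        .setRequestId("task-x-attempt-1") // placeholder, e.g. derived from the task
        .build();

    return jobs.submitJob(request);
  }
}
```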
@karth295 Thanks, using restartable jobs together with request_id is a good approach, but it does not fit my case; our jobs are not graceful enough.