dsub
Multiple jobs with retries
Hi, I typically use a loop to launch jobs that can all run concurrently and are not dependent on each other. I would like to use the --retries flag to automatically rerun any of the independent jobs that fail. This does not seem to work. Is there a solution to my problem?
Example code:
%%bash --out LINE_COUNT_JOB_ID
# Get a shorter username to leave more characters for the job name.
DSUB_USER_NAME="$(echo "${OWNER_EMAIL}" | cut -d@ -f1)"
# For AoU RWB projects network name is "network".
AOU_NETWORK=network
AOU_SUBNETWORK=subnetwork
MACHINE_TYPE="n2-standard-4"
BASH_SCRIPT="gs://fc-secure-cb192ac6-30ba-46b9-92ee-896a6e36c63e/dsub/hpoisner/snplist_step1/SNPlist_step1_mac75k.sh"
LOWER=1
UPPER=23
for ((chromo=$LOWER;chromo<$UPPER;chromo+=1))
do
  dsub \
    --provider google-cls-v2 \
    --user-project "${GOOGLE_PROJECT}" \
    --project "${GOOGLE_PROJECT}" \
    --image "marketplace.gcr.io/google/ubuntu1804:latest" \
    --network "${AOU_NETWORK}" \
    --subnetwork "${AOU_SUBNETWORK}" \
    --service-account "$(gcloud config get-value account)" \
    --user "${DSUB_USER_NAME}" \
    --regions us-central1 \
    --logging "${WORKSPACE_BUCKET}/dsub/v7/logs/{job-name}/{user-id}/$(date +'%Y%m%d/%H%M%S')/{job-id}-{task-id}-{task-attempt}.log" \
    "$@" \
    --preemptible \
    --retries 2 \
    --wait \
    --boot-disk-size 1000 \
    --machine-type ${MACHINE_TYPE} \
    --name "${JOB_NAME}" \
    --script "${BASH_SCRIPT}" \
    --env GOOGLE_PROJECT=${GOOGLE_PROJECT} \
    --input plink="" \
    --input bgen_file="" \
    --input sample_file="" \
    --env chrom=${chromo} \
    --output-recursive OUTPUT_PATH="${OUTPUT_FILES}/${chromo}"
done
Hi @hpoisner, you mention that this does not seem to work, but can you please describe what you observe happening? Are there any error messages? Any relevant logging? Any output that would indicate that a retry is not happening?
The issue is that it turns jobs that should run in parallel into sequential jobs. There aren't any specific error messages. We just want to run multiple jobs at once with the capacity to retry.
I see you're doing a loop over the chromosomes, and each call to dsub has a --wait flag. This means that each chromosome's job will run to completion before the loop goes on to the next.
To run the jobs in parallel, instead you'll want to define a tasks TSV file where each line is a different chromosome. See https://github.com/DataBiosphere/dsub#submitting-a-batch-job for details on the tasks file format and the --tasks flag.
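As a rough sketch of that suggestion (not a tested recipe for this workspace), the loop can write the tasks file instead of calling dsub, and a single dsub call then submits all chromosomes as parallel tasks. The file name chromosome_tasks.tsv, the placeholder bucket path, and the function name submit_chromosome_tasks are hypothetical. The header row assumes the tasks file accepts --env and --output-recursive columns as described in the dsub README, and the single --wait is kept on the assumption that --retries needs it.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical defaults so the sketch runs on its own; in the notebook,
# OUTPUT_FILES would already be set as in the original loop.
OUTPUT_FILES="${OUTPUT_FILES:-gs://your-bucket/output}"
TASKS_FILE="${TASKS_FILE:-chromosome_tasks.tsv}"
LOWER=1
UPPER=23

# Header row: one tab-separated column per per-task dsub parameter.
printf -- '--env chrom\t--output-recursive OUTPUT_PATH\n' > "${TASKS_FILE}"

# One row per chromosome (1 through 22), replacing the per-chromosome
# --env and --output-recursive flags from the loop.
for ((chromo=LOWER; chromo<UPPER; chromo+=1)); do
  printf '%s\t%s\n' "${chromo}" "${OUTPUT_FILES}/${chromo}" >> "${TASKS_FILE}"
done

# Single submission covering every row of the tasks file. Only a subset of
# the original flags is shown and the logging path is simplified; the other
# flags (network, service account, machine type, inputs, and so on) would
# carry over unchanged from the original command.
submit_chromosome_tasks() {
  dsub \
    --provider google-cls-v2 \
    --project "${GOOGLE_PROJECT}" \
    --regions us-central1 \
    --logging "${WORKSPACE_BUCKET}/dsub/v7/logs" \
    --script "${BASH_SCRIPT}" \
    --tasks "${TASKS_FILE}" \
    --preemptible \
    --retries 2 \
    --wait
}
```

Running the block only writes the tasks file; calling submit_chromosome_tasks afterwards would issue the one dsub command. Because --wait now applies to the whole batch rather than to each chromosome, the tasks should run concurrently while dsub stays attached to handle retries.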