bunny icon indicating copy to clipboard operation
bunny copied to clipboard

Task running before a required predecessor task

Open mbookman opened this issue 8 years ago • 3 comments

I am using rc5 and running the GDC's transform.cwl with rabix:

https://github.com/NCI-GDC/gdc-dnaseq-cwl/blob/master/workflows/dnaseq/transform.cwl

$ ./rabix/rabix --basedir ./out gdc-dnaseq-cwl/workflows/dnaseq/transform.cwl gdc-dnaseq-cwl/workflows/dnaseq/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.json 
[2017-05-04 20:34:58.521] [INFO] Job root.samtools_bamtobam has started
[2017-05-04 20:34:58.585] [INFO] Pulling docker image quay.io/ncigdc/samtools:1
[2017-05-04 20:35:00.098] [INFO] Running command line: /usr/local/bin/samtools view -Shb /mnt/SCRATCH/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam > /home/mbookman/out/2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c/root/samtools_bamtobam/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
[2017-05-04 20:35:45.232] [INFO] Job root.samtools_bamtobam has completed
[2017-05-04 20:35:46.023] [INFO] Job root.picard_validatesamfile_original has started
[2017-05-04 20:35:46.028] [INFO] Job root.biobambam_bamtofastq has started
[2017-05-04 20:35:46.039] [INFO] Job root.bam_readgroup_to_json has started
[2017-05-04 20:35:46.062] [INFO] Pulling docker image quay.io/ncigdc/bam_readgroup_to_json:1
[2017-05-04 20:35:46.075] [INFO] Pulling docker image quay.io/ncigdc/biobambam:1
[2017-05-04 20:35:46.075] [INFO] Pulling docker image quay.io/ncigdc/picard:1
[2017-05-04 20:35:46.085] [INFO] Running command line: /usr/local/bin/bam_readgroup_to_json --bam_path /home/mbookman/out/2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c/root/samtools_bamtobam/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam --mode lenient
[2017-05-04 20:35:46.119] [INFO] Running command line: java -jar /usr/local/bin/picard.jar ValidateSamFile OUTPUT=NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam.metrics INPUT=/home/mbookman/out/2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c/root/samtools_bamtobam/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam MAX_OUTPUT=2147483647 MODE=VERBOSE TMP_DIR=. VALIDATION_STRINGENCY=LENIENT
[2017-05-04 20:35:46.145] [INFO] Running command line: /usr/local/bin/bamtofastq T=tempfq collate=1 exclude=QCFAIL,SECONDARY,SUPPLEMENTARY filename=/home/mbookman/out/2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c/root/samtools_bamtobam/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam gz=1 inputformat=bam level=5 outputdir=. outputperreadgroup=1 outputperreadgroupsuffixF=_1.fq.gz outputperreadgroupsuffixF2=_2.fq.gz outputperreadgroupsuffixO=_o1.fq.gz outputperreadgroupsuffixO2=_o2.fq.gz outputperreadgroupsuffixS=_s.fq.gz tryoq=1
[2017-05-04 20:35:48.510] [INFO] Job root.bam_readgroup_to_json has completed
[2017-05-04 20:35:49.017] [INFO] Job root.readgroup_json_db.1 has started
[2017-05-04 20:35:49.039] [INFO] Pulling docker image quay.io/ncigdc/readgroup_json_db:1
[2017-05-04 20:35:49.060] [INFO] Running command line: /usr/local/bin/readgroup_json_db --json_path /home/mbookman/out/2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c/root/bam_readgroup_to_json/SRR622461.json --uuid demo
[2017-05-04 20:35:52.973] [INFO] Job root.readgroup_json_db.1 has completed
[2017-05-04 20:35:54.016] [INFO] Job root.merge_readgroup_json_db has started
[2017-05-04 20:35:54.041] [INFO] Pulling docker image quay.io/ncigdc/merge_sqlite:3
[2017-05-04 20:35:54.063] [INFO] Running command line: /usr/local/bin/merge_sqlite --source_sqlite /home/mbookman/out/2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c/root/readgroup_json_db/1/demo.db --uuid demo
[2017-05-04 20:35:55.927] [INFO] Job root.merge_readgroup_json_db has completed
[2017-05-04 20:36:30.328] [INFO] Job root.picard_validatesamfile_original has completed
[2017-05-04 20:36:31.018] [INFO] Job root.picard_validatesamfile_original_to_sqlite has started
[2017-05-04 20:36:31.023] [INFO] Pulling docker image quay.io/ncigdc/picard_metrics_sqlite:1
[2017-05-04 20:36:31.055] [INFO] Running command line: /usr/local/bin/picard_metrics_sqlite --metric_name ValidateSamFile --bam NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam --input_state original --metric_path /home/mbookman/out/2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c/root/picard_validatesamfile_original/NA12878.chrom20.ILLUMINA.bwa.CEU.low_coverage.20121211.bam.metrics --uuid demo
[2017-05-04 20:36:33.830] [INFO] Job root.picard_validatesamfile_original_to_sqlite has completed
[2017-05-04 20:37:02.983] [INFO] Job root.biobambam_bamtofastq has completed
[2017-05-04 20:37:03.128] [INFO] Job root.remove_duplicate_fastq2.1 has started
[2017-05-04 20:37:03.131] [INFO] Job root.remove_duplicate_fastq1.1 has started
[2017-05-04 20:37:03.134] [INFO] Pulling docker image quay.io/ncigdc/fastq_remove_duplicate_qname:1
[2017-05-04 20:37:03.139] [INFO] Job root.remove_duplicate_fastq_s.1 has started
[2017-05-04 20:37:03.142] [INFO] Pulling docker image quay.io/ncigdc/fastq_remove_duplicate_qname:1
[2017-05-04 20:37:03.147] [INFO] Job root.remove_duplicate_fastq_o1.1 has started
[2017-05-04 20:37:03.151] [INFO] Pulling docker image quay.io/ncigdc/fastq_remove_duplicate_qname:1
[2017-05-04 20:37:03.157] [INFO] Running command line: bash -c 'zcat /home/mbookman/out/2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c/root/biobambam_bamtofastq/SRR622461_1.fq.gz | /usr/local/bin/fastq_remove_duplicate_qname - | gzip - > SRR622461_1.fq.gz'
[2017-05-04 20:37:03.170] [INFO] Running command line: bash -c 'zcat /home/mbookman/out/2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c/root/biobambam_bamtofastq/SRR622461_2.fq.gz | /usr/local/bin/fastq_remove_duplicate_qname - | gzip - > SRR622461_2.fq.gz'
[2017-05-04 20:37:03.173] [INFO] Pulling docker image quay.io/ncigdc/fastq_remove_duplicate_qname:1
[2017-05-04 20:37:03.188] [INFO] Running command line: bash -c 'zcat /home/mbookman/out/2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c/root/biobambam_bamtofastq/SRR622461_s.fq.gz | /usr/local/bin/fastq_remove_duplicate_qname - | gzip - > SRR622461_s.fq.gz'
[2017-05-04 20:37:03.202] [INFO] Running command line: bash -c 'zcat /home/mbookman/out/2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c/root/biobambam_bamtofastq/SRR622461_o1.fq.gz | /usr/local/bin/fastq_remove_duplicate_qname - | gzip - > SRR622461_o1.fq.gz'
[2017-05-04 20:37:04.137] [INFO] Job root.sort_scattered_fastq_o2 has started
[2017-05-04 20:37:04.155] [INFO] Job root.sort_scattered_fastq_o2 will be executed without a container. It is an expression tool.
[2017-05-04 20:37:05.666] [INFO] Job root.sort_scattered_fastq_o2 has completed
[2017-05-04 20:37:06.119] [INFO] Job root.decider_bwa_o2 has started
[2017-05-04 20:37:06.128] [INFO] Job root.fastqc_o2.1 has started
[2017-05-04 20:37:06.141] [INFO] Job root.decider_bwa_o2 will be executed without a container. It is an expression tool.
[2017-05-04 20:37:06.158] [INFO] Pulling docker image quay.io/ncigdc/fastqc:1
[2017-05-04 20:37:06.176] [INFO] Running command line: /usr/local/FastQC/fastqc --dir . --format fastq --noextract --outdir . --threads 8
[2017-05-04 20:37:07.705] [INFO] Job root.remove_duplicate_fastq_o1.1 has completed
[2017-05-04 20:37:07.874] [ERROR] Failed to execute status command for ddeb3bb0-46a5-4e49-bb1d-7b2dc80297aa. Could not collect outputs.
org.rabix.executor.ExecutorException: Could not collect outputs.
	at org.rabix.executor.handler.impl.JobHandlerImpl.postprocess(JobHandlerImpl.java:352) ~[rabix-backend-local.jar:na]
	at org.rabix.executor.execution.command.StatusCommand.run(StatusCommand.java:53) ~[rabix-backend-local.jar:na]
	at org.rabix.executor.execution.JobHandlerCommand.run(JobHandlerCommand.java:51) [rabix-backend-local.jar:na]
	at org.rabix.executor.execution.JobHandlerRunnable.run(JobHandlerRunnable.java:60) [rabix-backend-local.jar:na]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
Caused by: org.rabix.bindings.BindingException: Failed to populate outputs
	at org.rabix.bindings.cwl.CWLProcessor.postprocess(CWLProcessor.java:130) ~[rabix-backend-local.jar:na]
	at org.rabix.bindings.cwl.CWLBindings.postprocess(CWLBindings.java:83) ~[rabix-backend-local.jar:na]
	at org.rabix.executor.handler.impl.JobHandlerImpl.postprocess(JobHandlerImpl.java:330) ~[rabix-backend-local.jar:na]
	... 6 common frames omitted
Caused by: org.rabix.bindings.cwl.expression.CWLExpressionException: Failed evaluating expression ${
   function include(arr,obj) {
     return (arr.indexOf(obj) != -1)
   }
 
   function endsWith(str, suffix) {
     return str.indexOf(suffix, str.length - suffix.length) !== -1;
   }
 
   function local_basename(path) {
     var basename = path.split(/[\\/]/).pop();
     return basename
   }
 
   function local_dirname(path) {
     return path.replace(/\\/g,'/').replace(/\/[^\/]*$/, '');
   }
 
   function get_slice_number(fastq_name) {
     if (endsWith(fastq_name, '_1.fq.gz')) {
       return -8
     }
     else if (endsWith(fastq_name, '_s.fq.gz')) {
       return -8
     }
     else if (endsWith(fastq_name, '_o1.fq.gz')) {
       return -9
     }
     else if (endsWith(fastq_name, '_o2.fq.gz')) {
       return -9
     }
     else {
       exit()
     }
   }
   
   // get predicted readgroup basenames from fastq
   var readgroup_basename_array = [];
   for (var i = 0; i < inputs.fastq_path.length; i++) {
     var fq_path = inputs.fastq_path[i];
     var fq_name = local_basename(fq_path.location);
 
     var slice_number = get_slice_number(fq_name);
     
     var readgroup_name = fq_name.slice(0,slice_number) + ".json";
     readgroup_basename_array.push(readgroup_name);
   }
 
   // find which readgroup items are in predicted basenames
   var readgroup_array = [];
   for (var i = 0; i < inputs.readgroup_path.length; i++) {
     var readgroup_path = inputs.readgroup_path[i];
     var readgroup_basename = local_basename(readgroup_path.location);
     if (include(readgroup_basename_array, readgroup_basename)) {
       readgroup_array.push(readgroup_path);
     }
   }
 
   var readgroup_sorted = readgroup_array.sort(function(a,b) { return a.location > b.location ? 1 : (a.location < b.location ? -1 : 0) })
   return {'output_readgroup_paths': readgroup_sorted}
 }
.
	at org.rabix.bindings.cwl.expression.javascript.CWLExpressionJavascriptResolver.evaluate(CWLExpressionJavascriptResolver.java:89) ~[rabix-backend-local.jar:na]
	at org.rabix.bindings.cwl.CWLProcessor.postprocess(CWLProcessor.java:127) ~[rabix-backend-local.jar:na]
	... 8 common frames omitted
<removed repeated stack traces>
[2017-05-04 20:37:07.889] [INFO] Failed to execute status command for ddeb3bb0-46a5-4e49-bb1d-7b2dc80297aa. Could not collect outputs.
[2017-05-04 20:37:07.899] [WARN] Job root.decider_bwa_o2, rootId: 2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c failed: Failed to execute status command for ddeb3bb0-46a5-4e49-bb1d-7b2dc80297aa. Could not collect outputs.
[2017-05-04 20:37:07.930] [WARN] Root job 2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c failed.
[2017-05-04 20:37:08.115] [INFO] Job root.sort_scattered_fastq_o1 has started
[2017-05-04 20:37:08.129] [INFO] Job root.sort_scattered_fastq_o1 will be executed without a container. It is an expression tool.
[2017-05-04 20:37:08.134] [INFO] Failed to execute a Job

$ ls -ltr out/2f09d953-e255-45ea-8fe1-e1b7fdf5dc0c/root/
total 60
drwxr-xr-x 2 mbookman docker 4096 May  4 20:35 samtools_bamtobam
drwxr-xr-x 2 mbookman docker 4096 May  4 20:35 bam_readgroup_to_json
drwxr-xr-x 4 mbookman docker 4096 May  4 20:35 readgroup_json_db
drwxr-xr-x 2 mbookman docker 4096 May  4 20:35 merge_readgroup_json_db
drwxrwxrwx 2 mbookman docker 4096 May  4 20:36 picard_validatesamfile_original
drwxr-xr-x 3 mbookman docker 4096 May  4 20:36 picard_validatesamfile_original_to_sqlite
drwxr-xr-x 2 mbookman docker 4096 May  4 20:37 biobambam_bamtofastq
drwxr-xr-x 4 mbookman docker 4096 May  4 20:37 remove_duplicate_fastq2
drwxr-xr-x 4 mbookman docker 4096 May  4 20:37 remove_duplicate_fastq1
drwxr-xr-x 4 mbookman docker 4096 May  4 20:37 remove_duplicate_fastq_s
drwxr-xr-x 4 mbookman docker 4096 May  4 20:37 remove_duplicate_fastq_o1
drwxr-xr-x 2 mbookman docker 4096 May  4 20:37 sort_scattered_fastq_o2
drwxr-xr-x 2 mbookman docker 4096 May  4 20:37 decider_bwa_o2
drwxr-xr-x 4 mbookman docker 4096 May  4 20:37 fastqc_o2
drwxr-xr-x 2 mbookman docker 4096 May  4 20:37 sort_scattered_fastq_o1

What I observe in the output is that remove_duplicate_fastq_o2 did not run, but sort_scattered_fastq_o2 (which depends on it) did. That seems to be why it failed to "collect output".

Here are the relevant bits from the CWL:

  - id: remove_duplicate_fastq_o2
    run: ../../tools/fastq_remove_duplicate_qname.cwl
    scatter: INPUT
    in:
      - id: INPUT
        source: biobambam_bamtofastq/output_fastq_o2
    out:
      - id: OUTPUT

...

  - id: sort_scattered_fastq_o2
    run: ../../tools/sort_scatter_expression.cwl
    in:
      - id: INPUT
        source: remove_duplicate_fastq_o2/OUTPUT
    out:
      - id: OUTPUT

Using the cwltool gets well past this step, although it is running each step sequentially. It is not trying to build a graph and run in parallel. Still the CWL looks ok to me.

mbookman avatar May 04 '17 22:05 mbookman

Hi, it seems that remove_duplicate_fastq_o2 received an empty list at its input and was completed without going to the executor.

In that case the workflow went like this:

[] -> remove_duplicate_fastq_o2 - [[]] -> sort_scattered_fastq_o2 - [[]] -> decider_bwa_o2

And then while evaluating decider_bwa_expression.cwl the executor failed.

Most likely the issue is with the empty output of remove_duplicate_fastq_o2 being wrapped inside of another array and decider_bwa_expression only working with empty lists or lists of files, throwing an exception in this case.

Will keep investigating and will perhaps create a pull request with a quick fix to output wrapping today.

milos-ljubinkovic avatar May 05 '17 13:05 milos-ljubinkovic

This is a regression, on our part. Scattering on empty lists worked properly at some point. I've submitted new conformance tests to check for these cases: common-workflow-language/common-workflow-language#435

StarvingMarvin avatar May 05 '17 17:05 StarvingMarvin

Hi, we have made a new release (v1.0.2) just now, this fix should be included inside.

milos-ljubinkovic avatar Oct 18 '17 18:10 milos-ljubinkovic