cwltool
"File staging conflict" error is suppressed when using a container; files get overwritten
When you try to stage input files with an InitialWorkDirRequirement, and the files have the same name, they normally trigger an error that halts the workflow. However, if you are using a Singularity container, the error message does not arise and the files get silently overwritten.
Here is code that reproduces the problem:
make_file.cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [ "bash", "run.sh" ]
stdout: output.txt
requirements:
  InitialWorkDirRequirement:
    listing:
      - entryname: run.sh
        entry: |-
          echo "$1"
inputs:
  sampleId:
    type: string
    inputBinding:
      position: 1
outputs:
  output_file:
    type: stdout
cat.cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [ "bash", "run.sh" ]
requirements:
  # DockerRequirement: # <- this is the part that causes the error
  #   dockerPull: ubuntu:latest
  InitialWorkDirRequirement:
    listing:
      - entryname: some_dir # <- put all the input files into a dir
        writable: true
        entry: "$({class: 'Directory', listing: inputs.input_files})"
      - entryname: run.sh
        entry: |-
          for i in \$(find some_dir -type f); do cat \$i ; done
stdout: output.txt
inputs:
  input_files:
    type: File[]
outputs:
  output_file:
    type: stdout
workflow.cwl
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow
requirements:
  MultipleInputFeatureRequirement: {}
  ScatterFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  samples:
    type:
      type: array
      items:
        type: record
        fields:
          sampleId: string
steps:
  make_file:
    run: make_file.cwl
    scatter: sample
    in:
      sample: samples
      sampleId:
        valueFrom: ${ return inputs.sample['sampleId']; }
    out:
      [ output_file ]
  gather_files:
    run: cat.cwl
    in:
      input_files: make_file/output_file
    out:
      [ output_file ]
outputs:
  output_file:
    type: File
    outputSource: gather_files/output_file
input.json
{"samples":[ {"sampleId":"Sample1"}, {"sampleId":"Sample2"} ]}
Running the workflow without a container gives this error message:
$ cwltool --outdir output --tmpdir-prefix tmp --tmp-outdir-prefix tmp --leave-tmpdir workflow.cwl input.json
INFO /home/conda/bin/cwltool 3.0.20201203173111
INFO Resolved 'workflow.cwl' to 'file:///home/_test_cwl_initialDir/workflow.cwl'
INFO [workflow ] start
INFO [workflow ] starting step make_file
INFO [step make_file] start
INFO [job make_file] /home/_test_cwl_initialDir/tmp5h392ved$ bash \
run.sh \
Sample1 > /home/_test_cwl_initialDir/tmp5h392ved/output.txt
INFO [job make_file] completed success
INFO [step make_file] start
INFO [job make_file_2] /home/_test_cwl_initialDir/tmpg5gfplqz$ bash \
run.sh \
Sample2 > /home/_test_cwl_initialDir/tmpg5gfplqz/output.txt
INFO [job make_file_2] completed success
INFO [step make_file] completed success
INFO [workflow ] starting step gather_files
INFO [step gather_files] start
ERROR Workflow error, try again with --debug for more information:
File staging conflict, trying to stage both /home/_test_cwl_initialDir/tmp5h392ved/output.txt and /home/_test_cwl_initialDir/tmpg5gfplqz/output.txt to the same target /home/_test_cwl_initialDir/tmpmyxk4qg7/some_dir/output.txt
However, if we modify cat.cwl to un-comment these lines:
  DockerRequirement:
    dockerPull: ubuntu:latest
and re-run with Singularity, no error is shown:
$ cwltool --singularity --outdir output --tmpdir-prefix tmp --tmp-outdir-prefix tmp --leave-tmpdir workflow.cwl input.json
INFO /home/conda/bin/cwltool 3.0.20201203173111
INFO Resolved 'workflow.cwl' to 'file:///home/_test_cwl_initialDir/workflow.cwl'
INFO [workflow ] start
INFO [workflow ] starting step make_file
INFO [step make_file] start
INFO [job make_file] /home/_test_cwl_initialDir/tmp53kze362$ bash \
run.sh \
Sample1 > /home/_test_cwl_initialDir/tmp53kze362/output.txt
INFO [job make_file] completed success
INFO [step make_file] start
INFO [job make_file_2] /home/_test_cwl_initialDir/tmp95ehrjx1$ bash \
run.sh \
Sample2 > /home/_test_cwl_initialDir/tmp95ehrjx1/output.txt
INFO [job make_file_2] completed success
INFO [step make_file] completed success
INFO [workflow ] starting step gather_files
INFO [step gather_files] start
INFO Using local copy of Singularity image found in /home/_test_cwl_initialDir
INFO [job gather_files] /home/_test_cwl_initialDir/tmpqm_u_50b$ singularity \
--quiet \
exec \
--contain \
--ipc \
--pid \
--home \
/home/_test_cwl_initialDir/tmpqm_u_50b:/jPTSKv \
--bind \
/home/_test_cwl_initialDir/tmpzqg0emi1:/tmp:rw \
--pwd \
/jPTSKv \
/home/_test_cwl_initialDir/ubuntu:latest.sif \
bash \
run.sh > /home/_test_cwl_initialDir/tmpqm_u_50b/output.txt
INFO [job gather_files] completed success
INFO [step gather_files] completed success
INFO [workflow ] completed success
{
"output_file": {
"location": "file:///home/_test_cwl_initialDir/output/output.txt",
"basename": "output.txt",
"class": "File",
"checksum": "sha1$24c5362ecd1bfafba85f185f28b12c121d03ee12",
"size": 8,
"path": "/home/_test_cwl_initialDir/output/output.txt"
}
}
INFO Final process status is success
The output of the workflow is:
$ cat output/output.txt
Sample2
Sample1 is not present because the file that contained that string, /home/_test_cwl_initialDir/tmp53kze362/output.txt, got overwritten:
$ tree tmpqm_u_50b/
tmpqm_u_50b/
├── run.sh
└── some_dir
    └── output.txt
$ cat tmpqm_u_50b/some_dir/output.txt
Sample2
It seems like this is triggered by the inclusion of the DockerRequirement in the CWL, since running the CWL without the DockerRequirement but with --singularity enabled still triggers the error as expected.
So I guess the best solution would be for cwltool to trigger the same error message in cases like this.
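To illustrate the kind of check I mean, here is a minimal Python sketch. It is not cwltool's actual implementation: `find_staging_conflicts` is a hypothetical helper, and it assumes files are staged under their basename into a single target directory, as the error message in the non-container run suggests.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_staging_conflicts(sources, target_dir):
    """Return {target_path: [source_paths]} for targets claimed by more than one source.

    Hypothetical helper for illustration only; assumes each source is staged
    into target_dir under its own basename.
    """
    by_target = defaultdict(list)
    for src in sources:
        target = str(PurePosixPath(target_dir) / PurePosixPath(src).name)
        by_target[target].append(src)
    # Keep only the targets that two or more sources would be written to.
    return {t: s for t, s in by_target.items() if len(s) > 1}


# The two scatter outputs from the Singularity run above.
sources = [
    "/home/_test_cwl_initialDir/tmp53kze362/output.txt",
    "/home/_test_cwl_initialDir/tmp95ehrjx1/output.txt",
]
conflicts = find_staging_conflicts(sources, "some_dir")
for target, srcs in conflicts.items():
    print(f"File staging conflict, trying to stage {srcs} to the same target {target}")
```

Running a check like this before the files are staged would let the container and non-container code paths fail in the same way instead of silently overwriting.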
Your Environment
- cwltool version: 3.0.20201203173111
This issue has been mentioned on Common Workflow Language Discourse. There might be relevant details there:
https://cwl.discourse.group/t/too-many-arguments-on-the-command-line/248/31
One method I am using right now to avoid this issue is to insert UUIDs into filenames:
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [ "bash", "run.sh" ]
requirements:
  InitialWorkDirRequirement:
    listing:
      - entryname: run.sh
        entry: |-
          set -euo pipefail
          output_file=\$(python3 -c "import uuid; print('_maf2bed_merged.' + str(uuid.uuid4()) + '.bed')")
          grep -v '#' "$1" | grep -v 'Hugo' | cut -f5-7 | sort -V -k1,1 -k2,2n > "\$output_file"
inputs:
  maf_file:
    type: File
    inputBinding:
      position: 1
outputs:
  output_file:
    type: File
    outputBinding:
      glob: _maf2bed_merged.*.bed
Of course, this only works if you are the one creating the files with duplicate names inside your pipeline; it does not provide any protection against pipeline input files that have the same basename.
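For reference, the `python3 -c` one-liner embedded in run.sh is equivalent to the standalone sketch below. `unique_bed_name` is a name I made up for illustration; the prefix and suffix are the ones from the tool above, and the `fnmatch` check only approximates how the `glob` in the output binding would match the name.

```python
import fnmatch
import uuid


def unique_bed_name(prefix="_maf2bed_merged", suffix=".bed"):
    """Build a filename with a random UUID in the middle (illustrative helper)."""
    return f"{prefix}.{uuid.uuid4()}{suffix}"


name_a = unique_bed_name()
name_b = unique_bed_name()
print(name_a)

# Two calls give different names, so scattered jobs no longer share a basename,
# and each name still matches the pattern used in the output glob.
print(name_a != name_b)
print(fnmatch.fnmatch(name_a, "_maf2bed_merged.*.bed"))
```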
This issue has been mentioned on Common Workflow Language Discourse. There might be relevant details there:
https://cwl.discourse.group/t/how-to-create-a-uuid-inside-cwl/303/1