"File staging conflict" error is suppressed when using a container; files get overwritten

Open stevekm opened this issue 3 years ago • 3 comments

When you stage input files with an InitialWorkDirRequirement and two of the files have the same name, cwltool normally raises an error that halts the workflow. However, if you are using a Singularity container, no error is raised and the files are silently overwritten.

Here is code that reproduces the problem:

make_file.cwl

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool
baseCommand: [ "bash", "run.sh" ]

stdout: output.txt

requirements:
  InitialWorkDirRequirement:
    listing:
      - entryname: run.sh
        entry: |-
          echo "$1"

inputs:
  sampleId:
    type: string
    inputBinding:
      position: 1

outputs:
  output_file:
    type: stdout

cat.cwl

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool
baseCommand: [ "bash", "run.sh" ]

requirements:
  # DockerRequirement: # <- un-commenting this is what suppresses the staging conflict error
  #   dockerPull: ubuntu:latest
  InitialWorkDirRequirement:
    listing:
      - entryname: some_dir # <- put all the input files into a dir
        writable: true
        entry: "$({class: 'Directory', listing: inputs.input_files})"
      - entryname: run.sh
        entry: |-
          for i in \$(find some_dir -type f); do cat \$i ; done

stdout: output.txt

inputs:
  input_files:
    type: File[]

outputs:
  output_file:
    type: stdout

workflow.cwl

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow
requirements:
  MultipleInputFeatureRequirement: {}
  ScatterFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  SubworkflowFeatureRequirement: {}

inputs:
  samples:
    type:
      type: array
      items:
        type: record
        fields:
          sampleId: string

steps:
  make_file:
    run: make_file.cwl
    scatter: sample
    in:
      sample: samples
      sampleId:
        valueFrom: ${ return inputs.sample['sampleId']; }
    out:
      [ output_file ]

  gather_files:
    run: cat.cwl
    in:
      input_files: make_file/output_file
    out:
      [ output_file ]

outputs:
  output_file:
    type: File
    outputSource: gather_files/output_file

input.json

{"samples":[ {"sampleId":"Sample1"}, {"sampleId":"Sample2"} ]}

Running the command without a container gives this error message:

$ cwltool --outdir output --tmpdir-prefix tmp --tmp-outdir-prefix tmp --leave-tmpdir workflow.cwl input.json
INFO /home/conda/bin/cwltool 3.0.20201203173111
INFO Resolved 'workflow.cwl' to 'file:///home/_test_cwl_initialDir/workflow.cwl'
INFO [workflow ] start
INFO [workflow ] starting step make_file
INFO [step make_file] start
INFO [job make_file] /home/_test_cwl_initialDir/tmp5h392ved$ bash \
    run.sh \
    Sample1 > /home/_test_cwl_initialDir/tmp5h392ved/output.txt
INFO [job make_file] completed success
INFO [step make_file] start
INFO [job make_file_2] /home/_test_cwl_initialDir/tmpg5gfplqz$ bash \
    run.sh \
    Sample2 > /home/_test_cwl_initialDir/tmpg5gfplqz/output.txt
INFO [job make_file_2] completed success
INFO [step make_file] completed success
INFO [workflow ] starting step gather_files
INFO [step gather_files] start
ERROR Workflow error, try again with --debug for more information:
File staging conflict, trying to stage both /home/_test_cwl_initialDir/tmp5h392ved/output.txt and /home/_test_cwl_initialDir/tmpg5gfplqz/output.txt to the same target /home/_test_cwl_initialDir/tmpmyxk4qg7/some_dir/output.txt

However, if we modify cat.cwl to un-comment these lines:

  DockerRequirement:
    dockerPull: ubuntu:latest

and re-run with Singularity, no error is shown:

$ cwltool --singularity --outdir output --tmpdir-prefix tmp --tmp-outdir-prefix tmp --leave-tmpdir workflow.cwl input.json
INFO /home/conda/bin/cwltool 3.0.20201203173111
INFO Resolved 'workflow.cwl' to 'file:///home/_test_cwl_initialDir/workflow.cwl'
INFO [workflow ] start
INFO [workflow ] starting step make_file
INFO [step make_file] start
INFO [job make_file] /home/_test_cwl_initialDir/tmp53kze362$ bash \
    run.sh \
    Sample1 > /home/_test_cwl_initialDir/tmp53kze362/output.txt
INFO [job make_file] completed success
INFO [step make_file] start
INFO [job make_file_2] /home/_test_cwl_initialDir/tmp95ehrjx1$ bash \
    run.sh \
    Sample2 > /home/_test_cwl_initialDir/tmp95ehrjx1/output.txt
INFO [job make_file_2] completed success
INFO [step make_file] completed success
INFO [workflow ] starting step gather_files
INFO [step gather_files] start
INFO Using local copy of Singularity image found in /home/_test_cwl_initialDir
INFO [job gather_files] /home/_test_cwl_initialDir/tmpqm_u_50b$ singularity \
    --quiet \
    exec \
    --contain \
    --ipc \
    --pid \
    --home \
    /home/_test_cwl_initialDir/tmpqm_u_50b:/jPTSKv \
    --bind \
    /home/_test_cwl_initialDir/tmpzqg0emi1:/tmp:rw \
    --pwd \
    /jPTSKv \
    /home/_test_cwl_initialDir/ubuntu:latest.sif \
    bash \
    run.sh > /home/_test_cwl_initialDir/tmpqm_u_50b/output.txt
INFO [job gather_files] completed success
INFO [step gather_files] completed success
INFO [workflow ] completed success
{
    "output_file": {
        "location": "file:///home/_test_cwl_initialDir/output/output.txt",
        "basename": "output.txt",
        "class": "File",
        "checksum": "sha1$24c5362ecd1bfafba85f185f28b12c121d03ee12",
        "size": 8,
        "path": "/home/_test_cwl_initialDir/output/output.txt"
    }
}
INFO Final process status is success

The output of the workflow is:

$ cat output/output.txt
Sample2

Sample1 is missing because the staged copy of the file that contained that string, /home/_test_cwl_initialDir/tmp53kze362/output.txt, was overwritten in some_dir by the second file of the same name:

$ tree tmpqm_u_50b/
tmpqm_u_50b/
├── run.sh
└── some_dir
    └── output.txt

$ cat tmpqm_u_50b/some_dir/output.txt
Sample2

It seems this is triggered by the inclusion of the DockerRequirement in the CWL, since running the CWL without the DockerRequirement but with --singularity enabled still triggers the error as expected.

So I guess the best solution would be for cwltool to trigger the same error message in cases like this.
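
The check being asked for amounts to detecting two source files that map to the same staging target. A minimal Python sketch of that idea follows. It is not cwltool's actual implementation; `find_staging_conflicts` and its arguments are hypothetical names used only for illustration, and it assumes every file is staged under its basename.

```python
import os
from collections import defaultdict


def find_staging_conflicts(source_paths, target_dir):
    """Return {target: [sources]} for each target claimed by more than one source.

    Hypothetical helper for illustration only; not part of cwltool.
    """
    targets = defaultdict(list)
    for src in source_paths:
        # Assumption: files are staged under their basename,
        # so two sources with equal basenames collide.
        target = os.path.join(target_dir, os.path.basename(src))
        targets[target].append(src)
    return {t: srcs for t, srcs in targets.items() if len(srcs) > 1}


# The two scattered outputs from the example workflow share a basename.
conflicts = find_staging_conflicts(
    ["tmp5h392ved/output.txt", "tmpg5gfplqz/output.txt"],
    "some_dir",
)
for target, sources in conflicts.items():
    print(f"File staging conflict: {sources} -> {target}")
```

Running a check like this before staging, with or without a container, would turn the silent overwrite into an explicit failure.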

Your Environment

  • cwltool version: 3.0.20201203173111

stevekm · Feb 08 '21 16:02

This issue has been mentioned on Common Workflow Language Discourse. There might be relevant details there:

https://cwl.discourse.group/t/too-many-arguments-on-the-command-line/248/31

cwl-bot · Feb 08 '21 17:02

One workaround I am using right now to avoid this issue is to insert UUIDs into the filenames:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool
baseCommand: [ "bash", "run.sh" ]

requirements:
  InitialWorkDirRequirement:
    listing:
      - entryname: run.sh
        entry: |-
          set -euo pipefail
          output_file=\$(python3 -c "import uuid; print('_maf2bed_merged.' + str(uuid.uuid4()) + '.bed')")
          grep -v '#' "$1" | grep -v 'Hugo' | cut -f5-7 | sort -V -k1,1 -k2,2n > "\$output_file"

inputs:
  maf_file:
    type: File
    inputBinding:
      position: 1

outputs:
  output_file:
    type: File
    outputBinding:
      glob: _maf2bed_merged.*.bed

Of course, this only works if you are the one creating the files with duplicate names inside your pipeline; it provides no protection against pipeline input files that share the same basename.
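
The naming scheme used in run.sh above can be sketched as a standalone Python function. This is only an illustration of the same idea; `unique_bed_name` is a hypothetical name, and the prefix and suffix are taken from the tool definition above.

```python
import fnmatch
import uuid


def unique_bed_name(prefix="_maf2bed_merged", suffix=".bed"):
    """Build a filename with a random UUID between prefix and suffix.

    Hypothetical helper mirroring the python3 one-liner in run.sh.
    """
    return f"{prefix}.{uuid.uuid4()}{suffix}"


first = unique_bed_name()
second = unique_bed_name()

# Each call yields a different name, so scattered jobs do not collide,
# while the tool's output glob still matches every generated name.
print(first != second)
print(fnmatch.fnmatch(first, "_maf2bed_merged.*.bed"))
```

Because the UUID sits between a fixed prefix and suffix, the `glob` in the tool's `outputBinding` keeps working without knowing the generated value.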

stevekm · Feb 09 '21 15:02

This issue has been mentioned on Common Workflow Language Discourse. There might be relevant details there:

https://cwl.discourse.group/t/how-to-create-a-uuid-inside-cwl/303/1

cwl-bot · Feb 10 '21 19:02