
"File staging conflict" error is suppressed when using a container; files get overwritten

Open — stevekm opened this issue on Feb 08 '21

This is the same issue as described in cwltool here: https://github.com/common-workflow-language/cwltool/issues/1403

Using the same CWL files as in that issue, take a workflow that uses an InitialWorkDirRequirement like this:

  InitialWorkDirRequirement:
    listing:
      - entryname: some_dir
        writable: true
        entry: "$({class: 'Directory', listing: inputs.input_files})"

Running it with Toil, you get a "File staging conflict" error:

$ toil-cwl-runner workflow.cwl input.json
...
...
[2021-02-08T11:58:02-0500] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
[2021-02-08T11:58:02-0500] [MainThread] [I] [toil] Running Toil version 5.2.0-047d0c4f2949c576c80e452a0807c5be6355c63d on host server.
[2021-02-08T11:58:02-0500] [MainThread] [I] [toil.worker] Working on job 'CWLJob' bash run.sh kind-CWLJob/instance-vjb1bunf
[2021-02-08T11:58:02-0500] [MainThread] [I] [toil.worker] Loaded body Job('CWLJob' bash run.sh kind-CWLJob/instance-vjb1bunf) from description 'CWLJob' bash run.sh kind-CWLJob/instance-vjb1bunf
[2021-02-08T11:58:02-0500] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
[2021-02-08T11:58:02-0500] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-CWLJob/instance-unfjem45/file-cd83ea62575a44098a4fdb15a6dd6790/output.txt' to path '/tmp/node-efa42d50-baaa-41db-b0b9-d26bc000e945-ae225bf2-14b6-4c6a-9f5b-1dfe49b07a51/tmpgz22ltyt/67f17b0c-d724-45b0-8185-498cb56a6ced/tmpm3b65eu7.tmp'
[2021-02-08T11:58:02-0500] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-CWLJob/instance-7gxq1o32/file-e570db2d16284f1fb9359a454f04ba2a/output.txt' to path '/tmp/node-efa42d50-baaa-41db-b0b9-d26bc000e945-ae225bf2-14b6-4c6a-9f5b-1dfe49b07a51/tmpgz22ltyt/67f17b0c-d724-45b0-8185-498cb56a6ced/tmpeef58r6u.tmp'
[2021-02-08T11:58:02-0500] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-CWLJob/instance-unfjem45/file-cd83ea62575a44098a4fdb15a6dd6790/output.txt' to path '/tmp/node-efa42d50-baaa-41db-b0b9-d26bc000e945-ae225bf2-14b6-4c6a-9f5b-1dfe49b07a51/tmpgz22ltyt/67f17b0c-d724-45b0-8185-498cb56a6ced/tmp266g1809.tmp'
[2021-02-08T11:58:02-0500] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-CWLJob/instance-7gxq1o32/file-e570db2d16284f1fb9359a454f04ba2a/output.txt' to path '/tmp/node-efa42d50-baaa-41db-b0b9-d26bc000e945-ae225bf2-14b6-4c6a-9f5b-1dfe49b07a51/tmpgz22ltyt/67f17b0c-d724-45b0-8185-498cb56a6ced/tmpnjzcsd2f.tmp'
Traceback (most recent call last):
  File "/home/conda/lib/python3.7/site-packages/toil/worker.py", line 394, in workerScript
    job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
  File "/home/conda/lib/python3.7/site-packages/toil/job.py", line 2359, in _runner
    returnValues = self._run(jobGraph=None, fileStore=fileStore)
  File "/home/conda/lib/python3.7/site-packages/toil/job.py", line 2280, in _run
    return self.run(fileStore)
  File "/home/conda/lib/python3.7/site-packages/toil/cwl/cwltoil.py", line 1222, in run
    logger=cwllogger,
  File "/home/conda/lib/python3.7/site-packages/cwltool/executors.py", line 150, in execute
    self.run_jobs(process, job_order_object, logger, runtime_context)
  File "/home/conda/lib/python3.7/site-packages/cwltool/executors.py", line 257, in run_jobs
    job.run(runtime_context)
  File "/home/conda/lib/python3.7/site-packages/cwltool/job.py", line 566, in run
    secret_store=runtimeContext.secret_store,
  File "/home/conda/lib/python3.7/site-packages/cwltool/process.py", line 287, in stage_files
    % (targets[entry.target].resolved, entry.resolved, entry.target)
cwltool.errors.WorkflowException: File staging conflict, trying to stage both /tmp/node-efa42d50-baaa-41db-b0b9-d26bc000e945-ae225bf2-14b6-4c6a-9f5b-1dfe49b07a51/tmpgz22ltyt/67f17b0c-d724-45b0-8185-498cb56a6ced/tmpm3b65eu7.tmp and /tmp/node-efa42d50-baaa-41db-b0b9-d26bc000e945-ae225bf2-14b6-4c6a-9f5b-1dfe49b07a51/tmpgz22ltyt/67f17b0c-d724-45b0-8185-498cb56a6ced/tmpeef58r6u.tmp to the same target /tmp/node-efa42d50-baaa-41db-b0b9-d26bc000e945-ae225bf2-14b6-4c6a-9f5b-1dfe49b07a51/tmpgz22ltyt/67f17b0c-d724-45b0-8185-498cb56a6ced/tjxzkadyk/tmp-oute3hzotys/some_dir/output.txt
[2021-02-08T11:58:02-0500] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host server

However, if you add a container requirement:

  DockerRequirement:
    dockerPull: ubuntu:latest

The "File staging conflict" error does not occur, and the files silently overwrite each other when staged into the directory:

$ toil-cwl-runner --singularity workflow.cwl input.json
...

[2021-02-08T12:05:28-0500] [MainThread] [I] [toil.leader] Finished toil run successfully.

Workflow Progress 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 (0 failures) [00:50<00:00, 0.20 jobs/s]
{
    "output_file": {
        "location": "file:///home/_test_cwl_initialDir/output.txt",
        "basename": "output.txt",
        "nameroot": "output",
        "nameext": ".txt",
        "class": "File",
        "checksum": "sha1$24c5362ecd1bfafba85f185f28b12c121d03ee12",
        "size": 8
    }
}[2021-02-08T12:05:28-0500] [MainThread] [I] [toil.common] Successfully deleted the job store: FileJobStore(/tmp/tmp3iv1zy24)

$ cat output.txt
Sample2

The "File staging conflict" error needs to be raised in this case as well, so that you can be sure you are not silently losing files in your workflow.

Issue is synchronized with Jira Task TOIL-793

— stevekm, Feb 08 '21

@stevekm Thank you for raising this issue. A fix in cwltool should carry over to Toil. We'll try to stay apprised of developments there and update accordingly.

— DailyDreaming, Feb 10 '21

It looks like this problem arises when the user's CWL code constructs a CWL Directory object whose listing is not actually acceptable as a Directory listing, because it contains multiple entries with the same name and thus can never be physically realized on disk.

This doesn't happen merely because two input File objects happen to have the same basename, e.g. when you pass them both to a command-line tool. That works regardless of whether Singularity is in use, right?

cwltool can sometimes detect a broken Directory at the file staging step, and instead of showing you an arbitrary one of the files, it fails the whole workflow. But when using Singularity, it instead just shows you an arbitrary one of the files. Toil gets basically the same behavior as cwltool because it calls into it.
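
For reference, the check that fires in the non-container run (visible in the traceback above, in cwltool's stage_files) boils down to mapping each staging target back to its source and refusing when two different sources claim the same target. A paraphrased Python sketch, not cwltool's actual code:

# Illustrative paraphrase of the conflict check from the traceback
# above (cwltool/process.py, stage_files); not the real implementation.

class WorkflowException(Exception):
    """Stand-in for cwltool.errors.WorkflowException."""

def detect_staging_conflicts(entries):
    """entries: iterable of (resolved_source, target) path pairs."""
    targets = {}
    for resolved, target in entries:
        if target in targets and targets[target] != resolved:
            raise WorkflowException(
                "File staging conflict, trying to stage both %s and %s to "
                "the same target %s" % (targets[target], resolved, target))
        targets[target] = resolved

The container staging path apparently never consults such a map, which is why the same collision resolves silently into an overwrite there.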

I feel like this should really be detected at a different point. InitialWorkDirRequirement should have its listing pre-checked before we attempt to stage it, and if the listing tree is self-contradictory we should refuse to continue, without trying to actually stage anything.
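
Something like this hypothetical pre-check could run over the listing before anything is staged; the function name and error are invented, and it assumes File/Directory objects are plain dicts in the usual cwljson shape:

def check_listing(directory, path=""):
    """Hypothetical sketch: reject a Directory whose listing can't be
    realized on disk because two entries would occupy the same name."""
    seen = set()
    for entry in directory.get("listing", []):
        # Dirent-style entries carry an entryname; File/Directory
        # literals carry a basename.
        name = entry.get("entryname") or entry.get("basename", "")
        where = "%s/%s" % (path, name)
        if name in seen:
            raise ValueError(
                "Self-contradictory listing: two entries both map to %s"
                % where)
        seen.add(name)
        if entry.get("class") == "Directory":
            check_listing(entry, where)

Run over the Directory literal from the first comment, with its two output.txt entries, this fails on the second entry before any staging is attempted.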

I'm also not sure the fix for this belongs in Toil, though.

— adamnovak, Aug 01 '22

I've tested this on Toil commit 5280227633703372ce06923f2164bcf1aad65a0d. Without the DockerRequirement we get the file staging conflict; with the DockerRequirement we don't, but I also see both Sample1 and Sample2 in the output file:

{
    "output_file": {
        "location": "file:///Users/anovak/workspace/toil/output.txt",
        "basename": "output.txt",
        "nameroot": "output",
        "nameext": ".txt",
        "class": "File",
        "checksum": "sha1$4c092e98f5dc520a853c1f2f1db4e1b14fe2955d",
        "size": 16
    }
}[2023-04-20T16:23:00-0400] [MainThread] [I] [toil.common] Successfully deleted the job store: FileJobStore(/var/folders/0n/4y413_9s7y70lmm3yhtt3b8m0000gq/T/tmp4oyctazw)
(venv) [anovak@swords toil]% cat output.txt
Sample1
Sample2

So it could be that cwltool changed something here.

Do we really want to force the file staging conflict error if cwltool is working around it? Probably, for portability's sake...

— adamnovak, Apr 20 '23

It could also be that I am testing with Docker, and we only run into the problem of lost files on Singularity.

— adamnovak, Apr 20 '23