cwltool
cwltool copied to clipboard
Docker execution will hang waiting for the CID file id the container cannot be created
I ran into a problem with cwltool running as part of a Toil workflow. It tried and failed to launch a Docker container, and then sat there forever, with an unreaped, defunct child docker process. Attaching gdb showed the Python process sleeping. Here's the relevant part of the Toil worker log:
[2021-02-17T10:08:41-0800] [MainThread] [D] [toil.cwl.cwltoil] Runtime Context environment: {}
[2021-02-17T10:08:41-0800] [MainThread] [D] [toil.cwl.cwltoil] Running CWL job: {'r': {'location': 'toilfs:2:0:files/for-job/kind-CWLJob/instance-wflk78uk/file-d57a54b0cb6f46c9803b5dea2821d231/out.txt', 'basename': 'out.txt', 'nameroot': 'out', 'nameext': '.txt', 'class': 'File', 'checksum': 'sha1$a3db5c13ff90a36963278c6a39e4ee3c22e2a436', 'size': 2, 'http://commonwl.org/cwltool#generation': 0}, 'script': ordereddict([('class', 'File'), ('location', 'file:///public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tests/updateval.py')])}
[2021-02-17T10:08:41-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] initializing from file:///public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tests/updateval_inplace.cwl
[2021-02-17T10:08:41-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] {
"r": {
"location": "toilfs:2:0:files/for-job/kind-CWLJob/instance-wflk78uk/file-d57a54b0cb6f46c9803b5dea2821d231/out.txt",
"basename": "out.txt",
"nameroot": "out",
"nameext": ".txt",
"class": "File",
"checksum": "sha1$a3db5c13ff90a36963278c6a39e4ee3c22e2a436",
"size": 2,
"http://commonwl.org/cwltool#generation": 0
},
"script": {
"class": "File",
"location": "file:///public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tests/updateval.py",
"basename": "updateval.py",
"nameroot": "updateval",
"nameext": ".py",
"size": 99
}
}
[2021-02-17T10:08:42-0800] [MainThread] [D] [toil.jobStores.abstractJobStore] Unable to import 'toil.jobStores.googleJobStore' as is expected if the corresponding extra was omitted at installation time.
[2021-02-17T10:08:42-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] path mappings is {
"toilfs:2:0:files/for-job/kind-CWLJob/instance-wflk78uk/file-d57a54b0cb6f46c9803b5dea2821d231/out.txt": [
"/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/tmpbyfabii6.tmp",
"/CnduBo/out.txt",
"WritableFile",
false
],
"file:///public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tests/updateval.py": [
"/data/tmp/tmpm20vzgqc/files/no-job/file-53f5505d4451415c9f0c56c0e2fb8b0b/updateval.py",
"/var/lib/cwl/stg1487b8f7-f0c4-44cb-9fb6-2b380ff1edf5/updateval.py",
"File",
true
]
}
[2021-02-17T10:08:42-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] command line bindings is [
{
"position": [
0,
0
],
"datum": "python"
},
{
"position": [
0,
1
],
"valueFrom": "$(inputs.script)"
},
{
"position": [
0,
2
],
"valueFrom": "$(inputs.r.basename)"
}
]
[2021-02-17T10:08:59-0800] [MainThread] [I] [cwltool] ['docker', 'pull', 'python:2.7.15-alpine3.7']
2.7.15-alpine3.7: Pulling from library/python
48ecbb6b270e: Pulling fs layer
81f9ab63a5a5: Pulling fs layer
d11afbf926bd: Pulling fs layer
502a70b94b66: Pulling fs layer
502a70b94b66: Waiting
81f9ab63a5a5: Verifying Checksum
81f9ab63a5a5: Download complete
48ecbb6b270e: Verifying Checksum
48ecbb6b270e: Download complete
502a70b94b66: Verifying Checksum
502a70b94b66: Download complete
d11afbf926bd: Verifying Checksum
d11afbf926bd: Download complete
48ecbb6b270e: Pull complete
81f9ab63a5a5: Pull complete
d11afbf926bd: Pull complete
502a70b94b66: Pull complete
Digest: sha256:95bdfc0e9fbf57ee252ede6fa3d81dc5d7739aab6b867558f22d06b1c1d9ad81
Status: Downloaded newer image for python:2.7.15-alpine3.7
[2021-02-17T10:09:10-0800] [MainThread] [D] [cwltool] [job updateval_inplace.cwl] initial work dir {
"toilfs:2:0:files/for-job/kind-CWLJob/instance-wflk78uk/file-d57a54b0cb6f46c9803b5dea2821d231/out.txt": [
"/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/tmpaf9vmqmz.tmp",
"/CnduBo/out.txt",
"WritableFile",
true
]
}
[2021-02-17T10:09:10-0800] [MainThread] [W] [cwltool] [job updateval_inplace.cwl] Skipping Docker software container '--memory' limit despite presence of ResourceRequirement with ramMin and/or ramMax setting. Consider running with --strict-memory-limit for increased portability assurance.
[2021-02-17T10:09:10-0800] [MainThread] [I] [cwltool] [job updateval_inplace.cwl] /public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/t1cuciawz/tmp-outecotb164$ docker \
run \
-i \
--mount=type=bind,source=/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/t1cuciawz/tmp-outecotb164,target=/CnduBo \
--mount=type=bind,source=/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/t5p82nbyd_3dptcqo,target=/tmp \
--mount=type=bind,source=/data/tmp/tmpm20vzgqc/files/no-job/file-53f5505d4451415c9f0c56c0e2fb8b0b/updateval.py,target=/var/lib/cwl/stg1487b8f7-f0c4-44cb-9fb6-2b380ff1edf5/updateval.py,readonly \
--mount=type=bind,source=/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/tmpaf9vmqmz.tmp,target=/CnduBo/out.txt \
--workdir=/CnduBo \
--read-only=true \
--net=none \
--user=1974:2000 \
--rm \
--env=TMPDIR=/tmp \
--env=HOME=/CnduBo \
--cidfile=/public/home/anovak/build/toil/src/toil/test/cwl/spec_v11/tmpy1ihswb0/node-343ddfe4-83d1-454b-a320-83ec5cf72a5f-b702cc0a-c010-4a37-9a9f-e833b37b96c8/tmpqnr_w8tm/fc2c9aac-0ceb-44ea-9806-849ee279095a/t5p82nbydpbcbr5pz/20210217100910-977042.cid \
python:2.7.15-alpine3.7 \
python \
/var/lib/cwl/stg1487b8f7-f0c4-44cb-9fb6-2b380ff1edf5/updateval.py \
out.txt
docker: Error response from daemon: invalid mount config for type "bind": bind source path does not exist.
See 'docker run --help'.
I think that, because the Docker daemon never managed to create the container, the monitoring process here is stuck waiting for a CID file that will never come:
https://github.com/common-workflow-language/cwltool/blob/70dafe0d36ac40246cb2629c08993e68ef36165d/cwltool/job.py#L848-L864
And because _job_popen() runs the monitoring function to completion before checking to see if the child process being monitored is alive, this turns into a deadlock:
https://github.com/common-workflow-language/cwltool/blob/70dafe0d36ac40246cb2629c08993e68ef36165d/cwltool/job.py#L972-L974
As for why the Docker call is failing, even when cwltool is supposed to be creating the directories being bind mounted, my guess is that I'm running over a not particularly consistent shared file system. When I went in to look, all four of the sources for the bind mounts existed; perhaps it's possible for the Docker daemon to not see the directory creation as having happened yet, and the filesystem is only really guaranteeing that it won't be able to make a file with the same name.
cwltool should detect when the Docker daemon fails before making the CID file, instead of hanging. If possible, it should detect when the Docker daemon is complaining that it can't see directories we made, and retry for a little bit of real time to allow any shared filesystems to settle.
Moving all the working directories involved to a local /tmp instead of a shared filesystem seems to be a workaround for the problem, which is consistent with the theory that the Docker container startup is failing due to a race with the visibility of the new directories.
Hey @adamnovak , thanks for reporting this!
cwltoolshould detect when the Docker daemon fails before making the CID file, instead of hanging. If possible, it should detect when the Docker daemon is complaining that it can't see directories we made, and retry for a little bit of real time to allow any shared filesystems to settle.
Does docker: Error response from daemon: invalid mount config for type "bind": bind source path does not exist. result in a unique exit code?
https://docs.docker.com/engine/reference/run/#exit-status
We should special case the exit codes 125, 126, and 127, regardless. Perhaps we add them to the temporaryFail list?
Currently, it appears that neither cwltool nor toil-cwl-runner, arvados-cwl-runner, nor REANA does discriminate between a temporaryFail and a permanentFail.
So if we go that route, then there should be logic to retry any temporaryFail perhaps inside JobBase._execute() itself
I would confine the retry logic to the docker call to create the container, instead of backing out to the whole job. It needs to re-run with the same input paths to give their creation a chance to become visible.
This issue has been mentioned on Common Workflow Language Discourse. There might be relevant details there:
https://cwl.discourse.group/t/docker-error-invalid-mount-config-for-type-bind/598/1