cromwell

Code to repro broken reference disks in GCP Batch [WX-1819]

Open mcovarr opened this issue 6 months ago • 0 comments

Description

Reference disks currently appear to be broken in the GCP Batch backend. This PR adds a little bit of Centaur infrastructure and a copy/paste/modify of a basic Papi v2 reference disk test to demonstrate the issues.

Currently when I submit the Centaur test added in this PR, the job fails before invoking the user command with an exit code of 125, which appears to be a Docker error. Indeed in Logs Explorer I see this:

docker: Error response from daemon: invalid mode: async, rw.

This apparently stems from this code, which explicitly specifies read-write and async for the reference volumes to be mounted. From the docs, async does not appear to be an option for non-NFS Docker volumes.
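As a rough illustration of why Docker rejects this mode string, the sketch below checks the third field of a `-v` spec against the options I believe Docker's bind-mount syntax accepts. The `KNOWN_OPTIONS` set and the `invalid_mode_options` helper are assumptions made for illustration only; they are not Docker's actual parser, and the list may not be exhaustive.

```python
# Options assumed to be valid in the third field of `docker run -v src:dst:opts`.
# This set is an assumption based on the Docker bind-mount docs, not Docker's parser.
KNOWN_OPTIONS = {
    "ro", "rw",                      # read-only / read-write
    "z", "Z",                        # SELinux relabeling
    "nocopy",                        # volume copy behavior
    "shared", "slave", "private",    # bind propagation
    "rshared", "rslave", "rprivate",
    "consistent", "cached", "delegated",
}


def invalid_mode_options(volume_spec):
    """Return the options in a src:dst:opts spec that are not in KNOWN_OPTIONS."""
    parts = volume_spec.split(":")
    if len(parts) < 3:
        return []  # no mode field at all
    return [opt for opt in parts[2].split(",") if opt not in KNOWN_OPTIONS]


if __name__ == "__main__":
    for spec in (
        "/mnt/11a4324d4472f639f3fc558b00afeacd:/mnt/11a4324d4472f639f3fc558b00afeacd:async, rw",
        "/mnt/11a4324d4472f639f3fc558b00afeacd:/mnt/11a4324d4472f639f3fc558b00afeacd:ro",
    ):
        print(spec.rsplit(":", 1)[-1], "->", invalid_mode_options(spec))
```

Under these assumptions both `async` and ` rw` (with its leading space) are flagged for the Batch spec, while the `ro` spec passes.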

Inspecting the batch job description, I see a command like this:

"printf '%s %s\\n' \"$(date -u '+%Y/%m/%d %H:%M:%S')\" Running\\ user\\ runnable:\\ docker\\ run\\ -v\\ /mnt/disks/cromwell_root:/mnt/disks/cromwell_root\\ -v\\ /mnt/11a4324d4472f639f3fc558b00afeacd:/mnt/11a4324d4472f639f3fc558b00afeacd:async,\\\\\\ rw\\ -v\\ /mnt/d9e025138b28caa42dd4006fc3636661:/mnt/d9e025138b28caa42dd4006fc3636661:async,\\\\\\ rw\\ --entrypoint\\=/bin/bash\\ ubuntu@sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee\\ /mnt/disks/cromwell_root/script"

i.e., explicitly specifying async, rw. By comparison, the working Papi v2 reference disk system explicitly specifies ro:

"printf '%s %s\\n' \"$(date -u '+%Y/%m/%d %H:%M:%S')\" Running\\ user\\ action:\\ docker\\ run\\ -v\\ /mnt/local-disk:/cromwell_root\\ -v\\ /mnt/d-312601206d5deb55b631d02269f3b3a5:/mnt/11a4324d4472f639f3fc558b00afeacd:ro\\ -v\\ /mnt/d-c74a541aa27f13cfe59c2f998a664729:/mnt/d9e025138b28caa42dd4006fc3636661:ro\\ --entrypoint\\=/bin/bash\\ ubuntu@sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee\\ /cromwell_root/script"
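The layers of escaping in these job descriptions make the mount options hard to read by eye. The sketch below unwinds them for the failing Batch command, assuming the string is a JSON string literal whose last `printf` argument is the `docker run` command shell-quoted once more. The `docker_volume_args` helper is hypothetical and only uses `json` and `shlex` from the standard library.

```python
import json
import shlex

# JSON string literal copied from the failing Batch job description,
# split across raw-string pieces only for readability.
RAW = (
    r'''"printf '%s %s\\n' \"$(date -u '+%Y/%m/%d %H:%M:%S')\" '''
    r'''Running\\ user\\ runnable:\\ docker\\ run\\ '''
    r'''-v\\ /mnt/disks/cromwell_root:/mnt/disks/cromwell_root\\ '''
    r'''-v\\ /mnt/11a4324d4472f639f3fc558b00afeacd:/mnt/11a4324d4472f639f3fc558b00afeacd:async,\\\\\\ rw\\ '''
    r'''-v\\ /mnt/d9e025138b28caa42dd4006fc3636661:/mnt/d9e025138b28caa42dd4006fc3636661:async,\\\\\\ rw\\ '''
    r'''--entrypoint\\=/bin/bash\\ '''
    r'''ubuntu@sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee\\ '''
    r'''/mnt/disks/cromwell_root/script"'''
)


def docker_volume_args(json_literal):
    """Return the values following each -v in the logged docker command."""
    command = json.loads(json_literal)   # undo the JSON escaping
    logged = shlex.split(command)[-1]    # last printf argument: the logged command
    argv = shlex.split(logged)           # undo the shell quoting of that command
    return [argv[i + 1] for i, arg in enumerate(argv[:-1]) if arg == "-v"]


if __name__ == "__main__":
    for volume in docker_volume_args(RAW):
        print(repr(volume))
```

Run against the Batch command, the two reference disk mounts come out with the mode `async, rw`, including the embedded space. The same helper should apply to the Papi v2 command, where the corresponding mounts end in `ro`.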

I attempted to modify the GCP Batch backend to pass ro, but for some reason the ro does not seem to make it to the Docker command line:

"printf '%s %s\\n' \"$(date -u '+%Y/%m/%d %H:%M:%S')\" Running\\ user\\ runnable:\\ docker\\ run\\ -v\\ /mnt/disks/cromwell_root:/mnt/disks/cromwell_root\\ -v\\ /mnt/11a4324d4472f639f3fc558b00afeacd:/mnt/11a4324d4472f639f3fc558b00afeacd\\ -v\\ /mnt/d9e025138b28caa42dd4006fc3636661:/mnt/d9e025138b28caa42dd4006fc3636661\\ --entrypoint\\=/bin/bash\\ ubuntu@sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee\\ /mnt/disks/cromwell_root/script"

However, ro does seem to be applied to the volume specifications:

"volumes": [
    "/mnt/disks/cromwell_root:/mnt/disks/cromwell_root:rw",
    "/mnt/11a4324d4472f639f3fc558b00afeacd:/mnt/11a4324d4472f639f3fc558b00afeacd:ro",
    "/mnt/d9e025138b28caa42dd4006fc3636661:/mnt/d9e025138b28caa42dd4006fc3636661:ro"
]

The main volume is read-write as expected, and the two volumes corresponding to the reference disks are read-only. However, the reference volumes being read-only seems to be an issue for Docker:

docker: Error response from daemon: error while creating mount source path '/mnt/11a4324d4472f639f3fc558b00afeacd': mkdir /mnt/11a4324d4472f639f3fc558b00afeacd: read-only file system.

Release Notes Confirmation

CHANGELOG.md

  • [ ] I updated CHANGELOG.md in this PR
  • [x] I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • [ ] I added a suggested release notes entry in this Jira ticket
  • [x] I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

mcovarr, Aug 20 '24 22:08