cromwell
HTTP input file names not respected in execution VM
Hi all; In testing release 35 with CWL inputs, I've also been looking at supporting remote URL references. This works correctly for GS URLs but not for HTTP URLs. I've put together a test case that demonstrates the problem:
https://github.com/bcbio/test_bcbio_cwl/tree/master/gcp
The somatic-workflow-http CWL workflow uses HTTP URLs and doesn't work, while the comparable somatic-workflow CWL workflow references the same data via GS URLs and does work.
The workflow fails with:
java.io.FileNotFoundException: Cannot hash file https://storage.googleapis.com/bcbiodata/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa
when running tasks. The files get downloaded into the input directories, but they are given numerical names instead of the original file names, so they never seem to get translated correctly for the workflow:
ls -lh cromwell_work/cromwell-executions/main-somatic.cwl/eaa632df-52a8-4aae-826f-647a42fa7145/call-prep_samples_to_rec/inputs/1515144/
total 136K
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 225050424226294657
-rw------- 2 chapmanb chapmanb 43 Sep 26 14:07 2612405277530248055
-rw------- 2 chapmanb chapmanb 43 Sep 26 14:07 503001634356675169
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 5802330287039666628
-rw------- 2 chapmanb chapmanb 43 Sep 26 14:07 5809676514510180826
-rw------- 2 chapmanb chapmanb 43 Sep 26 14:07 6090832304768530540
-rw------- 2 chapmanb chapmanb 43 Sep 26 14:07 6105514522473810611
-rw------- 3 chapmanb chapmanb 37K Sep 26 14:07 6807576659333162957
-rw------- 3 chapmanb chapmanb 150 Sep 26 14:07 6853384576121493061
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 7483350933664987331
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 7538690575330349970
-rw------- 3 chapmanb chapmanb 37K Sep 26 14:07 7691692211431528147
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 7783203266940950463
-rw------- 3 chapmanb chapmanb 150 Sep 26 14:07 8389565043859020157
-rw------- 2 chapmanb chapmanb 43 Sep 26 14:07 8932347409858620277
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 993751307168383758
My configuration is:
engine {
  filesystems {
    gcs {
      auth = "application-default"
    }
    http {}
  }
}
backend {
  providers {
    Local {
      config {
        filesystems {
          http {}
        }
      }
    }
  }
}
Am I doing anything wrong with my configuration or setup that I could tweak? Thanks so much for any pointers/suggestions.
The hash failures are expected with HTTP inputs and should not be the cause of your workflow failure. Also, we don't currently support http in engine filesystems. Do you see any other error messages that might provide some insight into what's happening?
Thanks so much for helping with debugging this.
Beyond the hash failure from Cromwell, the other errors I get are all from the workflow itself, caused by the original file names not being preserved. The numerical hashes for files get passed directly into the downstream tools, stripping off any extensions or other identifying information. This confuses the tools; for example, tabix can't tell that a file was already gzipped:
ValueError: Unexpected tabix input: /home/chapmanb/drive/work/cwl/test_bcbio_cwl/gcp/cromwell_work/cromwell-executions/main-somatic.cwl/93ef2d1c-88ee-4dc2-af0a-e0ea86bc785e/call-prep_samples/shard-0/execution/bedprep/cleaned-8539016497173364825.gz
or bwa can't find all the other associated indices:
bwa mem /home/chapmanb/drive/work/cwl/test_bcbio_cwl/gcp/cromwell_work/cromwell-executions/main-somatic.cwl/93ef2d1c-88ee-4dc2-af0a-e0ea86bc785e/call-alignment/shard-1/wf-alignment.cwl/96d7b606-e0fe-4305-a586-e0fc4acf76f8/call-process_alignment/shard-0/inputs/1628767813 [...]
[E::bwa_idx_load_from_disk] fail to locate the index files
Is it expected to lose the original input file names when passing through the pipeline? A lot of tools are sensitive to these, and this might be the underlying issue.
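The bwa failure above is a concrete case of this: bwa mem locates its index files purely by appending fixed suffixes to the reference path, so an input staged under a bare numeric name has no discoverable siblings. A small illustration (hypothetical paths, not Cromwell code):

```python
def expected_bwa_index_files(fasta_path):
    # bwa mem derives its index file names by appending fixed
    # suffixes to the reference fasta path it was given.
    return [fasta_path + ext for ext in (".amb", ".ann", ".bwt", ".pac", ".sa")]

# With the original name, the index siblings are predictable:
print(expected_bwa_index_files("hg19.fa"))
# With a numeric staging name, none of these files exist next to the input,
# hence "fail to locate the index files":
print(expected_bwa_index_files("1628767813"))
```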
Regarding the configuration: without http {} under engine -> filesystems, I get a complaint about it not being supported, even with http {} under backend -> providers -> Local -> config -> filesystems:
java.lang.IllegalArgumentException: Either https://storage.googleapis.com/bcbiodata/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa exists on a filesystem not supported by this instance of Cromwell, or a failure occurred while building an actionable path from it. Supported filesystems are: LinuxFileSystem. Failures: LinuxFileSystem: Cannot build a local path from https://storage.googleapis.com/bcbiodata/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa (RuntimeException) Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.cromwell$engine$workflow$lifecycle$materialization$MaterializeWorkflowDescriptorActor$$workflowInitializationFailed(MaterializeWorkflowDescriptorActor.scala:211)
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:181)
at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:176)
I was trying to crib from how things were done with the Google/gcp resolution, so I added it in there to fix this issue. Is there a different configuration approach I should be using?
Thanks again for helping with this.
Hi Brad
In addition to the http entry in the backend filesystems config, the http filesystem also needs to be defined system-wide. Cromwell's reference.conf defines this already, so as long as that's being pulled into your configuration and you're not overriding the filesystems, you should be set.
Thanks for this help with the configuration. I'm not intentionally overriding the global filesystems, but I don't have an explicit import of reference.conf. Do I need an import like we do for application? Do you spot anything else I might be doing wrong?
https://gist.github.com/chapmanb/72c6bf2d8282412b252f6192968b17cf
I appreciate all the help debugging this.
Within your http filesystem, I suspect you need enabled: true, à la https://github.com/broadinstitute/cromwell/blob/35_hotfix/core/src/main/resources/reference.conf#L349
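In other words, something like this in the backend filesystems block (a sketch based on the reference.conf link above; check the actual file for any additional keys the http filesystem needs):

```hocon
backend {
  providers {
    Local {
      config {
        filesystems {
          http {
            enabled: true
          }
        }
      }
    }
  }
}
```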
@cjllanwarne Did you try that, or are you making an educated guess? I haven't yet seen anyone need to do that, although AFAIK (educated guess myself) the http stuff isn't properly wired into the engine-level stuff at all. My suspicion, having run into something similar myself, is that this is a "doesn't really work with CWL" issue, but I could be wrong there.
Chris, thanks for the idea. I tried this and unfortunately had the same issue. From the behavior it looks like http is working, in that the files get downloaded, but they get numerical names instead of the expected file names. This disconnect seems to be what causes issues when passing them on to the CWL tools.
Hi @chapmanb - sorry for the delay in responding here.
I was able to get HTTP inputs to work in CWL against a default (i.e. no custom config specified) instance of Cromwell in server mode. The test case I used is in the linked PR (#4392).
I wonder whether you could confirm:
- Whether this test case works for you, and if so:
  - Is your use of HTTP inputs different somehow?
  - How can I enhance my test case to cover whatever is different?
- Or, whether this test case does not work for you, and if so:
  - We might try to work out what is different between your configuration and the default, which might be breaking things
Chris; Thanks for working on this and for the test case to iterate with. This example does work for me in the sense that it generates an md5sum, but it also demonstrates the underlying issue I'm having with HTTPS inputs. The files do get downloaded and staged into my pipeline, but the file names get mangled into random download numbers. md5sum is cool with this, but many of my real tasks fail because the expected file extensions and associated secondary file extensions get lost with the random file names.
Here's the example output I get from running this that demonstrates the file naming issue:
/usr/bin/md5sum '/home/chapmanb/tmp/cromwell/cromwell_work/cromwell-executions/main-http_inputs.cwl/093e2835-e4cc-4731-9248-88d74dec0977/call-sum/inputs/1515144/1710814112361209342' | cut -c1-32
This input should be called jamie_the_cromwell_pig.png but instead gets a long number as its name. Is it possible to preserve the initial file names with HTTPS, like happens with other filesystem types?
In terms of the test case, it would be great if it also checked that the file extension and name are preserved.
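Since GS inputs do keep their original basenames when staged, a scheme like the following would do the same for HTTP inputs: hash only the parent path for uniqueness and keep the original basename. This is a hypothetical sketch of that idea, not Cromwell's actual implementation:

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse

def staging_path(url, inputs_root="inputs"):
    # Hash the URL's parent directory for collision-free staging,
    # but keep the original basename so extensions and sibling
    # index files (.fai, .bwt, ...) remain discoverable by tools.
    path = PurePosixPath(urlparse(url).path)
    parent_hash = hashlib.md5(str(path.parent).encode()).hexdigest()[:16]
    return f"{inputs_root}/{parent_hash}/{path.name}"

print(staging_path(
    "https://storage.googleapis.com/bcbiodata/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa"
))
```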
Thanks again for looking at this.
Is there a fix in the latest development version for this issue? I'm still stuck on this, so I'm not sure if I missed something.
I was wondering the same. @rebrrown1395, it’s not obvious why this issue was closed; as it’s from an external user, it should be explained prior to closing.
@geoffjentry and @chapmanb, my apologies this wasn't supposed to be closed!
Hello! Is there an update or workaround on this? I am experiencing the same issue