
HTTP input file names not respected in execution VM

Open chapmanb opened this issue 6 years ago • 13 comments

Hi all; In testing release 35 with CWL inputs I've also been looking at supporting remote URL references. This is working correctly for GS URLs but not for http URLs. I've put together a test case that demonstrates the problem:

https://github.com/bcbio/test_bcbio_cwl/tree/master/gcp

The somatic-workflow-http CWL workflow uses http URLs and doesn't work, while the comparable somatic-workflow CWL workflow uses GS URLs referencing the same data and does work.

The workflow fails with:

java.io.FileNotFoundException: Cannot hash file https://storage.googleapis.com/bcbiodata/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa

when running tasks. The files get downloaded to the input directories but are given numerical names instead of the original file names, so they never seem to sync over and get translated correctly in the workflow:

ls -lh cromwell_work/cromwell-executions/main-somatic.cwl/eaa632df-52a8-4aae-826f-647a42fa7145/call-prep_samples_to_rec/inputs/1515144/
total 136K
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 225050424226294657
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 2612405277530248055
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 503001634356675169
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 5802330287039666628
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 5809676514510180826
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 6090832304768530540
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 6105514522473810611
-rw------- 3 chapmanb chapmanb 37K Sep 26 14:07 6807576659333162957
-rw------- 3 chapmanb chapmanb 150 Sep 26 14:07 6853384576121493061
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 7483350933664987331
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 7538690575330349970
-rw------- 3 chapmanb chapmanb 37K Sep 26 14:07 7691692211431528147
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 7783203266940950463
-rw------- 3 chapmanb chapmanb 150 Sep 26 14:07 8389565043859020157
-rw------- 2 chapmanb chapmanb  43 Sep 26 14:07 8932347409858620277
-rw------- 2 chapmanb chapmanb 292 Sep 26 14:07 993751307168383758

My configuration is:

engine {
  filesystems {
    gcs {
      auth = "application-default"
    }
    http {}
  }
}

backend {
  providers {
    Local {
      config {
        filesystems {
          http {}
        }
      }
    }
  }
}

Am I doing anything wrong with my configuration or setup that I could tweak? Thanks so much for any pointers/suggestions.

chapmanb avatar Oct 01 '18 09:10 chapmanb

The hash failures are expected with http inputs and should not be the cause of your workflow failure. Also, we don't currently support http in engine filesystems. Do you see any other error messages that might provide some insight into what's happening?

mcovarr avatar Oct 01 '18 16:10 mcovarr

Thanks very much for the help debugging this.

Beyond the hash failure from Cromwell, the other errors I get all come from the workflow itself, due to the original file names not being preserved. The numerical hashes for files get passed directly into the downstream tools, stripping off any extensions or other identifying information. This confuses the tools; for example, tabix can't tell that a file wasn't already gzipped:

ValueError: Unexpected tabix input: /home/chapmanb/drive/work/cwl/test_bcbio_cwl/gcp/cromwell_work/cromwell-executions/main-somatic.cwl/93ef2d1c-88ee-4dc2-af0a-e0ea86bc785e/call-prep_samples/shard-0/execution/bedprep/cleaned-8539016497173364825.gz

or bwa can't find all the other associated indices:

bwa mem /home/chapmanb/drive/work/cwl/test_bcbio_cwl/gcp/cromwell_work/cromwell-executions/main-somatic.cwl/93ef2d1c-88ee-4dc2-af0a-e0ea86bc785e/call-alignment/shard-1/wf-alignment.cwl/96d7b606-e0fe-4305-a586-e0fc4acf76f8/call-process_alignment/shard-0/inputs/1628767813 [...]

[E::bwa_idx_load_from_disk] fail to locate the index files

Is it expected to lose the original input file names when passing through the pipeline? A lot of tools are sensitive to these, and this might be the underlying issue.
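To illustrate why the renaming breaks things (a sketch, not Cromwell's or bwa's actual code; the suffix list is bwa's standard set of index extensions): bwa locates its index files by appending fixed suffixes to the reference path, so a hash-based local name makes every lookup miss.

```python
# Illustration only: tools like bwa find their index files by appending
# fixed suffixes to the input path, so renaming the input to a bare hash
# value breaks the lookup entirely.
BWA_INDEX_SUFFIXES = [".amb", ".ann", ".bwt", ".pac", ".sa"]

def expected_bwa_index_files(fasta_path):
    """Paths bwa will try to open alongside the given FASTA file."""
    return [fasta_path + suffix for suffix in BWA_INDEX_SUFFIXES]

# With the original name, the indices resolve next to the reference:
print(expected_bwa_index_files("inputs/hg19.fa"))
# With a hash-mangled name, none of these files exist on disk:
print(expected_bwa_index_files("inputs/1628767813"))
```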

Regarding the configuration, without http {} under engine -> filesystems I get a complaint about it not being supported, even with http {} under backend -> providers -> Local -> config -> filesystems:

java.lang.IllegalArgumentException: Either https://storage.googleapis.com/bcbiodata/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa exists on a filesystem not supported by this instance of Cromwell, or a failure occurred while building an actionable path from it. Supported filesystems are: LinuxFileSystem. Failures: LinuxFileSystem: Cannot build a local path from https://storage.googleapis.com/bcbiodata/test_bcbio_cwl/testdata/genomes/hg19/seq/hg19.fa (RuntimeException) Please refer to the documentation for more information on how to configure filesystems: http://cromwell.readthedocs.io/en/develop/backends/HPC/#filesystems
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor.cromwell$engine$workflow$lifecycle$materialization$MaterializeWorkflowDescriptorActor$$workflowInitializationFailed(MaterializeWorkflowDescriptorActor.scala:211)
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:181)
        at cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anonfun$2.applyOrElse(MaterializeWorkflowDescriptorActor.scala:176)

I was trying to mirror how things are done for the Google/GCP resolution, so I added it there to fix this issue. Is there a different configuration approach I should be using?

Thanks again for helping with this.

chapmanb avatar Oct 01 '18 17:10 chapmanb

Hi Brad

In addition to the http entry in the backend filesystems config, the http filesystem also needs to be defined system-wide. Cromwell's reference.conf defines this already, so as long as that's being pulled into your configuration and you're not overriding the filesystems, you should be set.
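For context, the usual way the packaged defaults get pulled in is an include directive at the top of a custom config; a minimal sketch (the include line is the standard Cromwell convention, the rest is illustrative):

```hocon
# Pull in Cromwell's packaged defaults (application.conf, which in turn
# brings in reference.conf) before overriding anything, so the built-in
# http filesystem definition is not lost.
include required(classpath("application"))

backend.providers.Local.config.filesystems.http {}
```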

mcovarr avatar Oct 01 '18 20:10 mcovarr

Thanks for this help with the configuration. I'm not intentionally overriding the global filesystem, but I don't have an explicit import of reference.conf. Do I need an import like we have for application? Do you spot anything else I might be doing wrong?

https://gist.github.com/chapmanb/72c6bf2d8282412b252f6192968b17cf

I appreciate all the help debugging this.

chapmanb avatar Oct 02 '18 13:10 chapmanb

Within your http filesystem, I suspect you need enabled: true, as in https://github.com/broadinstitute/cromwell/blob/35_hotfix/core/src/main/resources/reference.conf#L349
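Expressed as config, the suggestion amounts to something like this (assuming the flag sits inside the http filesystem stanza, per the reference.conf line linked above):

```hocon
engine {
  filesystems {
    http {
      enabled: true  # suggested flag; see the linked reference.conf line
    }
  }
}
```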

cjllanwarne avatar Oct 04 '18 18:10 cjllanwarne

@cjllanwarne Did you try that or are you making an educated guess? I haven't yet seen anyone need to do that, although AFAIK (educated guess myself) the http stuff isn't properly wired into the engine-level stuff at all. My suspicion, having run into something similar myself, is that this is a "doesn't really work with CWL" issue, but I could be wrong there.

geoffjentry avatar Oct 05 '18 09:10 geoffjentry

Chris, thanks for the idea. I tried this and unfortunately hit the same issue. From the behavior it looks like http is working, in that the files get downloaded, but they get numerical names instead of the expected file names. This disconnect seems to be what causes issues when passing them on to the CWL tools.

chapmanb avatar Oct 05 '18 17:10 chapmanb

Hi @chapmanb - sorry for the delay in responding here.

I was able to get http inputs to work in CWL against a default (i.e. no custom config specified) instance of Cromwell in server mode. The test case I used is in the linked PR (#4392).

I wonder whether you could confirm:

  • Whether this test case works for you, and if so:
    • Is your use of HTTP inputs different somehow?
    • How can I enhance my test case to cover whatever is different?
  • Or, whether this test case does not work for you, and if so:
    • We can try to work out what differs between your configuration and the default that might be breaking things

cjllanwarne avatar Nov 15 '18 21:11 cjllanwarne

Chris, thanks for working on this and for the test case to iterate with. This example does work for me in the sense that it generates an md5sum, but it also demonstrates the underlying issue I'm having with https inputs. The files get downloaded and staged into my pipeline, but the file names get mangled into random download numbers. md5sum is cool with this, but many of my real tasks fail because the expected file extensions and associated secondary file extensions get lost with the random names.

Here's the example output I get from running this that demonstrates the file naming issue:

/usr/bin/md5sum '/home/chapmanb/tmp/cromwell/cromwell_work/cromwell-executions/main-http_inputs.cwl/093e2835-e4cc-4731-9248-88d74dec0977/call-sum/inputs/1515144/1710814112361209342' | cut -c1-32

This input should be called jamie_the_cromwell_pig.png but instead gets a long numerical name. Is it possible to preserve the initial file names with https, as happens with other filesystem types?
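The behavior I'd hope for can be sketched like this (illustration only, not Cromwell's actual localizer; the example URL is hypothetical): deriving the local name from the URL path keeps the extension and any secondary-file naming intact, whereas a hash-based name loses both.

```python
# Illustration of the naming issue: derive a local file name from the
# URL path rather than a hash, preserving the original extension.
from urllib.parse import urlparse
import posixpath

def url_basename(url):
    """Local file name taken from the last path segment of the URL."""
    return posixpath.basename(urlparse(url).path)

print(url_basename("https://example.com/images/jamie_the_cromwell_pig.png"))
# -> jamie_the_cromwell_pig.png, keeping the .png extension, rather than
#    a hash like 1710814112361209342 with no extension at all
```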

In terms of the test case, it would be great if it also checked that the file extension and name are preserved.

Thanks again for looking at this.

chapmanb avatar Nov 16 '18 15:11 chapmanb

Is there a fix for this issue in the latest development version? I'm still stuck on this, so I'm not sure if I missed something.

chapmanb avatar Nov 27 '18 00:11 chapmanb

I was wondering the same. @rebrown1395 it's not obvious why this issue was closed, and as it's from an external user it should be explained prior to closing.

geoffjentry avatar Nov 27 '18 00:11 geoffjentry

@geoffjentry and @chapmanb, my apologies, this wasn't supposed to be closed!

rebrown1395 avatar Nov 27 '18 14:11 rebrown1395

Hello! Is there an update or workaround for this? I am experiencing the same issue.

cnizo avatar May 14 '21 17:05 cnizo