galaxy [RFC] Link datasets w/ extension into working directory

Another PR in the series of "It's possible, but is it a good idea?"

I'd appreciate any feedback (particularly about these points):

should we do the symlinking on the framework level at all?
should we do this by default (as in this PR)?
or should we go with a more fine-grained control (i.e. allow tool authors to specify a tag on input files like suffix="fastq.gz" based on which we would link files in with a specific ending)?
Are there situations where we just can't do symlinking (perhaps Pulsar with path rewriting -- haven't tested this)?

In the context of #3145, it would be nice if tool wrapper authors could rely on the datatype extension for their tools. Many tools require a specific filename extension. Currently, tool authors need to symlink the input files with the correct extension in the tool wrapper.

With this commit, Galaxy will symlink input files into the current working directory as input_<dataset_id>.<datatype_extension>, provided that an extension is defined, the datatype is not composite, and the tool does not make use of parallelism.
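The rule above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: InputDataset, link_input and the uses_parallelism flag are hypothetical names that stand in for Galaxy's real dataset and job objects.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputDataset:
    # Hypothetical stand-in for a Galaxy dataset; not Galaxy's real model class.
    dataset_id: int
    file_path: str
    extension: Optional[str]
    is_composite: bool = False


def link_input(dataset, working_directory, uses_parallelism=False):
    """Symlink one input as input_<dataset_id>.<extension>.

    Returns the path of the new link, or None when one of the three
    conditions from the PR description rules out linking.
    """
    if not dataset.extension or dataset.is_composite or uses_parallelism:
        return None
    link_name = f"input_{dataset.dataset_id}.{dataset.extension}"
    link_path = os.path.join(working_directory, link_name)
    # Link to the absolute path so the link does not depend on the cwd.
    os.symlink(os.path.abspath(dataset.file_path), link_path)
    return link_path
```

Under these assumptions, a dataset with id 42 and extension fastq.gz would show up in the working directory as input_42.fastq.gz, while a dataset without an extension would be left untouched.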

I believe this makes writing wrappers easier for tool authors, and should also prevent resorting to Cheetah if a tool accepts different input formats, where each format has a specific extension (e.g. fastq, fastq.gz, fastq.bz2, etc.).

Nov 25 '16 17:11 mvdbeek

I wish I had a better memory. I think at some point we discussed a similar idea with a param-extension like link_as="foo.fastq" or link_as="foo.{ext}", which would then be automatically linked/copied into the working dir. I'm not sure why this was never implemented :(

Nov 25 '16 18:11 bgruening

I wish I had a better memory. I think at some point we discussed a similar idea with a param-extension like link_as="foo.fastq" or link_as="foo.{ext}", which would then be automatically linked/copied into the working dir.

I think that would be fairly easy to implement/modify (it goes in a similar vein as the "suffix" tag). On top of the current PR, that would either provide a default or allow not using a default suffix. A minor point is that just controlling the suffix may be beneficial when a tool uses collection input (where I think you could have a collision by forcing link_as="foo.fastq" alone). At least that's why I'm using "<dataset_id>." for now. How is this actually being handled in CWL? ping @mr-c @jmchilton
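To illustrate the collision concern, here is a small sketch that expands a link_as-style template for every element of a collection input. The render_link_names helper and the dictionary layout are assumptions made for this example only; no such attribute or function exists in Galaxy at this point.

```python
from collections import Counter


def render_link_names(template, elements):
    """Expand a link_as-style template per element and report name collisions."""
    names = [template.format(**element) for element in elements]
    collisions = sorted(name for name, count in Counter(names).items() if count > 1)
    return names, collisions


# Two datasets from the same collection input (hypothetical values).
elements = [
    {"dataset_id": 1, "ext": "fastq"},
    {"dataset_id": 2, "ext": "fastq"},
]

# A fixed name maps both datasets to the same link.
fixed_names, fixed_collisions = render_link_names("foo.{ext}", elements)

# Including the dataset id keeps the names distinct.
id_names, id_collisions = render_link_names("input_{dataset_id}.{ext}", elements)
```

With the fixed template both elements end up as foo.fastq, which is the collision described above. With the dataset id in the name, each element gets its own link.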

Nov 25 '16 19:11 mvdbeek

Hello @mvdbeek -- in CWL v1.0 we haven't optimized for this use case but it is possible.

Here is one method:

Use the InitialWorkDirRequirement to pre-populate the working directory (which is otherwise empty) using some subset of the input files via an array of Dirents whose entry field points to one of the input files and whose entryname field overrides the input filename.

Example:

class: CommandLineTool
cwlVersion: v1.0
baseCommand: [ls, -l]
requirements:
  InitialWorkDirRequirement:
    listing:
      - entryname: $(inputs.srcfile.nameroot).fastq
        entry: $(inputs.srcfile)
inputs:
  srcfile: File
outputs:
  listing: stdout

This is entirely too verbose and I've opened https://github.com/common-workflow-language/common-workflow-language/issues/351 to discuss alternatives for future versions of CWL. You are very welcome to make suggestions there!

Nov 26 '16 10:11 mr-c

should we do the symlinking on the framework level at all?

Yes, absolutely yes.

should we do this by default (as in this PR)?

Probably not for older tools - I've seen wrappers depend on a .dat extension, unfortunately. I'd be happy to say yes for profile="17.01" or newer tools, though.

or should we go with a more fine-grained control (i.e. allow tool authors to specify a tag on input files like suffix="fastq.gz" based on which we would link files in with a specific ending)?

We should implement something to allow customizing this. I'm not sure of the exact syntax but I can sit down and think about it.

Are there situations where we just can't do symlinking (perhaps Pulsar with path rewriting -- haven't tested this)?

Don't let Pulsar hold you back - I would just do this and we can optimize for Pulsar later on.

As for the implementation - I feel like we should put it in the job script generator not the tool command-line evaluation stuff - it feels like that is where it belongs.

Nov 28 '16 00:11 jmchilton

@mvdbeek can you not have the element_identifier as part of the link_as="foo_{element_identifier}.{ext}"?

Dec 03 '16 13:12 bgruening

That's what I'm working towards, yes. But looking up the tool_inputs and matching the datasets correctly proved to be a challenge ... I may need to modify the Job or JobWrapper class ...

Dec 03 '16 18:12 mvdbeek

This is marked as WIP so I am pushing it to 18.05.

Jan 02 '18 16:01 martenson

@mvdbeek I'd like to revive this, what do you think it needs?

Oct 17 '18 15:10 natefoo

I think the most glaring omission in the current form is tests that verify all is working correctly. I am not sure, for instance, how this will work with nested collections that have identical identifiers at the dataset instance level. Then I think it would be good to enable a syntax like the one @mr-c described, since I can see this working for collections as well. And finally, we may also want to do something about output datasets, since many tools apply logic based on the filename extension (e.g. a .gz extension will automatically compress the output).

We already have from_work_dir, which enables this pattern, but it unfortunately can't make use of element_identifier or other Cheetah variables.
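The nested-collection concern can be made concrete with a small sketch. It models a list:paired collection as nested dictionaries, which is a simplification and not Galaxy's collection model. Joining the parent identifiers into the link name is just one possible way to disambiguate, not something this PR implements.

```python
def flatten(collection, prefix=()):
    """Yield (identifier_path, extension) for every dataset in a nested collection."""
    for identifier, value in collection.items():
        path = prefix + (identifier,)
        if isinstance(value, dict):
            # Descend into the inner collection, remembering the parent identifiers.
            yield from flatten(value, path)
        else:
            yield path, value


# Hypothetical list:paired collection; leaf values are datatype extensions.
nested = {
    "sample_a": {"forward": "fastq.gz", "reverse": "fastq.gz"},
    "sample_b": {"forward": "fastq.gz", "reverse": "fastq.gz"},
}

# Using only the innermost identifier produces duplicate names.
leaf_only = [f"{path[-1]}.{ext}" for path, ext in flatten(nested)]

# Joining the whole identifier path keeps the names unique in this example.
full_path = ["_".join(path) + f".{ext}" for path, ext in flatten(nested)]
```

Note that a plain "_" join could still collide if the identifiers themselves contain underscores, so tests would need to cover that case as well.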

Oct 18 '18 07:10 mvdbeek

Ok. I'm thinking about reworking the job directory structure first and then I'll revisit this. Thanks!

Oct 19 '18 15:10 natefoo
