galaxy icon indicating copy to clipboard operation
galaxy copied to clipboard

[WIP] Link or copy inputs to working directory if inputs_to_working_directory is 'link' or 'copy'

Open natefoo opened this issue 3 years ago • 7 comments

Rather than fix every tool that writes to its inputs directory, we can just symlink inputs to a subdir of the job directory. Bonus features:

  • Inputs are symlinked with their dataset's extension!
  • You can set this per-job-destination/environment (e.g. per @pvanheus this could be used to copy inputs for Shovill, which canonicalizes all symlinks in its input paths)

Stuff that should be done/tested before merging:

  • [ ] Input extra files paths are not addressed
  • [ ] Container volume expansion magic might need some work - I guess we should make it writable (even though the fact that it shouldn't be writable is what this change addresses)?
  • [ ] Pulsar??
  • [x] Tests

How to test the changes?

(Select all options that apply)

  • [x] I've included appropriate automated tests.
  • [ ] This is a refactoring of components with existing test coverage.
  • [ ] Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]
  • [x] What are tests?

License

  • [x] I agree to license these contributions under Galaxy's current license.
  • [x] I agree to allow the Galaxy committers to license these and all my past contributions to the core galaxy codebase under the MIT license. If this condition is an issue, uncheck and just let us know why with an e-mail to [email protected].

natefoo avatar Sep 10 '21 18:09 natefoo

Whoops, didn't realize those tools would be run as part of the regular tool tests.

natefoo avatar Sep 10 '21 21:09 natefoo

* inputs are symlinked with their dataset's extension!

This can cause problems. As we, unfortunately, have for example this as extension fastqsanger - but it should be fastq.

I'm wondering if this should be a tool parameter. Where you can indicate that all files are linked/copied into the working-dir. We have a few problems with those that make writing tools harder.

  • symlinks as addressed here
  • symlinking of metadata files --> BAI
  • fixing file-extensions
  • input names --> we do a lot of sanatizing and bash tricks to remove whitespace and strange characters from dataset names (element-identifiers)

I'm wondering if a data parameter like <data nam="foo" link_to_wd="__sanatized_element_id__.fastq" ...> could help.

bgruening avatar Sep 11 '21 07:09 bgruening

* inputs are symlinked with their dataset's extension!

This can cause problems. As we, unfortunately, have for example this as extension fastqsanger - but it should be fastq.

Yes, but this proposal doesn't make those cases worse, does it? A tool that doesn't accept .fastqsanger files will likely not accept .dat files either. If this gets merged, many wrappers could be simplified. A few would need to do renaming, but that's not more complicated than now (unless the tool has issues with two levels of symlinking). In general, you would get much more readable command lines with the proposal.

Two potential issues:

  • There could be tools that refuse to work with symlinks, but want real files - hopefully these are really rare cases.
  • Tools might follow symlinks and still work on the real file. So it may still be good to have planemo test check that nothing's written to the real file's directory?

Maybe the real path to every input file could be made accessible via an attribute to the wrapper, e.g. via input_file.realpath or similar, to provide a way to bypass the symlink if a tool really needs to?

wm75 avatar Sep 13 '21 06:09 wm75

I'm wondering if a data parameter like <data nam="foo" link_to_wd="__sanatized_element_id__.fastq" ...> could help.

I think this would be awesome for tool authors, but we also need this global default. This makes a lot of sense to me.

mvdbeek avatar Sep 13 '21 09:09 mvdbeek

Yes, but this proposal doesn't make those cases worse, does it? A tool that doesn't accept .fastqsanger files will likely not accept .dat files either.

Right. And I agree that providing more options to tool authors would be good, but I think this is a good Galaxy-wide first step. Although it is worth pointing out that until we make this option the default, tool authors cannot rely on it.

Symlinking metadata files is definitely something we should do and it's an oversight that I didn't include it here.

natefoo avatar Sep 13 '21 15:09 natefoo

-0 - just use Pulsar rather than making jobs stuff for more complicated on this end. It also doesn't tie into objectstores - I feel like if these files exist the objectstore should staging data right to those inputs right rather than copying them afterward. There is all this complexity we want to implement around that - and having a second way of doing all that and that is less structured than Pulsar makes me anxious.

jmchilton avatar Sep 13 '21 16:09 jmchilton

As the person most often championing Pulsar's approach to isolation and the idea that Pulsar should become Galaxy's primary job execution engine, I probably should have been the one to make that point :smile:. This was basically a question of "how easily could we address this major recurring problem/nuisance for tool authors?" for which the answer ended up being "really easily." But I agree with what you said. I think until Pulsar is the way that 95% of Galaxy jobs are run rather than the other way around, though, this is maybe not a bad thing?

natefoo avatar Sep 14 '21 16:09 natefoo