
Cromwell use_relative_output_paths equivalent for miniwdl

Open rhpvorderman opened this issue 2 years ago • 2 comments

Hi, thank you for all the help incorporating all the features for miniwdl-slurm again. This time I want to discuss a convenience feature that we have added to cromwell: use_relative_output_paths, which is described here.

This option does the following: my_final_workflow_outputs_dir/ ~~MyWorkflow/af76876d8-6e8768fa/call-MyTask/execution/~~ output_of_interest. That is, the struck-through intermediate directories are dropped from the output path. Cromwell has an annoyingly deeply nested output directory. In miniwdl this is much less the case, but there is still some nesting going on:

  • When a task creates an output called FastqOut: FastqOut/my_out.fastq
  • When a task creates an array called reportFiles:
    • reportFiles/0/my_text_report.txt
    • reportFiles/1/my_pdf_report.pdf

In Snakemake there is much more control over where the output files end up. I can make a directory reports and put all the report files in there, and do the same for a directory called fastq. I can technically do this in WDL, but unless I use absolute paths, I am at the mercy of the execution engine. As shown in this functional test in pytest-workflow, miniwdl and Cromwell put files from the same workflow, using the same inputs, in different output directories. (Cromwell uses use_relative_output_paths in this example; otherwise testing would have been impossible because of the random UUIDs in its path names.)

This is annoying because a good directory structure helps in navigating the analysis results. It also makes it easier to write tests. I am thinking of simply adding an option similar to Cromwell's here in miniwdl, and making it configurable in the file_io section. I am happy to write all the code for that, but of course I do not want to implement random features without discussing them first.

rhpvorderman, Sep 21 '22 14:09

@rhpvorderman Sure, the current out/ directory structure is meant to be programmatically navigable (so that a script that doesn't have or want to parse outputs.json can still navigate the WDL outputs); but I like the idea of having an option to flatten it out. The relevant function is task.py:link_outputs which is used for both task and workflow outputs.

I imagine you could branch into a much simpler version of that function which could use WDL.Value.rewrite_env_paths to iterate over the filesystem paths in the outputs environment (you wouldn't need the custom recursion currently in link_outputs since you wouldn't be creating the whole subdirectory tree).
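To illustrate the "iterate over the paths" idea without depending on miniwdl internals, here is a minimal sketch in plain Python. It is not the actual WDL.Value.rewrite_env_paths API: `rewrite_paths` and `collect_paths` are hypothetical helpers, and the sketch assumes the outputs look like a parsed outputs.json in which every string leaf is a filesystem path.

```python
from typing import Any, Callable, List


def rewrite_paths(value: Any, f: Callable[[str], str]) -> Any:
    """Apply f to every string leaf of a JSON-like outputs structure.

    Hypothetical stand-in for a path-rewriting helper; assumes every
    string leaf is a File or Directory path.
    """
    if isinstance(value, str):
        return f(value)
    if isinstance(value, list):
        return [rewrite_paths(v, f) for v in value]
    if isinstance(value, dict):
        return {k: rewrite_paths(v, f) for k, v in value.items()}
    return value  # None, numbers and booleans pass through unchanged


def collect_paths(outputs: dict) -> List[str]:
    """Return all paths found in outputs, in traversal order."""
    found: List[str] = []

    def visit(path: str) -> str:
        found.append(path)
        return path  # leave the value itself untouched

    rewrite_paths(outputs, visit)
    return found
```

The flat list returned by `collect_paths` is all a flattening step would need; no subdirectory tree has to be rebuilt from the nesting of the values.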

Three detail considerations:

  1. Handling both File and Directory outputs
  2. The output_hardlinks configuration option that controls whether out/ contains symlinks or hardlinks
  3. What to do if there's a basename collision? (I think the Cromwell docs you linked say it just throws in that case, which sounds fine to me if this is an opt-in feature)
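The three considerations could be handled roughly as follows. This is only a sketch using the Python standard library, not miniwdl code: `link_outputs_flat` and its `hardlinks` parameter are hypothetical names, with `hardlinks` standing in for the output_hardlinks option. Collisions are checked for all paths before anything is linked, so a failure does not leave a half-populated directory.

```python
import os
from typing import Dict, Iterable


def link_outputs_flat(
    paths: Iterable[str], out_dir: str, hardlinks: bool = False
) -> Dict[str, str]:
    """Link each output path directly under out_dir by its basename.

    Hypothetical helper. Raises FileExistsError when two different
    sources share a basename. Returns a source -> destination mapping.
    """
    os.makedirs(out_dir, exist_ok=True)

    # Pass 1: plan every destination and detect basename collisions.
    planned: Dict[str, str] = {}  # destination -> source
    for src in paths:
        src = os.path.abspath(src)
        dest = os.path.join(out_dir, os.path.basename(src))
        if dest in planned and planned[dest] != src:
            raise FileExistsError(
                f"basename collision: {planned[dest]} and {src} both map to {dest}"
            )
        planned[dest] = src

    # Pass 2: create the links.
    for dest, src in planned.items():
        if os.path.isdir(src):
            if hardlinks:
                # Directories cannot be hardlinked: recreate the tree
                # and hardlink each file inside it.
                for root, _dirs, files in os.walk(src):
                    rel = os.path.relpath(root, src)
                    target_root = os.path.normpath(os.path.join(dest, rel))
                    os.makedirs(target_root, exist_ok=True)
                    for name in files:
                        os.link(
                            os.path.join(root, name),
                            os.path.join(target_root, name),
                        )
            else:
                os.symlink(src, dest, target_is_directory=True)
        elif hardlinks:
            os.link(src, dest)
        else:
            os.symlink(src, dest)

    return {src: dest for dest, src in planned.items()}
```

With this layout, FastqOut/my_out.fastq and reportFiles/0/my_text_report.txt would both end up directly under out_dir, and two outputs named report.txt would fail up front rather than silently overwrite each other.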

mlin, Sep 22 '22 05:09

Thanks for the pointers and your quick answer. For point 3, I will throw an exception. It is a bit annoying for the user that the workflow fails at the last possible moment, but as you say, this should not be a big problem since the feature is opt-in.

rhpvorderman, Sep 23 '22 09:09