Cromwell use_relative_output_paths equivalent for miniwdl
Hi, thank you again for all the help incorporating the features for miniwdl-slurm. This time I want to discuss a convenience feature that we have added to Cromwell: use_relative_output_paths, which is described here.
This option does the following: my_final_workflow_outputs_dir/ ~~MyWorkflow/af76876d8-6e8768fa/call-MyTask/execution/~~ output_of_interest. Cromwell has an annoyingly deeply nested output directory. While this is much less the case in miniwdl, there is still some nesting going on:
- When a task creates an output called `FastqOut`: `FastqOut/my_out.fastq`
- When a task creates an array called `reportFiles`:
  - `reportFiles/0/my_text_report.txt`
  - `reportFiles/1/my_pdf_report.pdf`
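The difference between the two layouts can be sketched in a few lines of Python. This is an illustration only, not miniwdl or Cromwell code: `nested_links`, `relative_link`, and the example work directory are hypothetical, and the nested form is modelled solely on the examples in the list above.

```python
from pathlib import PurePosixPath
from typing import List, Union


def nested_links(name: str, value: Union[str, List[str]]) -> List[str]:
    """Link paths in the nested layout: <output name>[/<index>]/<basename>."""
    if isinstance(value, str):
        return [f"{name}/{PurePosixPath(value).name}"]
    # Arrays get one numbered subdirectory per element.
    return [f"{name}/{i}/{PurePosixPath(p).name}" for i, p in enumerate(value)]


def relative_link(work_dir: str, path: str) -> str:
    """Link path in a relative layout: the path as seen from the task's work dir."""
    # Raises ValueError if `path` is not located under `work_dir`.
    return str(PurePosixPath(path).relative_to(work_dir))


if __name__ == "__main__":
    work = "/runs/call-report/work"  # hypothetical task working directory
    reports = [f"{work}/reports/my_text_report.txt", f"{work}/reports/my_pdf_report.pdf"]
    print(nested_links("reportFiles", reports))
    print([relative_link(work, p) for p in reports])
```

With a relative layout, a task that writes its reports into a `reports` directory keeps that directory in the final output, regardless of what the WDL output is called.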
In snakemake there is much more control over where the output files end up. I can make a directory `reports` and put all the report files in there, and do the same for a directory called `fastq`. I can technically do this in WDL, but unless I use absolute paths, I am at the mercy of the execution engine. As shown in this functional test in pytest-workflow, miniwdl and Cromwell put files from the same workflow, run with the same inputs, in different output directories. (Cromwell uses use_relative_output_paths in this example; otherwise testing would have been impossible because of the random UUIDs in its path names.)
This is annoying because a good directory structure helps in navigating the analysis results. It also makes it easier to write tests. I am thinking of simply adding an option similar to Cromwell's here in miniwdl, and making it configurable in the file_io section. I am happy to write all the code for that, but of course I do not want to implement random features without discussing them first.
@rhpvorderman Sure, the current `out/` directory structure is meant to be programmatically navigable (so that a script that doesn't have or want to parse outputs.json can still navigate the WDL outputs); but I like the idea of having an option to flatten it out. The relevant function is `task.py:link_outputs`, which is used for both task and workflow outputs.
I imagine you could branch into a much simpler version of that function which could use `WDL.Value.rewrite_env_paths` to iterate over the filesystem paths in the outputs environment (you wouldn't need the custom recursion currently in `link_outputs` since you wouldn't be creating the whole subdirectory tree).
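The idea of visiting every path through a callback, instead of hand-writing the recursion at each call site, can be sketched with the standard library alone. `map_paths` below is a hypothetical stand-in and does not reproduce the signature or behaviour of `WDL.Value.rewrite_env_paths`; it also assumes, for simplicity, that every string leaf in the outputs is a filesystem path.

```python
from typing import Any, Callable, Dict, List


def map_paths(value: Any, f: Callable[[str], str]) -> Any:
    """Apply f to every path string in a nested structure of lists and dicts."""
    if isinstance(value, str):
        return f(value)
    if isinstance(value, list):
        return [map_paths(v, f) for v in value]
    if isinstance(value, dict):
        return {k: map_paths(v, f) for k, v in value.items()}
    return value  # non-path values (ints, None, ...) pass through unchanged


def collect_paths(outputs: Dict[str, Any]) -> List[str]:
    """Gather all paths in visiting order, without building any directory tree."""
    seen: List[str] = []

    def visit(path: str) -> str:
        seen.append(path)
        return path  # leave the value itself untouched

    map_paths(outputs, visit)
    return seen
```

A flattening implementation would then only need to loop over the collected paths, since the output names and array indices no longer determine where a link is placed.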
Three detail considerations:
- Handling both File and Directory outputs
- The `output_hardlinks` configuration option that controls whether `out/` contains symlinks or hardlinks
- What to do if there's a basename collision? (I think the Cromwell docs you linked say it just throws in that case, which sounds fine to me if this is an opt-in feature)
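A minimal sketch of how these three points could fit together, using only the Python standard library. It is not the miniwdl implementation: `link_flat` and its `hardlinks` parameter are hypothetical names (the latter standing in for the `output_hardlinks` setting), and raising `FileExistsError` on a collision is an assumption about how "throwing" might look.

```python
import os
import shutil
from typing import Dict, Iterable


def link_flat(paths: Iterable[str], out_dir: str, hardlinks: bool = False) -> Dict[str, str]:
    """Link files and directories into out_dir under their basenames."""
    # Check for collisions first, so nothing is created if the call will fail.
    sources: Dict[str, str] = {}
    for src in paths:
        name = os.path.basename(os.path.normpath(src))
        if name in sources:
            raise FileExistsError(
                f"basename collision: {sources[name]!r} and {src!r} both map to {name!r}"
            )
        sources[name] = src

    os.makedirs(out_dir, exist_ok=True)
    links: Dict[str, str] = {}
    for name, src in sources.items():
        dest = os.path.join(out_dir, name)
        if os.path.isdir(src):
            if hardlinks:
                # Directories cannot be hardlinked: recreate the tree, hardlink the files.
                shutil.copytree(src, dest, copy_function=os.link)
            else:
                os.symlink(os.path.abspath(src), dest, target_is_directory=True)
        elif hardlinks:
            os.link(src, dest)
        else:
            os.symlink(os.path.abspath(src), dest)
        links[name] = dest
    return links
```

Running the collision check before creating any link means a failing call leaves no partially populated output directory behind. Hardlinks additionally require the sources and `out_dir` to be on the same filesystem.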
Thanks for the pointers and your quick answer. In case of 3 I will throw an exception. It is a bit annoying for the user that the workflow fails at the last possible moment, but as you say this should not be a big problem when it is configurable.