miniwdl
miniwdl copied to clipboard
Add a use_relative_output_paths parameter that flattens output directories.
Motivation
#604
Approach
As suggested, modify the link_outputs function. I wrote a small map_paths_relative function that is much simpler and for File
and Directory
types just defers to map_paths. That way the handling of hardlinks etc. is correctly handled and there is no code duplication. I added a few small code snippets to the File
/Directory
section of map_paths to handle the relativity part.
I added a test, and also tested using BioWDL's germline-DNA pipeline manually to verify that multiple nested subworkflows do not influence the result.
Checklist
One question: I see that all test files are properly cleaned up. Except the files that I added to runner.t
. How can I remedy this?
Pull Request Test Coverage Report for Build 3128243776
- 9 of 30 (30.0%) changed or added relevant lines in 1 file are covered.
- 1 unchanged line in 1 file lost coverage.
- Overall coverage decreased (-0.2%) to 94.921%
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
---|---|---|---|
WDL/runtime/task.py | 9 | 30 | 30.0% |
<!-- | Total: | 9 | 30 |
Files with Coverage Reduction | New Missed Lines | % |
---|---|---|
WDL/runtime/backend/docker_swarm.py | 1 | 91.45% |
<!-- | Total: | 1 |
Totals | |
---|---|
Change from base Build 3080752069: | -0.2% |
Covered Lines: | 7176 |
Relevant Lines: | 7560 |
💛 - Coveralls
@rhpvorderman Apologies for the long delay picking up on this, been moving house and other interruptions.
On first reading, two comments on the code
- The search for directory component names "work" and "out" makes me a little nervous, as the user's PWD could include directories so named, beyond our control. When we're finishing a task, we should be able to know / pass in the task working directory (
os.path.join(run_dir, "work")
) and useos.path.relpath()
from there. The workflow-level outputs are trickier, let me think about this. - I think we should be able to use
WDL.Value.rewrite_env_paths
instead ofmap_paths_relative()
and shorten/simplify the diff. The existing logic needs a custom recursivemap_paths()
because it's creating subdirectories to reflect the nested structure of the output values. But we don't care about the structure for relative output paths.
Regarding test file cleanup, all the files created by runner.t
are supposed to go into a temp directory under /tmp/miniwdl_runner_tests_XXX/
where XXX is some random string and actually they don't get cleaned up (except of course when the machine restarts and empties /tmp
as usual). At the top of the script it creates the temp directory and cd's into it. Where are you seeing other stray files appear?
- The search for directory component names "work" and "out" makes me a little nervous, as the user's PWD could include directories so named, beyond our control. When we're finishing a task, we should be able to know / pass in the task working directory (
os.path.join(run_dir, "work")
) and useos.path.relpath()
from there. The workflow-level outputs are trickier, let me think about this.
I agree this is a bit of a hack that should be solved better. I'd rather not hardcode this, in case for some reason the name changes from "work" to "execution" for example and then suddenly a lot of code breaks.
I think we should be able to use WDL.Value.rewrite_env_paths instead of map_paths_relative() and shorten/simplify the diff. The existing logic needs a custom recursive map_paths() because it's creating subdirectories to reflect the nested structure of the output values. But we don't care about the structure for relative output paths. [emphasis mine]
But I really do care about that structure. Otherwise I will get a lot of collissions and it is quite nice to have it in an ordered fashion and to be able to set that in the pipeline. Nextflow has this with "PublishDir" and snakemake can do it too. Or am I misreading you here? I am sorry in that case. I am all for reusing existing code and not making it more complex than it needs to be.
Regarding test file cleanup, all the files created by runner.t are supposed to go into a temp directory under /tmp/miniwdl_runner_tests_XXX/ where XXX is some random string and actually they don't get cleaned up (except of course when the machine restarts and empties /tmp as usual). At the top of the script it creates the temp directory and cd's into it. Where are you seeing other stray files appear?
In the current working directory, at least when I ran it last time. I will check it again. At least the test code tests the desired result correctly. So there is at least one thing in this PR that does not have to be massively refactored :sweat_smile: .