miniwdl Add a use_relative_output_paths parameter that flattens output directories.

Add a use_relative_output_paths parameter that flattens output directories.

Open rhpvorderman opened this issue 1 year ago • 4 comments

Motivation

#604

Approach

As suggested, modify the link_outputs function. I wrote a small map_paths_relative function that is much simpler and for File and Directory types just defers to map_paths. That way the handling of hardlinks etc. is correctly handled and there is no code duplication. I added a few small code snippets to the File/Directory section of map_paths to handle the relativity part.

I added a test, and also tested using BioWDL's germline-DNA pipeline manually to verify that multiple nested subworkflows do not influence the result.

Checklist

[x] Add appropriate test(s) to the automatic suite
[x] Use make pretty to reformat the code with black
[x] Use make check to statically check the code using Pyre and Pylint
[x] Send PR from a dedicated branch without unrelated edits
[x] Ensure compatibility with this project's MIT license

Sep 26 '22 13:09 rhpvorderman

One question: I see that all test files are properly cleaned up. Except the files that I added to runner.t. How can I remedy this?

Sep 26 '22 13:09 rhpvorderman

Pull Request Test Coverage Report for Build 3128243776

9 of 30 (30.0%) changed or added relevant lines in 1 file are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage decreased (-0.2%) to 94.921%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
WDL/runtime/task.py	9	30	30.0%
<!--	Total:	9	30

Files with Coverage Reduction	New Missed Lines	%
WDL/runtime/backend/docker_swarm.py	1	91.45%
<!--	Total:	1

Totals
Change from base Build 3080752069:	-0.2%
Covered Lines:	7176
Relevant Lines:	7560

💛 - Coveralls

Oct 02 '22 09:10 coveralls

@rhpvorderman Apologies for the long delay picking up on this, been moving house and other interruptions.

On first reading, two comments on the code

The search for directory component names "work" and "out" makes me a little nervous, as the user's PWD could include directories so named, beyond our control. When we're finishing a task, we should be able to know / pass in the task working directory (os.path.join(run_dir, "work")) and use os.path.relpath() from there. The workflow-level outputs are trickier, let me think about this.
I think we should be able to use WDL.Value.rewrite_env_paths instead of map_paths_relative() and shorten/simplify the diff. The existing logic needs a custom recursive map_paths() because it's creating subdirectories to reflect the nested structure of the output values. But we don't care about the structure for relative output paths.

Regarding test file cleanup, all the files created by runner.t are supposed to go into a temp directory under /tmp/miniwdl_runner_tests_XXX/ where XXX is some random string and actually they don't get cleaned up (except of course when the machine restarts and empties /tmp as usual). At the top of the script it creates the temp directory and cd's into it. Where are you seeing other stray files appear?

Oct 11 '22 03:10 mlin

The search for directory component names "work" and "out" makes me a little nervous, as the user's PWD could include directories so named, beyond our control. When we're finishing a task, we should be able to know / pass in the task working directory (os.path.join(run_dir, "work")) and use os.path.relpath() from there. The workflow-level outputs are trickier, let me think about this.

I agree this is a bit of a hack that should be solved better. I'd rather not hardcode this, in case for some reason the name changes from "work" to "execution" for example and then suddenly a lot of code breaks.

I think we should be able to use WDL.Value.rewrite_env_paths instead of map_paths_relative() and shorten/simplify the diff. The existing logic needs a custom recursive map_paths() because it's creating subdirectories to reflect the nested structure of the output values. But we don't care about the structure for relative output paths. [emphasis mine]

But I really do care about that structure. Otherwise I will get a lot of collissions and it is quite nice to have it in an ordered fashion and to be able to set that in the pipeline. Nextflow has this with "PublishDir" and snakemake can do it too. Or am I misreading you here? I am sorry in that case. I am all for reusing existing code and not making it more complex than it needs to be.

Regarding test file cleanup, all the files created by runner.t are supposed to go into a temp directory under /tmp/miniwdl_runner_tests_XXX/ where XXX is some random string and actually they don't get cleaned up (except of course when the machine restarts and empties /tmp as usual). At the top of the script it creates the temp directory and cd's into it. Where are you seeing other stray files appear?

In the current working directory, at least when I ran it last time. I will check it again. At least the test code tests the desired result correctly. So there is at least one thing in this PR that does not have to be massively refactored :sweat_smile: .

Oct 11 '22 14:10 rhpvorderman

miniwdl miniwdl copied to clipboard

Add a use_relative_output_paths parameter that flattens output directories.

Motivation

Approach

Checklist

Pull Request Test Coverage Report for Build 3128243776

💛 - Coveralls

miniwdl
miniwdl copied to clipboard