pydra icon indicating copy to clipboard operation
pydra copied to clipboard

Absolute path, rename, datasink

Open satra opened this issue 4 years ago • 11 comments

@jw-96 and @djarecka it would be good to discuss the implications of absolute path since it takes things outside the cache directory and may not always do what one thought it accomplishes.

This also relates to the notion of his pydra would be implementing datasink and rename.

First, I think there are situations where absolute path could be helpful, such as working on really large multi TB files. However this means putting output outside of the cache with no way to check. What happens if that outside file has been removed and one requests the task results?

Thus I would like to consider that absolute path should almost never be a spec attribute, but a task attribute that controls where the output could be placed after the task is run in an isolated cache directory. This would allow incorporating a sink overlay, for example to output a bids named derivative that may itself require access to certain variables in a workflow.

It would be good to discuss a bit why absolute path was necessary.

satra avatar Jun 25 '21 11:06 satra

@satra - and what do you think it should happen in the dcmi2nii example. The output directory is a mandatowy input field and the output files use the output directory in the template for names. Before absolut_path pr, the files were created properly in the output directory, but the output was returning path from the task's directory (and the files were actually not there).

In the current master, the output fields are set to NOTHING, unless all of the output field has absolut_path attribute, and this is not the perfect solution.

While I was not planning to remove absolute_path completely, I was planning to change the behavior in a way, that all files that use in the template directory from the input have to have "absolut_path" (by definition the input path can not be inside the task's directory). And I was planning to copy the files to task's directory to prevent the issue with removing files (when they are in external directory). This would be the default solution for tasks, and we could have a different behavior for tasks with large files.

What do you think?

djarecka avatar Jun 27 '21 15:06 djarecka

you can take a look at the original implementation in nipype 1, which used copyfile (which also exists in pydra) to link the input directory/files into the task working directory. those semantics should be used in the new task as well, but i don't think the specs are well documented anywhere for task creators.

satra avatar Jun 27 '21 15:06 satra

I was using copyfile when the input files are changed in place, so it's a bit different use case

djarecka avatar Jun 27 '21 18:06 djarecka

isn't there a copyfile=False which is a hardlink/symlink for files that require generating outputs at the location of input. many commands only generate outputs at the location of the input.

satra avatar Jun 27 '21 18:06 satra

but I understand that this is not the case for dcm2niix, it does have in_dir and out_dir: https://github.com/nipype/pydra-dcm2niix/blob/main/pydra/tasks/dcm2niix/utils.py#L18

djarecka avatar Jun 27 '21 19:06 djarecka

i'm not saying that the interface has been created with all the pieces that we had in nipype1: https://github.com/nipy/nipype/blob/master/nipype/interfaces/dcm2nii.py#L311 . regarding the output_directory a callable could have been used to ensure that output_directory is respected.

however, the discussion here is independent of how dcm2niix is implemented or could be implemented.

while absolute_paths can be useful, the use of it to put output somewhere should be avoided and replaced by some other feature that allows placing outputs at some location in addition to the task directory. the pydra concept should be that outputs are produced in the cache directory and then moved elsewhere as necessary. if a task is used in a non-workflow setting then i can see the benefits of returning the final location rather than the in-cache location. but in a dataflow data should not be arbitrarily moved in and out of the cache dir, i think.

satra avatar Jun 27 '21 20:06 satra

The main motivation for adding absolute_path was to support it for the BoshTask. The use case would be tools that generate their own cache / safe files directory, that cannot be changed. Currently, Pydra only checks if the _results.pklz exists. It never checks if any of the output files actually exist.

JeffWigger avatar Jun 28 '21 06:06 JeffWigger

@jw-96 - i think the check you introduced is very useful, so I will keep it, but will suggest some changes to the absolut_path.

djarecka avatar Jun 28 '21 13:06 djarecka

The use case would be tools that generate their own cache / safe files directory, that cannot be changed.

are you suggesting that this is global? as opposed to pydra instructing the tool to use the working directory as cache?

let's consider the use case where a pydra task runs another pydra workflow in a container. by default, one could imagine that the cache directory of the workflow would be in the working directory of the task, but with an ability to optimize this cache to a central location.

It never checks if any of the output files actually exist.

this should happen and should be encoded in the spec. if it did not then it should have been fixed.

the above two use cases do need the ability to check for files in other locations, just like pydra checks different types of cache. but i don't see how an absolute path key is needed for them.

satra avatar Jun 28 '21 13:06 satra

absolute_path only affects where pydra is looking for the output file. The main use case is to retrieve an output file that a program generated at a fixed location.

JeffWigger avatar Jun 29 '21 07:06 JeffWigger

absolute_path only affects where pydra is looking for the output file.

i meant why not leave this up to the task to determine where to look for outputs. even without a special key the task should have awareness of where outputs are created: 1) alongside inputs 2) in the cache directory, or 3) at an absolute location. what benefit does specifying absolute_path provide given the information stored in the inputs?

if pydra, instead of the task, was taking decisions as to where it thought outputs should go that should be addressed, but it seems every task is able to determine for itself which of the three options outputs could go to.

satra avatar Jun 29 '21 12:06 satra