
read_x() function and File object resolution in task output AWS Batch S3

Open microbioticajon opened this issue 2 years ago • 0 comments

Hi Guys.

We are running pipelines through Cromwell on AWS Batch using S3 and have noticed some behaviour we didn't initially expect.

We have a task that has quite a significant setup cost. As such we want to process a number of samples through this task rather than instantiating the task for every sample. We can then parallelise this task to process batches of samples.

The task takes an Array of structs:

struct Sample {
  String id
  File file1
  File file2
}

The array is serialised to the task using write_json(), and the tool consumes the resulting JSON before processing the samples one after the other. It is important that the output files can be matched back to their original inputs via the supplied id. The tool writes a single file per sample to a directory and produces a reports.json that looks like:

[
  {
    "id": "1",
    "file": "outputs/report.txt"
  },
  ...
]
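
To make the setup concrete, the input side of the task looks roughly like the sketch below. The task name `process_batch`, the tool name `my_tool`, and its flags are placeholders rather than our real pipeline, and the runtime section is omitted; only the `Sample` struct and the `write_json()` call reflect what is described above.

```wdl
version 1.0

struct Sample {
  String id
  File file1
  File file2
}

task process_batch {
  input {
    # One batch of samples per task instance, to amortise the setup cost.
    Array[Sample] samples
  }

  command <<<
    # write_json() serialises the whole batch to a single JSON file.
    # `my_tool` and its flags are hypothetical placeholders.
    my_tool --samples ~{write_json(samples)} --outdir outputs --report reports.json
  >>>

  output {
    # The index file itself is a plain File output and is handled normally.
    File report_index = "reports.json"
  }
}
```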

I was hoping we could use the read_json() function to parse reports.json into an array of the following struct:

struct Report {
  String id
  File file
}

and pass this to the next task (or drive a scatter) in the pipeline. However, the File objects parsed in this manner are not resolved to actual task outputs: their paths are not updated, and the files are not delocalised at the end of the task.
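
A minimal sketch of the pattern, assuming placeholder task and workflow names (`process_batch`, `summarise`, `batch_reports`) and a stand-in command in place of the real tool; the runtime sections are omitted for brevity. Only the `Report` struct and the `read_json()` output reflect the actual setup.

```wdl
version 1.0

struct Report {
  String id
  File file
}

task process_batch {
  command <<<
    # Stand-in for the real tool: writes one report plus the index file.
    mkdir -p outputs
    echo "report for sample 1" > outputs/report.txt
    echo '[{"id": "1", "file": "outputs/report.txt"}]' > reports.json
  >>>

  output {
    # Hoped for: each Report.file is treated like a regular File output.
    # Observed on AWS Batch with S3: the path is left as written and the
    # file is not delocalised.
    Array[Report] reports = read_json("reports.json")
  }
}

task summarise {
  input {
    String id
    File report_file
  }

  command <<<
    echo "~{id}: $(wc -l < ~{report_file}) lines"
  >>>

  output {
    String summary = read_string(stdout())
  }
}

workflow batch_reports {
  call process_batch

  # Drive a scatter from the parsed structs, keeping id and file paired.
  scatter (r in process_batch.reports) {
    call summarise { input: id = r.id, report_file = r.file }
  }
}
```

With a raw `File` or `Array[File]` output the downstream call receives a resolved path; here it receives the relative path exactly as it appears in the JSON.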

Conceptually, it seems that Files inside structs generated by read_*() functions should be resolved in the same way as raw File outputs. However, looking at how delocalisation occurs in the Cromwell task script, I understand why this would be difficult to implement.

The WDL spec does not specifically state that File outputs generated this way will be respected, but then again it does not say that they won't.

a) Could I put forward a feature request for the spec to detect File outputs generated from read_*() functions and delocalise them? b) Or could a note be added to the WDL/Cromwell spec that File objects generated from read_*() functions may not be detected in the output?

Thanks, Jon

microbioticajon, Jul 07 '22 12:07