wdl
wdl copied to clipboard
Scatter over multiple items with literate names (as syntactic sugar)
While the pair syntax is helpful, if you have three or more arrays you're trying to zip, the syntax gets pretty messy.
For example, right now I can do:
scatter(pair in zip(zip(sample_bams, sample_metrics), sample_vcfs))) {
call task { input: bam=pair.left.left, metric=pair.left.right, vcf=pair.right }
}
But I'd much prefer to be able to say:
scatter(bam, metric, vcf in zip(sample_bams, sample_metrics, sample_vcfs)) {
call task{ input: bam=bam, metric=metric, vcf=vcf}
}
If the concern is scoping rules, perhaps it could be the equivalent to:
scatter(pair in zip(zip(sample_bams, sample_metrics), sample_vcfs)) {
FIle bam = pair.left.left
FIle metric = pair.left.right
File vcf = pair.right
call task { input: bam=bam, metric=metric }
}
I am not sure how you are generating the upstream information, but would it be possible to use a struct
to encapsulate all of the information you are trying to group together?
IE:
struct SampleInfo {
File bam
File metric
File vcf
}
workflow myWorkflow {
Array[SampleInfo] sampleInfo
scatter(info in SampleInfo){
call task { input: bam=info.bam, metric=info.metric, vcf=info.vcf }
}
}
Pair
was never intended to be the final word. The origin of it was really the need for a tuple construct, but the driving use case at the time only needed tuple2
and the implementation of a Pair
in Cromwell was a lot simpler than a arbitrarily sized tuple structure so that's what we did. We figured that we'd always have backwards compatibility if tuple was ever realized by just making Pair
an alias for a tuple2
So I'd also consider if an arbitrary tuple would make sense, and if so, if there's something that's providing that's not provided by a struct (and/or just syntactic sugar over a struct)
Thanks for the response!
I am not sure how you are generating the upstream information, but would it be possible to use a struct to encapsulate all of the information you are trying to group together?
I'm not sure :) . Right now this is the combination of a number of calls (slightly contrived example, but gets at the root of it):
scatter(fastq in fastqs) {
call bwa { input: fastq=fastq}
}
scatter (bam in bwa.bam) {
call filter_bam { input: bam=bam}
}
call calc_metrics { input: bams=filter_bam.bam}
scatter (bam, filtered_bam, metric in zip(bwa.bam, filter_bam.out_bam, calc_metrics.metric)) {
call ...
}
The reason I wouldn't just put this all in one scatter is because I have a normalization step that takes in all fastqs (or I might want to call out to some subworkflows). Regardless I end up with a bunch of lists and I'm not sure how to convert that into structs.
Does that make sense?
So I'd also consider if an arbitrary tuple would make sense, and if so, if there's something that's providing that's not provided by a struct (and/or just syntactic sugar over a struct)
Arbitrary tuple makes sense in the sense that it's pretty natural to want to iterate over a set of arrays that are all have items in the same order (esp because different scatters still guarantee output in same order)
@jtratner One way you could go about doing this would be to build a series of structs or even using a Map[String,File]
type, (or one struct with optional parameters). The struct approach may be slightly dependent on struct literals working according to their new definition, so I will provide an answer using maps:
# Approach using Maps which can be implemented now
scatter(fastq in fastqs) {
call bwa { input: fastq=fastq}
Map[String,File] intermediate_1 = {"fastq":intermediate["fastq"],"bam":bam}
}
scatter (intermediate in intermediate_1) {
call filter_bam { input: bam=intermediate["bam"]}
Map[String,File] intermediate_2 = {"fastq":intermediate["fastq"],"bam":intermediate["bam"],"filter_bam": filter_bam.bam}
}
# filter_bam is still defined so you can still call it and receive a `Arrary[File]` type
call calc_metrics { input: bams=filter_bam.bam}
scatter (intermediate in intermediate_2) {
File bam = intermediate["bam"]
File fastq = intermediate["fastq"]
File filt_bam = intermediate["filter_bam"]
call ...
}
Is it possible to extract a key from an array of structs? I.e. get an array of File bams from the struct SampleInfo above? Or can you coerce a zip of arrays to an array of structs? Sorry just realized zip can only zip 2 arrays max. Makes sense since on files have an index and not another auxiliary file.
I agree with @geoffjentry - Tuple
would be more generally useful than Pair
. We might want to look at deprecating the later in favor of the former in a future version. Then we could also 1) add tuple destructuring as a language feature, and 2) allow zip
to take any number of arguments. The combination would address the OP's request.
If we switch from Pair
to Tuple
I'd also get rid of the left
/right
naming as well. I'm ambivalent about whether to use array-style indexing vs named elements, e.g., t.0
to get the first element.