
Scatter over multiple items with literate names (as syntactic sugar)

Open jtratner opened this issue 5 years ago • 7 comments

While the pair syntax is helpful, if you have three or more arrays you're trying to zip, the syntax gets pretty messy.

For example, right now I can do:

scatter(pair in zip(zip(sample_bams, sample_metrics), sample_vcfs)) {
  call task { input: bam=pair.left.left, metric=pair.left.right, vcf=pair.right }
}

But I'd much prefer to be able to say:

scatter(bam, metric, vcf in zip(sample_bams, sample_metrics, sample_vcfs)) {
    call task{ input: bam=bam, metric=metric, vcf=vcf}
}

If the concern is scoping rules, perhaps it could be the equivalent to:

scatter(pair in zip(zip(sample_bams, sample_metrics), sample_vcfs)) {
    File bam = pair.left.left
    File metric = pair.left.right
    File vcf = pair.right
    call task { input: bam=bam, metric=metric, vcf=vcf }
}

jtratner avatar Jan 04 '19 08:01 jtratner

I am not sure how you are generating the upstream information, but would it be possible to use a struct to encapsulate all of the information you are trying to group together?

I.e.:


struct SampleInfo {
    File bam
    File metric
    File vcf
}

workflow myWorkflow {
  Array[SampleInfo] sampleInfo

  scatter(info in sampleInfo){
     call task { input: bam=info.bam, metric=info.metric, vcf=info.vcf }
  }
}

patmagee avatar Jan 08 '19 16:01 patmagee

Pair was never intended to be the final word. It originated from the need for a tuple construct, but the driving use case at the time only needed a tuple2, and implementing a Pair in Cromwell was a lot simpler than an arbitrarily sized tuple structure, so that's what we did. We figured we'd keep backwards compatibility if tuple was ever realized by making Pair an alias for a tuple2.

So I'd also consider whether an arbitrary tuple would make sense, and if so, whether it provides something that a struct does not (and/or is just syntactic sugar over a struct).

geoffjentry avatar Jan 08 '19 18:01 geoffjentry

Thanks for the response!

> I am not sure how you are generating the upstream information, but would it be possible to use a struct to encapsulate all of the information you are trying to group together?

I'm not sure :) . Right now this is the combination of a number of calls (slightly contrived example, but gets at the root of it):

scatter(fastq in fastqs) {
    call bwa { input: fastq=fastq}
}
scatter (bam in bwa.bam) {
   call filter_bam { input:  bam=bam}
}
call calc_metrics { input: bams=filter_bam.bam}

scatter (bam, filtered_bam, metric in zip(bwa.bam, filter_bam.bam, calc_metrics.metric)) {
 call ...
}

The reason I wouldn't just put this all in one scatter is that I have a normalization step that takes in all fastqs (or I might want to call out to some subworkflows). Regardless, I end up with a bunch of lists and I'm not sure how to convert them into structs.

Does that make sense?

> So I'd also consider whether an arbitrary tuple would make sense, and if so, whether it provides something that a struct does not (and/or is just syntactic sugar over a struct)

An arbitrary tuple makes sense in that it's pretty natural to want to iterate over a set of arrays whose items are all in the same order (especially because different scatters still guarantee output in the same order).
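
For reference, here is a minimal sketch of a workaround that relies on that ordering guarantee: scatter over indices instead of nesting zip. It assumes WDL 1.0 and that all three arrays have the same length; the task `summarize` and the workflow name are hypothetical stand-ins for the real per-sample call.

```wdl
version 1.0

# Hypothetical task standing in for the real per-sample step.
task summarize {
  input {
    File bam
    File metric
    File vcf
  }
  command <<<
    echo "~{bam} ~{metric} ~{vcf}"
  >>>
  output {
    String line = read_string(stdout())
  }
}

workflow scatter_by_index {
  input {
    Array[File] sample_bams
    Array[File] sample_metrics
    Array[File] sample_vcfs
  }

  # Assumption: the three arrays are the same length and in the same order.
  scatter (i in range(length(sample_bams))) {
    call summarize {
      input:
        bam = sample_bams[i],
        metric = sample_metrics[i],
        vcf = sample_vcfs[i]
    }
  }

  output {
    # Outputs of a scattered call are gathered into an array.
    Array[String] lines = summarize.line
  }
}
```

This avoids the pair.left.left chains, but it is more verbose than the proposed destructuring syntax and does not check that the arrays actually line up.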

jtratner avatar Jan 15 '19 04:01 jtratner

@jtratner One way you could go about this would be to build a series of structs, or even use a Map[String,File] type (or one struct with optional members). The struct approach may depend on struct literals working according to their new definition, so I will provide an answer using maps:

# Approach using Maps which can be implemented now
scatter(fastq in fastqs) {
    call bwa { input: fastq=fastq}
    Map[String,File] intermediate_1 = {"fastq": fastq, "bam": bwa.bam}
}
scatter (intermediate in intermediate_1) {
   call filter_bam { input:  bam=intermediate["bam"]}
   Map[String,File] intermediate_2 =  {"fastq":intermediate["fastq"],"bam":intermediate["bam"],"filter_bam": filter_bam.bam}
}

# filter_bam is still defined, so you can still reference it and receive an `Array[File]` type
call calc_metrics { input: bams=filter_bam.bam}


scatter (intermediate in intermediate_2) {
  File bam = intermediate["bam"]
  File fastq = intermediate["fastq"]
  File filt_bam = intermediate["filter_bam"]
 call ...
}

patmagee avatar Jan 18 '19 18:01 patmagee

Is it possible to extract a key from an array of structs? I.e. get an array of File bams from the struct SampleInfo above? Or can you coerce a zip of arrays to an array of structs? Sorry, I just realized zip can only zip 2 arrays max. Makes sense since on files have an index and not another auxiliary file.
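
On the first question, one approach that should work is to scatter over the array of structs and declare the member inside the scatter body, since declarations made inside a scatter are gathered into an array outside it. A minimal sketch, assuming WDL 1.0 and the SampleInfo struct from above; the workflow and variable names are hypothetical:

```wdl
version 1.0

struct SampleInfo {
  File bam
  File metric
  File vcf
}

workflow extract_bams {
  input {
    Array[SampleInfo] sample_info
  }

  scatter (info in sample_info) {
    # Inside the scatter this is a single File per iteration.
    File bam = info.bam
  }

  output {
    # Outside the scatter the same name refers to the gathered Array[File].
    Array[File] bams = bam
  }
}
```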

ghost avatar Feb 08 '21 01:02 ghost

I agree with @geoffjentry - Tuple would be more generally useful than Pair. We might want to look at deprecating the latter in favor of the former in a future version. Then we could also 1) add tuple destructuring as a language feature, and 2) allow zip to take any number of arguments. The combination would address the OP's request.

jdidion avatar Feb 07 '24 18:02 jdidion

If we switch from Pair to Tuple, I'd also get rid of the left/right naming. I'm ambivalent about whether to use array-style indexing vs named elements, e.g., t.0 to get the first element.

jdidion avatar Feb 07 '24 18:02 jdidion