wdl Add List Comprehension to the SPEC

A common pattern that has evolved in WDL is the need to convert a list of type A into a list of type B. Most programming languages support list comprehension to accomadate this, but WDL has no built in method for it. Coincidentally we have already implemented a similar pattern with ternary operators (if-then-else) but have not gone as far as list comprehension

Currently, you can fake list comprehension using a scatter

List[Sample] samples

scatter(sample in samples){
  String sample_names = sample.name
}

# List[String] sample_names

While this is technically a form of list comprehension, it has a few drawbacks

Additional boilerplate
Assignment to an extra non list variable may be confusing (we should address this in the future)
Cannot be used within expressions

Support List Comprehension Directly

The simplest strategy we can take is to support list comprehension directly within WDL, potentially following the Python syntax

List[Sample] samples
List[String] sample_names = [sample.name for sample in samples]

One unintended consequence (for better or worse) of the above example is that it adds for as a reserved KW to the specification, without introducing a for construct. I think this is an acceptable tradeoff, but if we wanted to keep with existing semantics we could do something

List[String] sample_names = [sample.name scatter(sample in samples)]
List[String] sample_names = [scatter(sample in samples) sample.name]
# maybe there is a different mode of scatter
List[String] sample_names = scatter(sample in samples) { sample.name }

An additional benefit to list comprehension is the ability to easily add filtering

List[String] sample_names = [sample.name for sample in samples if sample.type == "blood"]
List[String] sample_names = [sample.name scatter(sample in samples) if sample == "blood"]
List[String] sample_names = [sample.name scatter(sample in samples if sample == "blood")]
List[String] sample_names = scatter(sample in samples if sample == "blood" ) { sample.name }

May 23 '24 17:05 patmagee

It does not resemble python in the same way, But I personally kind of like [sample.name scatter(sample in samples if sample == "blood")], because it would potentially allow the following pattern:

# Filtered scatters without reassignment!
scatter(sample in sample if sample == "blood"){
  call foo
  call bar
  call biz
}

May 23 '24 17:05 patmagee

I like adding the filtering syntax to scatter. You could do:

Array[String] samples = ["blood", "tissue", "brain"]
scatter (sample in samples) {
  if (sample == "blood") {
    String sample_type = sample
  }
}

But then you end up with Array[String?] sample_type = ["blood", None, None], whereas

scatter (sample in samples if sample == "blood") {
  String sample_type = sample
}

would (presumably) yield Array[String] sample_type = ["blood"]. It's nice to not have to then select_all.

As for list comprehension, I'm supportive. I'd suggest dropping the parens and just using scatter as an alias for for. Or we could introduce for as a language keyword that would essentially be an alias for scatter.

List[String] sample_names = [sample.name scatter sample in samples if sample.type == "blood"]

or

List[String] sample_names = [sample.name for sample in samples if sample.type == "blood"]

May 23 '24 18:05 jdidion

I think the scatter keyword carries a connotation of the computation being distributed (even though that's not specified as a requirement of scatter - the engine has the option of executing it locally). So it might be a good idea to add the for keyword with the explicit requirement that for expressions are evaluated locally.

May 23 '24 18:05 jdidion

@jdidion although the implication with scatter is that the computations WITHIN the scatter are executed remotely but not the scatter itself

May 23 '24 18:05 patmagee

If this filtering functionality gets added to the scatter, is there really still a point in adding a separate list comprehension?

scatter(sample in samples if sample.include){String sample_names = sample.name}

and

Array[String] sample_names = [sample.name for sample in samples if sample.include]

would have the same effect.

Don't get me wrong, I like list comprehensions, but in this case it seems like it's just adding more syntax for people to learn. While the same effect can be achieved with already existing (except for the filtering) and equally concise syntax.

Regarding the distinction between scatter and for: I my mind, scatter implies simultaneous/parallel execution (regardless of whether that is done locally or distributed), whilst for implies consecutive execution (ie. the tradition programming for-loop). So every "iteration" of a scatter is independent and disconnected from every other "iteration". While an iteration of a for-loop may be impacted by the previous iterations.

Jun 04 '24 09:06 DavyCats

@DavyCats good question and I would say they are quite distinct, but the utility of the list comprehension is far greater.

Scatters are confined to workflows and introduce an inner scope and any variables set there will be available in the outer scope

List comprehension on the other hand

is concise
usable in tasks
usable within expressions
does not introduce anything within the outer scope

What I just realized is that we already have an implicit iterator foo in bar. So instead of adding new ways to refer to the same operation (ie a for keyword)maybe we just extend that

foo in bar if foo > 3: foo * 2

This can be the base form that fits really well into an existing scatter

scatter(foo in bar if foo > 3: foo * 2) {
  ...
}

Then you can use this in an expression directly

Array[File] files
Array[String] contents = [file in files if size(file, "mb") < 5: read_string(file)]

Jun 04 '24 14:06 patmagee

@patmagee this could work. I'm not sure about : (which we don't use anywhere else in wdl) vs. {}. E.g.

Array[String] contents = [file in files if size(file, "mb") < 5 { read_string(file) }]

Another thing to consider is an implicit iterator variable, e.g. it in Groovy. We already have input as a keyword, so we could repurpose that as an iterator variable:

Array[String] contents = [read_string(input) in files if size(file, "mb") < 5]

Jun 04 '24 16:06 jdidion

I wonder if it's better to just bite the bullet and introduce for as a keyword. Having sequential execution has been something that has been requested anyway so it could be a small step towards actual for loops

Jun 04 '24 23:06 patmagee

I feel like without introducing a more complex concept like map we are pigeonholed into introducing for, only for the sake of simplicity and consistency with user expectation

Array[String] = [ read_string(input) for file in files if size(file, "mb") < 5) ]

I am on the fence of whether we should add a scatter form at all, or keep the List comprehension constrained to the expression. While the fact that we can use it right in a scatter looks nice

# Simple form
scatter(foo in bar if foo > 3){

}

# mutate
scatter(foo * 2 for foo in bar if foo > 3){

}

It inevitably opens the door to a for scatter, that kind of looks weird

for (foo for foo in bar){

}

So it may actually be simpler to just use list comprehension inline. Although there is also the recursion issue again: x in .. b in ..

scatter(foo in [b * 2 for b in bar if b > 3]){

}

Jun 06 '24 16:06 patmagee

I agree with your last point - not necessary to change scatter syntax if we have list comprehension.

Jun 06 '24 16:06 jdidion

Part of the mental block with me on introducing list (or array in this case) comprehensions directly is because WDL doesn't have the ability to define new functions, nor does it have lambda functions. This means that anything more than the straightforward examples you provide is going to be pretty messy when crammed into the brackets, and expressing those sort of complex transformations succinctly is where list comprehensions shine in my view: the simpler cases, like what you show in your examples, aren't bothersome enough to extend the syntax in my view.

The example you gave that stuck out to me is this one:

List[String] sample_names = scatter(sample in samples if sample == "blood" ) { sample.name }

As has been discussed above, scatter can be pretty trivially used as a map/array comprehension, and your built in filtering solves the problem that John brought up about scatter otherwise returning an Array[String?] and requiring a select_all.

An example

Take this example, which I've noted where I think improvements could happen.

version 1.2

struct Sample {
  String kind
  String name
}

workflow run {
  Array[Sample] samples = [
    Sample {
      kind: "normal",
      name: "sample_one",
    },
    Sample {
      kind: "tumor",
      name: "sample_two",
    },
    Sample {
      kind: "normal",
     name: "sample_three",
    },
    Sample {
      kind: "tumor",
      name: "sample_four",
    },
  ]

  scatter (sample in samples) {
    # This `if` can be encapsulated in the scatter.
    if (sample.kind == "normal") {
      # This _could_ be pulled into the `scatter` or a list comprehension,
      # though I think the benefit is minimal.
      String name = sample.name
    }
  }

  output {
    # This `select_all` could be avoided by the scatter filter.
    Array[String] names = select_all(name)
  }
}

As I hope this demonstrates:

The value of introducing filtering to avoid boilerplate with the select_all seems apparent and something that is minimally disruptive to the existing syntax—I would be in favor of that.
The "mutation" of a value falls into one of two categories: both of which I think are not in favor of including it:
- If it's a trivial mutation (b * 2 or sample.name), I think its current form inside the scatter isn't noticeably painful to warrant the new syntax.
- If it's a nontrivial mutation, like mapping a whole new struct, list comprehensions are not going to be the right tool for that without the addition of custom functions or lambdas.

So, my end thought at the moment is to modify scatter to include optional filtering but not go through fully with list comprehension.

Jan 25 '25 20:01 claymcleod

In general I've never been super keen on more complex, non-task based operators. That said, if going this route the standard higher order function to convert As to Bs is map

Jan 25 '25 22:01 geoffjentry

wdl wdl copied to clipboard

Add List Comprehension to the SPEC

Support List Comprehension Directly

An example

wdl
wdl copied to clipboard