wdl
wdl copied to clipboard
Add List Comprehension to the SPEC
A common pattern that has evolved in WDL is the need to convert a list of type A into a list of type B. Most programming languages support list comprehension to accomadate this, but WDL has no built in method for it. Coincidentally we have already implemented a similar pattern with ternary operators (if-then-else) but have not gone as far as list comprehension
Currently, you can fake list comprehension using a scatter
List[Sample] samples
scatter(sample in samples){
String sample_names = sample.name
}
# List[String] sample_names
While this is technically a form of list comprehension, it has a few drawbacks
- Additional boilerplate
- Assignment to an extra non list variable may be confusing (we should address this in the future)
- Cannot be used within expressions
Support List Comprehension Directly
The simplest strategy we can take is to support list comprehension directly within WDL, potentially following the Python syntax
List[Sample] samples
List[String] sample_names = [sample.name for sample in samples]
One unintended consequence (for better or worse) of the above example is that it adds for as a reserved KW to the specification, without introducing a for construct. I think this is an acceptable tradeoff, but if we wanted to keep with existing semantics we could do something
List[String] sample_names = [sample.name scatter(sample in samples)]
List[String] sample_names = [scatter(sample in samples) sample.name]
# maybe there is a different mode of scatter
List[String] sample_names = scatter(sample in samples) { sample.name }
An additional benefit to list comprehension is the ability to easily add filtering
List[String] sample_names = [sample.name for sample in samples if sample.type == "blood"]
List[String] sample_names = [sample.name scatter(sample in samples) if sample == "blood"]
List[String] sample_names = [sample.name scatter(sample in samples if sample == "blood")]
List[String] sample_names = scatter(sample in samples if sample == "blood" ) { sample.name }
It does not resemble python in the same way, But I personally kind of like [sample.name scatter(sample in samples if sample == "blood")], because it would potentially allow the following pattern:
# Filtered scatters without reassignment!
scatter(sample in sample if sample == "blood"){
call foo
call bar
call biz
}
I like adding the filtering syntax to scatter. You could do:
Array[String] samples = ["blood", "tissue", "brain"]
scatter (sample in samples) {
if (sample == "blood") {
String sample_type = sample
}
}
But then you end up with Array[String?] sample_type = ["blood", None, None], whereas
scatter (sample in samples if sample == "blood") {
String sample_type = sample
}
would (presumably) yield Array[String] sample_type = ["blood"]. It's nice to not have to then select_all.
As for list comprehension, I'm supportive. I'd suggest dropping the parens and just using scatter as an alias for for. Or we could introduce for as a language keyword that would essentially be an alias for scatter.
List[String] sample_names = [sample.name scatter sample in samples if sample.type == "blood"]
or
List[String] sample_names = [sample.name for sample in samples if sample.type == "blood"]
I think the scatter keyword carries a connotation of the computation being distributed (even though that's not specified as a requirement of scatter - the engine has the option of executing it locally). So it might be a good idea to add the for keyword with the explicit requirement that for expressions are evaluated locally.
@jdidion although the implication with scatter is that the computations WITHIN the scatter are executed remotely but not the scatter itself
If this filtering functionality gets added to the scatter, is there really still a point in adding a separate list comprehension?
scatter(sample in samples if sample.include){String sample_names = sample.name}
and
Array[String] sample_names = [sample.name for sample in samples if sample.include]
would have the same effect.
Don't get me wrong, I like list comprehensions, but in this case it seems like it's just adding more syntax for people to learn. While the same effect can be achieved with already existing (except for the filtering) and equally concise syntax.
Regarding the distinction between scatter and for: I my mind, scatter implies simultaneous/parallel execution (regardless of whether that is done locally or distributed), whilst for implies consecutive execution (ie. the tradition programming for-loop). So every "iteration" of a scatter is independent and disconnected from every other "iteration". While an iteration of a for-loop may be impacted by the previous iterations.
@DavyCats good question and I would say they are quite distinct, but the utility of the list comprehension is far greater.
Scatters are confined to workflows and introduce an inner scope and any variables set there will be available in the outer scope
List comprehension on the other hand
- is concise
- usable in tasks
- usable within expressions
- does not introduce anything within the outer scope
What I just realized is that we already have an implicit iterator foo in bar. So instead of adding new ways to refer to the same operation (ie a for keyword)maybe we just extend that
foo in bar if foo > 3: foo * 2
This can be the base form that fits really well into an existing scatter
scatter(foo in bar if foo > 3: foo * 2) {
...
}
Then you can use this in an expression directly
Array[File] files
Array[String] contents = [file in files if size(file, "mb") < 5: read_string(file)]
@patmagee this could work. I'm not sure about : (which we don't use anywhere else in wdl) vs. {}. E.g.
Array[String] contents = [file in files if size(file, "mb") < 5 { read_string(file) }]
Another thing to consider is an implicit iterator variable, e.g. it in Groovy. We already have input as a keyword, so we could repurpose that as an iterator variable:
Array[String] contents = [read_string(input) in files if size(file, "mb") < 5]
I wonder if it's better to just bite the bullet and introduce for as a keyword. Having sequential execution has been something that has been requested anyway so it could be a small step towards actual for loops
I feel like without introducing a more complex concept like map we are pigeonholed into introducing for, only for the sake of simplicity and consistency with user expectation
Array[String] = [ read_string(input) for file in files if size(file, "mb") < 5) ]
I am on the fence of whether we should add a scatter form at all, or keep the List comprehension constrained to the expression. While the fact that we can use it right in a scatter looks nice
# Simple form
scatter(foo in bar if foo > 3){
}
# mutate
scatter(foo * 2 for foo in bar if foo > 3){
}
It inevitably opens the door to a for scatter, that kind of looks weird
for (foo for foo in bar){
}
So it may actually be simpler to just use list comprehension inline. Although there is also the recursion issue again: x in .. b in ..
scatter(foo in [b * 2 for b in bar if b > 3]){
}
I agree with your last point - not necessary to change scatter syntax if we have list comprehension.
Part of the mental block with me on introducing list (or array in this case) comprehensions directly is because WDL doesn't have the ability to define new functions, nor does it have lambda functions. This means that anything more than the straightforward examples you provide is going to be pretty messy when crammed into the brackets, and expressing those sort of complex transformations succinctly is where list comprehensions shine in my view: the simpler cases, like what you show in your examples, aren't bothersome enough to extend the syntax in my view.
The example you gave that stuck out to me is this one:
List[String] sample_names = scatter(sample in samples if sample == "blood" ) { sample.name }
As has been discussed above, scatter can be pretty trivially used as a map/array comprehension, and your built in filtering solves the problem that John brought up about scatter otherwise returning an Array[String?] and requiring a select_all.
An example
Take this example, which I've noted where I think improvements could happen.
version 1.2
struct Sample {
String kind
String name
}
workflow run {
Array[Sample] samples = [
Sample {
kind: "normal",
name: "sample_one",
},
Sample {
kind: "tumor",
name: "sample_two",
},
Sample {
kind: "normal",
name: "sample_three",
},
Sample {
kind: "tumor",
name: "sample_four",
},
]
scatter (sample in samples) {
# This `if` can be encapsulated in the scatter.
if (sample.kind == "normal") {
# This _could_ be pulled into the `scatter` or a list comprehension,
# though I think the benefit is minimal.
String name = sample.name
}
}
output {
# This `select_all` could be avoided by the scatter filter.
Array[String] names = select_all(name)
}
}
As I hope this demonstrates:
- The value of introducing filtering to avoid boilerplate with the
select_allseems apparent and something that is minimally disruptive to the existing syntax—I would be in favor of that. - The "mutation" of a value falls into one of two categories: both of which I think are not in favor of including it:
- If it's a trivial mutation (
b * 2orsample.name), I think its current form inside thescatterisn't noticeably painful to warrant the new syntax. - If it's a nontrivial mutation, like mapping a whole new struct, list comprehensions are not going to be the right tool for that without the addition of custom functions or lambdas.
- If it's a trivial mutation (
So, my end thought at the moment is to modify scatter to include optional filtering but not go through fully with list comprehension.
In general I've never been super keen on more complex, non-task based operators. That said, if going this route the standard higher order function to convert As to Bs is map