wdl icon indicating copy to clipboard operation
wdl copied to clipboard

feat: adds enumerations

Open claymcleod opened this issue 1 year ago • 21 comments

This commit adds enumerations to WDL.

Based on my discussions and browsing previous threads surround the idea, it sounds like the value of adding enumerations to WDL is broadly agreed upon as an improvement regarding the user experience for limiting assignment to a set of valid values within a particular context; it's simply the details that need to be hashed out.

In this PR, enumerations in WDL are valued, meaning that they have an assigned valued for each variant therein. These values can either be explicitly or implicitly typed. When it comes to the details, I try to take a common sense approach and a "middle of the road" position on the spectrum of conciseness versus flexibility. In particular:

  • The inner type can be elided from the enum's signature if it is trivially computed (leans towards conciseness).
  • When you'd like to be explicit with the inner type to resolve ambiguity, you can be.
  • Enumeration variants with no assigned value are assumed to be typed as a String with an assigned value equivalent to the variant name (leans towards common sense and conciseness).
  • Enumerations that are not of the same type cannot be compared, as that is equivalent to a type mismatch (leans towards common sense approach, as comparing two enums of different types implies different contexts, even if the inner values of the variants are the same).

I'll leave the details to what I've written in the changeset.

Checklist

  • [x] Pull request details were added to CHANGELOG.md
  • [x] Valid examples WDL's were added or updated to the SPEC.md (see the guide on writing markdown tests)

claymcleod avatar Jan 15 '25 05:01 claymcleod

I think this is ready for a preliminary review—I haven't added any tests yet, but that's because I want to make sure the idea, as I have codified it here, has support from a good number of people before investing the time to do that.

claymcleod avatar Jan 16 '25 05:01 claymcleod

Open question, can the values of an enum be scattered? In most programming languages Enums are iterable.

Great point. In keeping with the spirit of what was done for similar methods, like getting the keys of an object, I will add a variants standard library function.

claymcleod avatar Jan 16 '25 18:01 claymcleod

Could the function just be called enumerate

markjschreiber avatar Jan 16 '25 18:01 markjschreiber

Could the function just be called enumerate

Personally, I think it makes me sense to call it variants. This is to be consistent with the choice of the function name matching the concept name for enumeration of those concepts within Objects. For example, keys defines what is being enumerated, not the literal name of the enumeration action (enumerate, enumerate_keys, enumerate_values). Since each member of an enum is called a "variant", the name variants would align with the pre-existing convention. Furthermore, you might want to enumerate more than just the variants, so it has the added benefit of being co-introduced with an overload to the values function for an enum.

claymcleod avatar Jan 16 '25 21:01 claymcleod

I've incorporated all of the feedback above, and this should be ready for a re-review. That being said, I don't like what I have on the page right now because I realized something while writing the example (well, maybe two things).

  • To my knowledge, the variants method would be a new paradigm being added to WDL, as it's a function that takes a custom type itself rather than an instance of the type. For example, you can iterate on the keys in a Map, but that's called on an instance of Map. There are, for example, no methods that iterate the fields in a Struct.
  • During enum assignment, you're really assigning the value to one of the variants within the enum, not the enum itself. So it's not strictly true to call this variable "an instance of the enum". It is probably more correct to say "an instance of one of the enum's variants".

Here's an example to make sure it's clear why this distinction is important.

enum FileType {
  FASTQ,
  BAM
}

FileType foo = FileType.FASTQ

# You wouldn't call `variants` on `foo`—why would that be necessary? And does it even really make sense? 
# `foo` represents a variant _within_ the enum rather than the `enum` itself, so enumerating the variants
# within the enum itself seems strange.
variants(foo)

# You'd more likely want to, say, scatter across the variants from the enum type directly.
# But this introduces sort of a strange-looking function call.
variants(FileType)

I will continue to chew on this but curious if others have thoughts on what should be done here. My gut reaction is that perhaps the definition custom types (enums and structs) need to introduce a namespace within the same name as the type (or alias) within which standard library functions can be constructed automatically. This would be similar to the :: operator in Rust, e.g.,

scatter(type in FileType::variants()) {
  # ...
}

claymcleod avatar Jan 18 '25 17:01 claymcleod

I've incorporated all of the feedback above, and this should be ready for a re-review. That being said, I don't like what I have on the page right now because I realized something while writing the example (well, maybe two things).

  • To my knowledge, the variants method would be a new paradigm being added to WDL, as it's a function that takes a custom type itself rather than an instance of the type. For example, you can iterate on the keys in a Map, but that's called on an instance of Map. There are, for example, no methods that iterate the fields in a Struct.
  • During enum assignment, you're really assigning the value to one of the variants within the enum, not the enum itself. So it's not strictly true to call this variable "an instance of the enum". It is probably more correct to say "an instance of one of the enum's variants".

Here's an example to make sure it's clear why this distinction is important.

enum FileType {
  FASTQ,
  BAM
}

FileType foo = FileType.FASTQ

# You wouldn't call `variants` on `foo`—why would that be necessary? And does it even really make sense? 
# `foo` represents a variant _within_ the enum rather than the `enum` itself, so enumerating the variants
# within the enum itself seems strange.
variants(foo)

# You'd more likely want to, say, scatter across the variants from the enum type directly.
# But this introduces sort of a strange-looking function call.
variants(FileType)

I will continue to chew on this but curious if others have thoughts on what should be done here. My gut reaction is that perhaps the definition custom types (enums and structs) need to introduce a namespace within the same name as the type (or alias) within which standard library functions can be constructed automatically. This would be similar to the :: operator in Rust, e.g.,

scatter(type in FileType::variants()) {
  # ...
}

@claymcleod What about a new paradigm but one that I think will be easier to roll out: Adding class level methods

enum FileType {
  FASTQ,
  BAM
}

FileType.variants()

patmagee avatar Jan 20 '25 21:01 patmagee

What if we think of an enum as a special kind of Map whose keys are strings and values are of a hidden type.

So

enum Foo {
  BAR = "bar"
  BAZ = "baz"
}

would be equivalent to the following

# this is a hidden type
struct Foo {
  String value
}

# this is a singleton instance - hand-wave that the instance
# can have the same name as the type
Map[String, Foo] Foo = {
  "BAR": Foo { value = "bar" },
  "BAZ": Foo { value = "baz" }
}

If we wanted to, we could also extend existing methods to handle an argument of type enum, e.g.

  • keys would return the names of the variants.
  • Indexing ([]) would enable getting a variant by name
Foo bar1 = Foo.BAR
Foo bar2 = Foo["BAR"]

There's also the question of whether an enum variant is automatically coerced to it's value type, or if an explicit call to .value is required:

# this
String bar_string1 = bar1
# or this?
String bar_string2 = bar2.value
# or both?

jdidion avatar Jan 22 '25 20:01 jdidion

@claymcleod What about a new paradigm but one that I think will be easier to roll out: Adding class level methods

enum FileType {
  FASTQ,
  BAM
}

FileType.variants()

Based on my understanding of what you're saying, I think both suggestions are essentially equivalent. The only real question is "which operator should we chose"? I can only think of one relatively good argument, and it's in favor of :: (or, more strictly, anything other than .).

If we introduce these static methods with a . operator, the next logical question is "why can I not do something like <variable>.keys() instead of keys(<variable>)? After all, the use of . to access methods directly attached to an instance of a class is going to be something that users are used to (and perhaps even expect) from their experience using more general programming languages. In truth, I'm not sure if there is a good reason that something like that doesn't work other than simplicity. So, in this case, it might be quite discordant to find that the less obvious case of using . (accessing static methods) works but the more obvious case of using . (accessing methods on a class instance) doesn't work.

In the case of an operator like :: (or any operator other than .), I don't believe these expectations (and, thus, the perceived discordance) are introduced—because :: is a more uncommon operator, it will likely be more natural to think of this as much different than what a method accessor would do. In other words, it doesn't follow that the addition of :: to access namespaced methods will introduce frustration about the lack of a method accessor.

Just to be clear, I don't think this has to be :: specifically. If we don't like the way it looks or we don't want to lean too hard into what Rust is doing (I'm obviously biased here). I just think it should probably not be . based on the argument above. To that end, I don't have a better suggestion for what it should be, and I'd be open to hearing any.

claymcleod avatar Jan 23 '25 06:01 claymcleod

What if we think of an enum as a special kind of Map whose keys are strings and values are of a hidden type.

So

enum Foo {
  BAR = "bar"
  BAZ = "baz"
}

would be equivalent to the following

# this is a hidden type
struct Foo {
  String value
}

# this is a singleton instance - hand-wave that the instance
# can have the same name as the type
Map[String, Foo] Foo = {
  "BAR": Foo { value = "bar" },
  "BAZ": Foo { value = "baz" }
}

If we wanted to, we could also extend existing methods to handle an argument of type enum, e.g.

* `keys` would return the names of the variants.

* Indexing (`[]`) would enable getting a variant by name
Foo bar1 = Foo.BAR
Foo bar2 = Foo["BAR"]

There's also the question of whether an enum variant is automatically coerced to it's value type, or if an explicit call to .value is required:

# this
String bar_string1 = bar1
# or this?
String bar_string2 = bar2.value
# or both?

I think I'm not connecting what the proposed advantages of this design are over what I've got in the PR, so I'm not sure on my overall feeling on this. If you could share a bit more from your perspective, that'd be great. To address some smaller points:

Accessors

Foo bar1 = Foo.BAR
Foo bar2 = Foo["BAR"]

On this, I'm not sure I like enabled the second line here. The core value of introducing enums is that there is a way to define a closed set of values that can be checked at compile time. When you start accessing things as if they were a map, you run into the issue that the variant might not exist, causing a runtime exception. Beyond that, I just don't see a lot of value to being able to access variants this way: to me, enums seem much stronger if we keep them as a strongly typed and statically checked part of the language with limited scope (in other words, if you need a Map, then use a Map).

Coercion

# this
String bar_string1 = bar1
# or this?
String bar_string2 = bar2.value
# or both?

Leaning a bit on my philosophy I laid out above regarding the core value of enums, there's two contexts you can work with an enum in this proposal:

  • Within the WDL type system, wherein enumerations should be strictly typed and statically verifiable. So, if you are referring to an enum variant as a type, then it should be a unique, closed type that only exactly matches the enum variant itself—the value within is never even considered.
  • Within an evaluation context (e.g., expressions or the command section), there is a one-way conversion of the enum variant to its inner value. This decision is made because it is asserted that you never want to resolve the name of an enum variant type directly (and, if you do, the shorthand version where the value within is a stringified version of the variant name helps you).

claymcleod avatar Jan 23 '25 16:01 claymcleod

I think adding a new operator like :: would make the language look more intimidating to non programmers (and maybe even more casual programmers not familiar with Rust or R). This would go against one of the central premises behind the design of WDL. I'm wondering whether this is even a necessary feature to add (obtaining a list of variants that is; enums would be very useful). Is there an actual use case for it?

If you do want to add it, would it be an idea to simply add a reserved variants member variable to the enum type? Essentially, enums would simply have an additional implicitly added variant which equals a list of all other variants.

eg.

enum Animal {
    CAT,
    DOG
}

scatter (animal  in Animal.variants) {} # = [Animal.CAT, Animal.DOG]

Regarding coercion, one might be using a task which takes some integer as input. Perhaps some filtering threshold. The task and underlying tool support any integer value. However, in this particular workflow, only a handful of values might actually be supported:

# task.wdl (Some third party developed task)
task filter_variants {
    input {
        Int threshold
        File vcf
    }
[...]

# workflow.wdl
import "https://example.com/task.wdl" as filter

enum FilterValue[Int] {
  STRICT = 10,
  NORMAL = 5,
  WHO_CARES = 1
}

workflow somatic {
    input {
        FilterValue stictness
        File tumor_fastq
        File normal_fastq
    }
[...]
    call filter.filter_variants {
        input:
            threshold = strictness, # coercion from FilterValue to Int
            vcf = vcf
    }
[...]

Would/should this be allowed?

DavyCats avatar Jan 24 '25 09:01 DavyCats

I think adding a new operator like :: would make the language look more intimidating to non programmers (and maybe even more casual programmers not familiar with Rust or R). This would go against one of the central premises behind the design of WDL. I'm wondering whether this is even a necessary feature to add (obtaining a list of variants that is; enums would be very useful). Is there an actual use case for it?

If you do want to add it, would it be an idea to simply add a reserved variants member variable to the enum type? Essentially, enums would simply have an additional implicitly added variant which equals a list of all other variants.

eg.

enum Animal {
    CAT,
    DOG
}

scatter (animal  in Animal.variants) {} # = [Animal.CAT, Animal.DOG]

I think this is a great point about not being approachable. So to me, that kills :: specifically. I do think there is going to be a use case for scattering over enum variants, but I'm also happy to make a case for it if we feel like we need to. In the meantime, I think there are three viable approaches to getting variants:

  • variants(FileType). I think this looks weird, but maybe it's okay.
  • FileType.variants(). We don't call functions/methods in this way historically, and perhaps it even runs afoul of the point you made above regarding being non-approachable (because it complicates the ways in which you can call a function on something—now I have to know what context I'm in and call it the right way).
  • FileType.variants. It feels strange to me to rather arbitrarily dictate that there is a field named variants that is reserved. It's also a bit magical to automatically get this field assigned to the enum.

Overall, I think I'm leaning towards coming back to the first one—it looks a bit weird, but it's consistent with the way we've called functions in the past, and I think it's the least magical. Perhaps once we put in the right syntax highlighting, this wouldn't seem as weird as I'm envisioning.

Regarding coercion, one might be using a task which takes some integer as input. Perhaps some filtering threshold. The task and underlying tool support any integer value. However, in this particular workflow, only a handful of values might actually be supported:

# task.wdl (Some third party developed task)
task filter_variants {
    input {
        Int threshold
        File vcf
    }
[...]

# workflow.wdl
import "https://example.com/task.wdl" as filter

enum FilterValue[Int] {
  STRICT = 10,
  NORMAL = 5,
  WHO_CARES = 1
}

workflow somatic {
    input {
        FilterValue stictness
        File tumor_fastq
        File normal_fastq
    }
[...]
    call filter.filter_variants {
        input:
            threshold = strictness, # coercion from FilterValue to Int
            vcf = vcf
    }
[...]

Would/should this be allowed?

In this case, as proposed, I think the automatic coercion would not work out of the box—doing the coercion implicitly within the type system is probably not a good idea. There might be a case for a method called value(<variant>) that gets the value out of a variant to satisfy the condition you're talking about here.

claymcleod avatar Jan 24 '25 15:01 claymcleod

My preference is also for the first alternative (e.g. variants(FileType)), simply for consistency with the rest of WDL and to avoid adding new/overloaded syntax, but thinking about it from a programming language perspective, to treat it as a function call to variants, it would need to accept a value in that position as functions take values. I would greatly prefer not to go down the road of having "special" functions that aren't of the production form (value*) -> value.

In this instance, FileType would be a name reference expression and normally we'd try to resolve that against local names (there's nothing special about custom type names in WDL). For this to work, we'd have to also check the reference against custom type names and also prohibit a local name from conflicting with a custom type name; I'm not sure if the spec already requires that, but right now miniwdl warns about the name conflict and sprocket does not emit a diagnostic. The expression would then need to produce a "type reference" value which can be passed to the function.

Type references (i.e. the ability to reference a type as a runtime value) are not currently represented in WDL's value system.

The other syntaxes would not require modifying the value system to work as they could be accomplished by special-casing the access expression to see if we're operating on a name reference to a custom type and skip evaluation of the name reference expression (i.e. the expression FileType would therefore not need to produce a value).

It also means that the expression (FileType).variants() would not be equivalent as we'd treat that as a name reference to FileType, which wouldn't exist as a local name, assuming the conflicting name would be prohibited. This distinction can be seen in languages that use a single access operator for both static and instance member access (for example Java and C#).

peterhuene avatar Jan 24 '25 17:01 peterhuene

One quick thought based on a skim: Given the primary domain in which WDL is used, .variants() maybe a bit too overloaded of a term

geoffjentry avatar Jan 24 '25 17:01 geoffjentry

Given the contention that the variants function has, I wonder if we need to release the enum feature with it, or if we can release and gauge interest in that feature. If we see Workflows in the wilds that are doing something like:

scatter(type in [FileType.FASTQ,FileType.BAM]) {

}

Then we have a pretty good signal people actually want it and we can come back to it

patmagee avatar Jan 24 '25 18:01 patmagee

Yeah I think that's reasonable Patrick. So, I say we scratch the notion of enumerating variants for now and see whether people ask for it. As you show, it will be possible to enumerate over values explicitly like that, it's just annoying and not really scalable.

This also limits the impact of Geoff's observation, as the only time users will need to wrangle with them being called variants is in the documentation (I think variants is still the proper choice here, but I do agree that conflating the topics of variants in these two contexts on a day-to-day level are not helpful).

After that decision, I'm quite sure that all of the feedback has been addressed and now this is just ready for a review again.

claymcleod avatar Jan 24 '25 20:01 claymcleod

Thanks for the feedback! After a few months of thinking on this, I've come back with a fresh mind and made the following decisions.

@peterhuene, I've addressed your your concerns in the following ways:

  1. Changed the FavoriteNumber example to correctly coerce to Float and added a new error example showing truly incompatible types.
  2. Explicitly documented that enum values can be any WDL type, including compound types like Array[String] and Map[String, Int] (see example updates).
  3. Kept the explicit type syntax enum Foo[T] to as you discussed if compound types were allowed.
  4. Made clear that enum values must be literal expressions since enums are global.

Other improvements:

  • Replaced the .name member with a name() standard library function to get the stringified variant name. This was needed after some thought regarding how a struct might have a .name member.
  • Added names() and variants() functions as well.

claymcleod avatar Oct 26 '25 20:10 claymcleod

Apologies, I forgot to push my changes on the last comment.

@patmagee, I've added a section on serialization/deserialization. In brief, I think serializing things in "user land" (i.e., the JSON inputs and outputs) should be using the key while serializing enums in the command block should be using the inner value.

claymcleod avatar Oct 27 '25 00:10 claymcleod

  • I think it makes sense to refer to an enum as a kind of custom type. Technically, it would be most like a type alias with a restricted set of values.
  • I am against introducing more implicit typing on principal, as I think we've seen some of these issues this can introduce and have been moving towards making WDL more explicit. I am fine with enum variants being able to take values that are defined using literals of different types, as long as those values are immediately coerced to the enum's underlying type. I think it should be required to always specify the type of an enum when there is any ambiguity.
    # eliding the type is ok as it's unambiguously `String`
    enum Simple { A, B }
    # eliding the type is not ok as variants are defined using literals of different types
    enum Ambig[Float] { X = 1, Y = 2.0 }
    # the type is unambiguously `Array[Integer]` so we could elide the type, but
    # I would still be in favor of requiring the type to be explicit for both clarity and
    # ease of implementation
    enum Shrug[Array[Integer]] { M = [1, 2, 3], N = [4, 5, 6] }
    
  • Do we support coercion of an enum variant to it's underlying type? I think the answer should be no. I like the idea of an explicit value function. That way, different enum variants are unequal (thus answering your question about equality) whereas their values can be the same.
    enum Foo[String] { Val1 = "Hello", Val2 = "Hello" }
    
    workflow test {
        if (Foo.Val1 == Foo.Val2) { # evaluates to false }
        if (value(Foo.Val1) == value(Foo.Val2)) { # evaluate to true }
    }
    
  • The names() and variants() functions are unlike any other in the standard library - I think it might be useful to introduce new notation, e.g. Array[String] names(T) where T is a custom type (Struct or Enum).
  • I also don't think these functions are strictly necessary. We could instead use python-inspired notation:
    enum Foo {
        Bar = 1,
        Baz = 2
    }
    
    workflow test {
        # scatter over variants of Foo
        scatter (foo in Foo) {
            # get variant of Foo using dot notation
            if (foo == Foo.Bar) { ... }
            # get variant of Foo by name
            if (foo == Foo["Bar"]) { ... }
            # get name of Foo variant
            if (name(foo) == "Bar") { ... }
            # get value of Foo variant
            if (value(foo) == 1) { ... }
        }
    }
    

jdidion avatar Oct 30 '25 17:10 jdidion

  • I think it makes sense to refer to an enum as a kind of custom type. Technically, it would be most like a type alias with a restricted set of values.

    • I am against introducing more implicit typing on principal, as I think we've seen some of these issues this can introduce and have been moving towards making WDL more explicit. I am fine with enum variants being able to take values that are defined using literals of different types, as long as those values are immediately coerced to the enum's underlying type. I think it should be required to always specify the type of an enum when there is any ambiguity.

      # eliding the type is ok as it's unambiguously `String`
      enum Simple { A, B }
      # eliding the type is not ok as variants are defined using literals of different types
      enum Ambig[Float] { X = 1, Y = 2.0 }
      # the type is unambiguously `Array[Integer]` so we could elide the type, but
      # I would still be in favor of requiring the type to be explicit for both clarity and
      # ease of implementation
      enum Shrug[Array[Integer]] { M = [1, 2, 3], N = [4, 5, 6] }
      
    • Do we support coercion of an enum variant to it's underlying type? I think the answer should be no. I like the idea of an explicit value function. That way, different enum variants are unequal (thus answering your question about equality) whereas their values can be the same.

      enum Foo[String] { Val1 = "Hello", Val2 = "Hello" }
      
      workflow test {
          if (Foo.Val1 == Foo.Val2) { # evaluates to false }
          if (value(Foo.Val1) == value(Foo.Val2)) { # evaluate to true }
      }
      
    • The names() and variants() functions are unlike any other in the standard library - I think it might be useful to introduce new notation, e.g. Array[String] names(T) where T is a custom type (Struct or Enum).

    • I also don't think these functions are strictly necessary. We could instead use python-inspired notation:

      enum Foo {
          Bar = 1,
          Baz = 2
      }
      
      workflow test {
          # scatter over variants of Foo
          scatter (foo in Foo) {
              # get variant of Foo using dot notation
              if (foo == Foo.Bar) { ... }
              # get variant of Foo by name
              if (foo == Foo["Bar"]) { ... }
              # get name of Foo variant
              if (name(foo) == "Bar") { ... }
              # get value of Foo variant
              if (value(foo) == 1) { ... }
          }
      }
      

Everything you said here either (a) makes sense to me or (b) is already how the proposal lays things out.

claymcleod avatar Oct 31 '25 14:10 claymcleod

After some discussion with Peter, I've decided to walk back introducing the names() and variants() functions in WDL v1.3. Changing the language to support types themselves being passed into standard library functions seems like it might be a bit much to roll into this release, and these methods aren't strictly needed to get most of the value. We'll still have the value() method, since that is needed and operates on a value instead of a type.

claymcleod avatar Oct 31 '25 15:10 claymcleod

This is now fully implemented here: https://github.com/stjude-rust-labs/sprocket/pull/445. I believe all of the feedback has been addressed.

claymcleod avatar Nov 06 '25 17:11 claymcleod