wdl Support Defining Environment Variables Within a Task

Support Defining Environment Variables Within a Task

Open patmagee opened this issue 2 years ago • 40 comments

This PR is the result of a few discussions that have been ongoing over the past while (https://github.com/openwdl/wdl/discussions/458 and https://github.com/openwdl/wdl/discussions/443) of different ways to make improve the security and usability of WDL.

Declare Environment Variables within a Task

This PR adds the ability to declare environment variables within the task scope, as well as to pass values for those variables within a call block. Most WDL engines rely on templating in values to the command section prior to command execution. While this is convenient and allows us to define our own expression language within a command block, it does mean that any WDL command is subject injection attacks or poor formatting during execution.

Using environment variables circumvents this issue by not relying on templating at all. Instead the execution engine safely escapes strings prior to setting them within the container (or other execution) environment at runtime. Environment variables are not usable within other WDL expressions and once defined are accessed using the normal shell semantics associated with the given interpreter.

Environment variables are all implicitly strings, and therefore can be set to any expression which either evaluates to a string, or is coercible to one.

task foo {
  input {
   String my_inp
  }
  env {
   MY_ENV_WITH_DEFAULT = my_inp
   MY_ENV
  }

  command <<<
    echo $MY_ENV $MY_ENV_WITH_DEFAULT
  >>>
}

workflow bar {
  
  call foo {
    input : my_inp = "foo"
    env: MY_ENV = "bar"
  }
}

Checklist

[ ] Pull request details were added to CHANGELOG.md

May 13 '22 20:05 patmagee

Thanks @patmagee for moving this along! As you know, I strongly support the general idea of enabling environment variables as an alternative (often a superior one) to ~{} text interpolations. As to the fine details of how to express it:

I'd prefer not to need the separate env: section of the calls. I think of the env entries as simply additional string inputs, which the engine will know to set in the container environment. If we make input & env more-separate concepts as proposed, then WDL implementations need to be extended to model them separately (for example- WDL.Task.available_inputs and WDL.Task.required_inputs need to mitosis into inputs & envs, and then all code consuming them needs to consider both). It'll be easier if we think of envs as just more inputs, that the runtime does something a bit different with.

So the call would just look like

  call foo {
    input : my_inp = "foo",
            MY_ENV = "bar"
  }

(Aside- I think we should get rid of the useless input: keyword in calls anyway, but that'll be another thread!)

Another idea worth considering was to use type specializations instead of a separate section, keeping the current declaration syntax, e.g.,

task foo {
  input {
   SomeStruct my_inp
   EnvString MY_ENV
  }

  EnvString STRUCT_JSON_FILE = write_json(my_inp)

  command <<<
    echo $MY_ENV $STRUCT_JSON_FILE
  >>>
}

workflow bar {
  
  call foo {
    input : my_inp = SomeStruct { ... },
            MY_ENV = "bar"
  }
}

This allows the task author to control which environment variables the caller can and can't override.

My feeling is that either of the above approaches do the job well and are somewhat simpler for both WDL authors and engine implementers.

May 14 '22 08:05 mlin

Another angle on the prior comment: the container environment variables are an implementation detail that should be encapsulated inside the task. The caller doesn't really need to know whether a given input will turn into a container environment variable or not. The task author might want to change how they consume an input from interpolation to environment variable, and such a change needn't trigger required changes in calls too.

May 14 '22 08:05 mlin

@mlin that is a very valid point. The one thing I was trying to avoid was the automatically setting environment variables in the task based on input strings. Without a special prefix that would likely be error prone and potentially break many different tasks.

I must admit, I was initially torn between a type and a an env block. Both felt natural, the block felt like it encapsulated the concept of environment variables well. But having them buble up to the level of the workflow does feel odd to be honest. I guess that leaves us with a few approaches

Option 1: Keep the block in a task but require values to be set

This option would be a good way of being very declarative. We can keep the env block within a task that I created. The difference though is that every item must be assigned, either by a literal value or from an input value / private declaration.

The benefit to this approach is that it proves fine grained control over what gets exposed in the environment and it groups everything together.

task foo {
  input {
    String myInp
    Int maxNum
  }
  
  env {
   MYINP = myInp
   OTHER = "someValue"
   MAXNUM = maxNum
  }
}

The drawback to this approach is that it adds potentially more boilerplate to a task. The declared environment variables would still only be usable within the command block, but the workflow has no concept of environment variables

Option 2: Use a declared type

This would be following your suggestion and use a type like EnvString. It makes sense and I rather like it, especially if that string is treated like a normal string in any other wdl expression.

task foo {
  input {
   EnvString FOO
  }
}

[ ] The drawback of this approch is it adds a new type handled the same as a string but with special meaning

Option 3: add some sort of qualifier to the type

We could also go the route of adding an additional qualifier to the type or the declaration some ideas would be.

task foo {
  input {
    # explicitly state where to expose the string. this would be my favourite probably
    String FOO in env
    # use a type qualifier preceding the type. This one could also be my favorite
    env String FOO
    # Use some sort of qualifier coming after the type declaration
    String@env FOO
    String:env FOO
    String{env}

Type qualifiers have the drawback (or benefit) of being applicable to any type. Therefore any type could potentially be set in the env and it's the engines role to figure that out. Ie serialization of a struct or other complex type

I am definitely with you that we do not want to expand the concept of what constitutes an input, at least at this point.

May 14 '22 13:05 patmagee

# use a type qualifier preceding the type. This one could also be my favorite
env String FOO

I like this too!

One further question would be whether to allow env to prefix any type (Int, SomeStruct, etc.) and let the engine string-ify it (perhaps as JSON for compound types).

I think I'd say yes, but could be convinced otherwise

May 15 '22 03:05 mlin

@mlin I do not see a reason to limit this to only strings. I think allowing any type would gain a lot of support from the community. You could put it in a struct then easily use something like jq to manipulate the contents of the env variable

env Map[String,Array[String]] COMPLEX_TYPE
env MyStruct FOO

One question then is whether the env qualifier can be used outside of the task block. And whether it changes the type signature at all. For example would String and env String be equivalent? I think the answer is yes in the context of wdl expressions. Which if that is the case, the env keyword is less a qualifier and more a directive, telling the engine how to handle that variable

May 15 '22 16:05 patmagee

Adding something like directives as a concept can generalize as well. For example in a task you could define readonly File myFile or remote File foo to indicate that a file should not be localized.

May 15 '22 16:05 patmagee

Using environment variables circumvents this issue ..."

"Environment variables are not usable within other WDL expressions"

Due to environment variables not being usable in WDL expressions, that limits their power quite a bit. For some input variables being able to derive other variables from them is quite essential for the good functioning of a task. Therefore environment variables are never a complete solution. As such there is not one way to define inputs, but two equally valid ways when this becomes part of the spec.

and once defined are accessed using the normal shell semantics associated with the given interpreter.

I find this a bit shaky ground, as this means that the engine and the shell will be more tightly coupled. I know the current spec defines bash as the standard shell, but some containers do not have bash. As such, on biowdl/tasks we try to avoid bashisms to make sure any task can run on other shells as well.

My main problem with adding environment variables is that instead of one, we create two ways of defining inputs. This is quite confusing, as it is quite arbitrary which method is to be preferred. If I may quote the Zen of Python: "There should be one-- and preferably only one --obvious way to do it." This change violates that principle and therefore it makes WDL less usable.

Given that the actual problem that we are trying to solve here is " injection attacks or poor formatting during execution" I think the solution should be more tailored towards that end. Another solution is to quote all WDL variables by default while templating and to provide a function that disables this (conveniently called unsafe(...)). That is a much smaller and more targeted solution to the problem at hand.

May 16 '22 05:05 rhpvorderman

Environment variables are a universal mechanism that every shell/language interpreter will have an idiomatic way to consume, and will also be more familiar to authors learning WDL who do have other scripting experience.

I think shakier ground is the suggestion that engines could escape each value in a way that would be so universally safe and compatible. That suggestion works better if, on the contrary, we double-down on specifying that bash is the interpreter -- but even then, we'd face the common pattern of the bash script running an embedded heredoc in another language.

Re "Environment variables are not usable within other WDL expressions" -- I doubt that restriction is really needed. I view them as additional input declarations, so of course, they should be usable in other WDL expressions.

Re creating two ways of defining inputs -- it's an important point but my personal opinion is that the interpolation of strings directly into the command script was a minor design flaw from the beginning, and we should gradually deprecate them in favor of environment variables; but I don't expect everyone to share that opinion exactly...

May 16 '22 06:05 mlin

@mlin. Agreed. Redesigning WDL to only use environment variables and deprecate the templating is the better option.

Re "Environment variables are not usable within other WDL expressions" -- I doubt that restriction is really needed. I view them as additional input declarations, so of course, they should be usable in other WDL expressions.

If I add up things correctly I come to the following conclusions:

There is no need for an additional syntax. Input variables can be declared in their current way.
Input variables can be used in expressions in any part of the WDL. Except the command block.
Inside the command block, only variable names can be used not expressions (i.e. ~{my_var} not {my_var * 5}. These variables are passed as environment variables.

This solves the templating/injection issue. While restricting the language instead of expanding it. That seems preferable to me.

May 16 '22 07:05 rhpvorderman

Redesigning WDL to only use environment variables and deprecate the templating is the better option.

That's a really interesting concept. It would let us remove the (command) interpolation syntax entirely. And allow things like arrays to be supplied natively. Are there any use cases which would get worse if this were the case?

May 16 '22 18:05 cjllanwarne

Are there any use cases which would get worse if this were the case?

Supplying optional variables comes to mind. Where you can do ~{"--optional-flag " + optional_value} now. That is a very common use case.

I guess that could still be made possible for optional values without templating? So that it either evaluates to "" or --optional-flag $OPTIONAL_VALUE.

In retrospect my statement "Input variables can be used in expressions in any part of the WDL. Except the command block." was a bit harsh. I guess you can still allow any sort of expression, as long as it evaluates to a singular variable that can be saved as an environment variable and does not depend on being templated to text.

May 17 '22 05:05 rhpvorderman

And another thing that is hairy: How to work with tools that use the same flag multipe times to specify output files. For instance -o output1.fastq.gz -o output2.fastq.gz -o output3.fastq.gz. Since you can have either R1, R1,R2, or R1,R2,R3 it makes sense to have the tool use a flexibel output flag like that. How to use that when disallowing templating? (I think templating is not too bad in this case, since the developer specifies the sep, the env vars can still be created as ENV_ONE, ENV_TWO etc.)

May 27 '22 12:05 rhpvorderman

CWL has a pretty nice mechanism for constructing command lines, I think there's value in being separate from that though, for that reason I do like the command block in WDL, and generally making information available via env variables

May 27 '22 23:05 illusional

So, how about this for a concrete proposal:

Instead of the env { .... } block introduced in this PR, we add a new concept to WDL called Directives or Decorators (lets fight over what name seems better here). These are symbols that immediately preceed the type declaration and provide additional handling information for a specific Wdl Declaration.

In relation to environment variables, we can use the env direction/decorator/annotation to signal to a task that an environment variable should be generated from the given WDL declaration.


task foo {
  input {
    env String HELLO
  }

  # Also works
  String HELLO_FROM_BOB = HELLO + " from bob"

  command <<<
    echo ${HELLO}
    # Still works
    echo ~{HELLO}
  >>> 

}

Jun 14 '22 12:06 patmagee

@patmagee while I like the concrete proposal env String etc., as well as some of the other ideas mentioned, I'm a little wary of calling it an instance of a more-general feature (Directive/Decorator) without thinking through what else we might have that feature do, which might delay us.

As an example of the can of worms I think that could open, a long time ago @orodeh and I sketched out adding a metadata section for each input declaration, e.g.

input {
  File reads { help: "BAM file", patterns: ["*.bam"] }
}

To be clear: I'm NOT trying to advance that now, just mentioning the kind of thing that might bog us down if we get into use cases for a general concept like Directive/Decorator.

Jun 17 '22 08:06 mlin

@mlin good point, I am happy to keep this restricted to just the env keyword for now and see where we naturally extend this too

Jun 17 '22 14:06 patmagee

I modified the PR to remove the explicit env {...} section to allow ANY task input or declaration to be set as an environment variable using the env modifier

Aug 10 '22 20:08 patmagee

@mlin @illusional @rhpvorderman @cjllanwarne do you have any issues with the current approach?

Sep 21 '22 16:09 patmagee

@patmagee, my primary comment is that I think this is a inverse approach by requiring the maintainer to put env in front off everything to make sure no shell code can be injected. Forget the env keyword one time? That is a security-hole. If security is a concern, environment variables should be the default, and in cases where that leads to problems a template keyword should be used to distinguish variables that rely on the old behavior. EDIT: (This would of course be a change for 2.0 then, as it is not fully backwards-compatible.)

Speaking from a BioWDL perspective: we don't have any pressing security concerns since all our tasks are containerized, and only user inputs are mounted in the containers. The containers are run as the UID exclusively and not as root. Furthermore our pipelines are not run as a service, and run by individual users on our compute cluster. They can only break things they have the right to break in the first place so it makes no sense adding lots of guards against this. In practice security is a non-issue for us. I say this not to invalidate the PR, but to give the proper weight to my own opinion: very low. We have no stake in this.

Sep 23 '22 06:09 rhpvorderman

I have been thinking somewhat more about this and I start to wonder: what does this feature protect you from that containerization does not already? If users can submit their own workflows to the system, then this addition is pointless. So this protects specifically those instances where an entity provides WDL as a service, but did not containerize the backend properly? (Instead opting to provide all bioinformatics tools under the sun in a single environment without dependency conflicts). That seems a very unlikely use case. There is a clear trade-off at work here. The language will be more complex, but also more secure. But how much more secure? Without a detailed assessment of the security implications it is impossible to make a proper judgement. On the other hand, with the changes I proposed above (environment variables are default) the complexity cost is going to be so low that it is worth it, in my opinion.

Sep 23 '22 07:09 rhpvorderman

@rhpvorderman you make some good points. To start, containerization gets you pretty far down the road of security assuming that the system is using a secure container technology. Unfortunately even in the most restricted environments where users can only pick from a list of pre-approved workflows there is the opportunity for abuse. For example, If you happen to know the execution environment, you could construct a nefarious string that would run a specific attack within the container (not so bad), but more importantly within the network the container is running in potentially under any authority assigned to that container/vm. Google tools are great examples of this, where there is a metadata api allowing you to run gsutil within a container and actually retrieve a service account dynamically simply by running gsutil on a VM.

So all that is to say, I think in some circumstances the security concerns are warranted, but I agree that the overall impact to the user should be as minimal as possible. To be honest, the original motivator was to not override any existing variable within the container by automagically converting inputs into env variables. This actually may not be a big concern, especially not if this is a well documented feature. If all values were accessible either through the WDL semantics (~{foo}) AND their shell semantics ('${foo}`) it would probably be the most flexibly and lowest impact to the user.

Sep 23 '22 13:09 patmagee

Thanks for the further clarification on the security impact. I am in favour of adding this feature and I hope we can find an ergonomic way to implement this.

If all values were accessible either through the WDL semantics (~{foo}) AND their shell semantics ('${foo}`) it would probably be the most flexibly and lowest impact to the user.

That is actually a quite interesting thought. What if we moved the env keyword to the interpolation in the command section rather than the input section? So we can do stuff like: --foo ~{env bar} which will be automatically turned into --foo $WDL_VAR_UUIDSTUFF_BAR by the engine (or something like that)? That would keep the inputs section clean and allow a more targeted approach when it comes to environment variables in the command templates. To me that is preferable to mixing shell and wdl semantics, but YMMV.

Sep 23 '22 14:09 rhpvorderman

If all values were accessible either through the WDL semantics (~{foo}) AND their shell semantics ('${foo}`) it would probably be the most flexibly and lowest impact to the user.

This is really interesting, being able to interact with the shell and WDL versions of the variable seems powerful.

Reading through the comments here, I think it would be a lot cleaner to automatically make all inputs available as env variables rather than declaring them explicitly using a keyword or block, but I can see where this could cause headaches potentially overwriting existing variables.

Nov 19 '22 13:11 hkeward

It is possible we are over estimating the likelihood of someone defining variables that will override the shell equivalent. The big one in my mind is PATH, but I am sure there are other variables which could easily be overridden.

That being said, it is possible that behaviour could also become desirable. For example if you were using gcloud in a command, you could simply set an input called File GOOGLE_APPLICATION_CREDENTIALS and your gcloud would automatically authenticate with the right credentials.

If we went with the approach of making every input accessible as an env variable, we may want to have a way to exclude an input from being converted

Nov 19 '22 13:11 patmagee

(My 2p) If we were starting WDL over again from scratch, then I'd certainly make environment variables the default...but today I think the feature has enough practical utility to introduce in a backwards-compatible way in WDL 1.x...such as env String, EnvString, etc.

Nov 20 '22 21:11 mlin

Prototyping the env declaration prefix: https://github.com/chanzuckerberg/miniwdl/pull/617

Nov 21 '22 00:11 mlin

@mlin i just say, I marvel at how little code it takes to add a new feature to miniwdl... and 1/2 of the code is for the test!

Nov 21 '22 00:11 patmagee

Included this feature in miniwdl v1.8.0 version development; def invite anyone interested to hammer on it, can still change.

Nov 25 '22 10:11 mlin

@mlin I tried out this feature, and I have to say, really love it as defined. Honestly I am 100% happy with it :D

Dec 22 '22 15:12 patmagee

I think this is ready for final review, and if everything looks good we can get it voted on

Dec 22 '22 16:12 patmagee

wdl wdl copied to clipboard

Support Defining Environment Variables Within a Task

Declare Environment Variables within a Task

Checklist

Option 1: Keep the block in a task but require values to be set

Option 2: Use a declared type

Option 3: add some sort of qualifier to the type

wdl
wdl copied to clipboard