wdl icon indicating copy to clipboard operation
wdl copied to clipboard

Proposal: `interpreter` runtime attribute for non-bash tasks

Open jdidion opened this issue 3 years ago • 8 comments

Also see discussions here: https://github.com/openwdl/wdl/discussions/408

Currently, it is specified that the task command block is executed by a bash interpreter. Non-bash languages are supported either by heredocs or by putting the commands in an external script and calling that script from the command block.

This proposal is to add the optional interpreter attribute to the runtime section. This keyword will be used to specify the interpreter program (name, version, etc) used to execute the command block. This information can be used by the WDL runtime in several ways:

  • Execute non-bash scripts without the need for heredocs or external scripts
  • Pre-check to verify that the specified interpreter is available in the runtime environment (container or host system) before trying to execute the commands
  • Choose a default runtime environment (container, virtualenv, etc) when a container is not specified

Example:

task python_cat_file {
  input {
    File infile
  }

  command <<<
  import sys
  path = "~{infile}"
  print(f"reading from file {path}", file=sys.stderr)
  with open(path, "r") as i:
    print(i.read())
  >>>

  output {
    String content = read_string(stdout())
  }

  runtime {
    interpreter: {
      name: "python",
      version: "^3.6.0",
      variant: "pypy"
    }
  }
}

The interpreter keyword can take one of the following values:

  • String with the interpreter name (e.g. "python") and optional version constraint.
    runtime {
      interpreter: "python ^3.6.0"
    }
    
    • The interpreter name is case-insensitive. It is suggested to use all lowercase.
    • For a programming language with multiple interpreters available, the language name (e.g. "python") is an alias for the reference/default interpreter variant (e.g. "cpython").
    • If the version is not specified, the default version is used. The default version may be defined in the WDL specification, or it may be specified by the runtime.
  • Array[String] with multiple interpreter variants, in order of preference.
    runtime {
      interpreter: ["cpython ^3.6.0", "PyPy3.6 ^7.3.0"]
    }
    
  • Object with the following keys:
    • name: The interpreter name. Required.
    • version: An optional version constraint.
    • variant: An optional interpreter variant.
    • args: An optional Array[String] with arguments to pass to the interpreter when executing the command. The runtime may specify interpreter arguments that are always used and/or not allowed - any arguments specified by the args keyword are merged with the required arguments, and an error is raised if any of the user-specified arguments are disallowed.
    runtime {
      interpreter: {
        name: "bash",
        version: "^5.0",
        args: ["--posix"]
      }
    }
    
  • Array[Object] with multiple Object-style interpreter variants.

Additional rules:

  • If no interpreter keyword is specified, the default value is "bash".
  • If the interpreter keyword is specified, the host system must provided an interpreter that is compatible with one of the user's specifications, or raise an error if no compatible interpreters are available.
  • If the user specifies the container runtime attribute, then one of the specified interpreters must be available in that container.
    • If container specifies an array of containers, then the runtime may (but is not required to) use the availability of one of the specified interpreters within the container as a criteria in selecting which of the containers to use.
  • If the user does not specify the container runtime attribute then the runtime may try either or both of the following, in any order. In either case, no other dependencies are guaranteed to exist aside from the interpreter.
    • Execute the commands on the host system by searching for a compatible interpreter
    • Execute the commands in a container of the runtime's choice that provides one of the specified interpreters

Also see discussion here: https://github.com/openwdl/wdl/discussions/408

jdidion avatar Oct 15 '20 19:10 jdidion

I think it is a good addition, but very complex.

What about just using

interpreter: python

Versions and such should be mediated by the container used right? If you set a container a wrong version of python cannot be used. Alternatively python also sets a versioned executable on path i.e. python3.7 so also with containers there should be no issue.

I think the version and interpreter stuff will be very hard to implement given all the edge case and all interpreters out there. Just giving the executable should be much easier.

On the other hand if only interpreter: <interpreter> is added, this proposal does not have much value over the heredoc syntax.

rhpvorderman avatar Nov 10 '20 09:11 rhpvorderman

The things that specifying the interpreter details offers is 1) the ability for the engine to provide a container with a matching interpreter if the user doesn't provide their own container, and/or 2) the ability for the engine to check that an appropriate interpreter exists in the runtime environment. Without those details, the engine can still provide a container/check that the interpreter exists, but it can't deal with the situation where a specific minimum/maximum interpreter is required (i.e. some python code written for v3.6+ won't run on a v3.5 interpreter).

We could get rid of variant and just say that if the user wants some interpreter other than the default one for a given language, they need to supply their own container. Then we can go to interpreter having a string value with an optional version constraint.

jdidion avatar Nov 10 '20 16:11 jdidion

@jdidion thanks for putting this together. I think this would definitely be a use-ability improvement for a lot of tasks that do not depend on the bash interpreter. In the first proposal, I would generally agree with @rhpvorderman that it's quite a complex addition to the task definition in its current iteration (although I would agree, that it would be the most flexible).

I wonder if its possible to simplify this even further and strip out all of the version info and offload that to the container. defining version: 3.7 or version: ^3.7 is irrelevant if the version does not actually exist in the container itself. This would suggestion doing what @rhpvorderman is suggestion with interpreter: python, however this might not be powerful enough.

I would propose we combine both of your suggestions to allow defining the interpreter in two ways:

runtime {
  # Simple definition. Expected behaviour would be executing the code with the default args for that interpreter 
  # according to the engine. 
  interpreter: "python"
}

runtime {
  # Complex definition. modify the default behaviour of the engine
  interpreter: {
    "name": "python",
    "args": [ "...","..."]
  }
}

A question that this brings up is should we define expected behaviour for a set of standard interpreters in WDL directly (such as the flags that those interpreters should be started with)

patmagee avatar Apr 14 '21 16:04 patmagee

So I had the opportunity to get into using a bit of nextflow. How they solve this problem is pretty simple, they simply use a shebang line! We could emulate that behaviour. The default behaviour would still be to use bash if a shebang is not defined.


task a {
   command <<<
     #!/usr/bin/env python
     print("helllo")
   >>>
}

patmagee avatar May 28 '21 13:05 patmagee

Isn't this already documented?

task a {
    command <<<
        python <<CODE
            print('"hello")
        CODE
    >>>
}

The SPEC is literally full of these sorts of examples.

EDIT: Like I said above:

On the other hand if only interpreter: is added, this proposal does not have much value over the heredoc syntax.

Let the interpreter version be handled by the container you are running. python:3.7-slim for instance.

rhpvorderman avatar May 28 '21 13:05 rhpvorderman

@rhpvorderman this is slightly different. That is calling the python command using a heredoc, but the underlyng script is still treated as a bash script. Whereas the shebang is indicating that is should be treated as a python script

patmagee avatar May 28 '21 13:05 patmagee

Could you explain why the heredoc syntax would not work in this case?

I am not in favour of adding extra syntax and complexity to WDL if there is not a clear benefit.

On all these issues I tend to say: make sure you have some sort of container, so you have full control of the interpreter. And either:

  • Run the script in heredoc
  • If that is unwieldly, you probably created a tool. So then make it a proper tool with tests and such and then create a container so you can use one command instead of a script.

If the "only bash" policy forces users to make proper tools instead of making WDLs full of inline lisp scripts, I'd say that is a win for WDL as a language. I come from a snakemake background, and that was truly awful. Users were granted so much freedom that no two snakemake pipelines looked alike. Hence it was impossible to learn. Everyone rolled their own custom solution to the same problems. So I am very conservative when it comes to adding stuff to WDL that even remotely hints at allowing the same freedom.

I get @jdidion 's argument that the execution engine can solve which interpreter is needed. But I think that is already handled by containers. If you run your job outside of a container that is bad practice, and there should not be any syntax to make it seem like running outside a container is a viable option.

rhpvorderman avatar May 28 '21 14:05 rhpvorderman

I agree with @rhpvorderman 's last comment. In the WDL I've worked on the current behavior hasn't been a big enough obstacle to require a change to the WDL language.

I kind of like @patmagee 's suggestion to support a shebang line, but one downside I can see is that it requires the WDL runtime to read the first line of the command block and interpret it, possibly using string interpolation. Placing it outside the block would make it more clearer that the WDL runtime will handle it specially and that it's not really part of the command script.

pshapiro4broad avatar Jun 01 '21 13:06 pshapiro4broad