Undefined behavior when turning coerced optional `File?` to null + Clarification about where String to File coercion takes place

Open stxue1 opened this issue 4 days ago • 0 comments

I originally encountered these issues at https://github.com/chanzuckerberg/miniwdl/issues/696.

One thing the WDL spec is vague about is where a task coerces a String to a File. The spec says that all non-output declarations must be evaluated prior to the command section. My understanding is therefore that output declarations are evaluated in a different directory than the rest of the task: the non-output declarations appear to be evaluated in the current directory on the host machine, while the output section is evaluated in the current directory inside the container.

For example:

task test {
  input {
    File f_input = "test.txt"
  }
  command <<<printf "hello" > test.txt>>>
  File f_body = "test.txt"
  output {
    File f_output = "test.txt"
  }
}

Assuming all files exist, f_input and f_body are implicitly assumed to point to a file on the host machine, while f_output points to the file inside the container. Maybe this should be clarified in the spec, as it is not immediately obvious.
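To make the ambiguity concrete, here is a minimal Python sketch of the resolution behavior described above. It only models my reading of the issue, not anything the spec mandates; the two directory paths and the `resolve` helper are hypothetical names invented for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical directories, chosen only for illustration.
HOST_CWD = PurePosixPath("/home/user/run")           # where the runner was launched
CONTAINER_WORKDIR = PurePosixPath("/mnt/task/work")  # the task's working directory

def resolve(relative: str, base: PurePosixPath) -> PurePosixPath:
    """Resolve a relative path string against a base directory."""
    return base / relative

# Input and body declarations: assumed to resolve against the host directory.
f_input = resolve("test.txt", HOST_CWD)
f_body = resolve("test.txt", HOST_CWD)

# Output declaration: assumed to resolve against the container working directory.
f_output = resolve("test.txt", CONTAINER_WORKDIR)

print(f_input, f_body, f_output)
```

The same string literal ends up naming two different files depending on which section it is declared in, which is the point that seems worth stating explicitly in the spec.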

Another issue that arose while experimenting with miniwdl is that coerced optional files can behave inconsistently.

Given the WDL workflow:

version 1.1
workflow testWorkflow {
  input {
  }
  call testTask
  output {
    Array[File?] array_in_output = testTask.array_in_output
    Int len_in_output = testTask.len_in_output
    Array[File?] array_in_body_out = testTask.array_in_body_out
    Int len_in_body_out = testTask.len_in_body_out
    Array[File?] array_in_input_out = testTask.array_in_input_out
    Int len_in_input_out = testTask.len_in_input_out
  }
}

task testTask {
  input {
    Array[File?] array_in_input = ["example1.txt", "example2.txt"]
    Int len_in_input = length(select_all(array_in_input))
  }
  command <<<>>>
  Array[File?] array_in_body = ["example1.txt", "example2.txt"]
  Int len_in_body = length(select_all(array_in_body))
  output {
    Array[File?] array_in_output = ["example1.txt", "example2.txt"]
    Int len_in_output = length(select_all(array_in_output))
    Array[File?] array_in_body_out = array_in_body
    Int len_in_body_out = len_in_body
    Array[File?] array_in_input_out = array_in_input
    Int len_in_input_out = len_in_input
  }
}

The spec says that an optional file (`File?`) in a task's outputs is coerced to null when the file does not exist.

First, is there a reason why this is limited to task outputs and does not apply to workflow outputs?

Second, because the spec says this null coercion is applied at the output step, and given that the files example1.txt and example2.txt don't exist, the presumably correct output for the WDL workflow above is:

{
  "dir": "/home/heaucques/Documents/wdl-conformance-tests/20240626_184902_testWorkflow",
  "outputs": {
    "testWorkflow.array_in_body_out": [
      null,
      null
    ],
    "testWorkflow.array_in_input_out": [
      null,
      null
    ],
    "testWorkflow.array_in_output": [
      null,
      null
    ],
    "testWorkflow.len_in_body_out": 2,
    "testWorkflow.len_in_input_out": 2,
    "testWorkflow.len_in_output": 0
  }
}

Because the null coercion happens at the task output, the select_all calls return different values depending on which section they appear in. In the input section and the task body, select_all returns ["example1.txt", "example2.txt"], giving a length of 2. In the output section, the array has already been coerced to [null, null], so select_all returns an empty array, giving a length of 0. This can be counterintuitive, since one may expect that a nonexistent file is never counted by select_all. Is this the expected behavior, and if not, what should the expected behavior be?
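The evaluation order I am assuming can be modeled with a short Python sketch. This is not miniwdl's implementation or the spec's definition; `coerce_optional_files` is a hypothetical helper standing in for the output-stage coercion, and `select_all` is a plain reimplementation of the WDL function of the same name.

```python
from typing import List, Optional, Set

def select_all(values: List[Optional[str]]) -> List[str]:
    """Model of WDL select_all: drop null (None) elements."""
    return [v for v in values if v is not None]

def coerce_optional_files(paths: List[Optional[str]],
                          existing: Set[str]) -> List[Optional[str]]:
    """Hypothetical output-stage coercion: a File? that does not exist becomes None."""
    return [p if p in existing else None for p in paths]

existing_files: Set[str] = set()  # neither example1.txt nor example2.txt exists
declared: List[Optional[str]] = ["example1.txt", "example2.txt"]

# Input section and task body: assumed to coerce String to File? without
# checking existence, so select_all still sees two non-null values.
len_in_input = len(select_all(declared))
len_in_body = len(select_all(declared))

# Output section: coercion is applied before select_all is evaluated.
array_in_output = coerce_optional_files(declared, existing_files)
len_in_output = len(select_all(array_in_output))

# Pass-through outputs are nulled too, but the lengths were computed earlier.
array_in_body_out = coerce_optional_files(declared, existing_files)
array_in_input_out = coerce_optional_files(declared, existing_files)
len_in_body_out = len_in_body
len_in_input_out = len_in_input
```

Under these assumptions the arrays agree across all three outputs while the lengths do not, which matches the JSON shown above.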

stxue1, Jul 02 '24 20:07