
Workflow failure: "Workflow is making no progress but has the following unstarted job keys"

Open freeseek opened this issue 3 years ago • 3 comments

After running a large workflow on GCS with ~2,500 tasks, rather than the workflow transitioning from running to success, I received the following error:

  "status": "Failed",
  "failures": [
    {
      "message": "Workflow is making no progress but has the following unstarted job keys: \nScatterCollectorKey_PortBasedGraphOutputNode_xxx.yyy:-1:1\nConditionalCollectorKey_PortBasedGraphOutputNode_xxx.yyy:-1:1",
      "causedBy": []
    }
  ],
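
For anyone triaging the same failure, here is a minimal sketch that pulls the stuck job keys out of a failure entry like the one above. The enclosing braces are added only so that the fragment parses as JSON. The helper names `unstarted_job_keys` and `split_key` are hypothetical. Reading the two trailing numbers as scatter index and attempt is an assumption inferred from the message layout, not something confirmed in this thread.

```python
import json

# Failure entry copied from this report; the outer braces are added so it parses.
metadata_text = r'''
{
  "status": "Failed",
  "failures": [
    {
      "message": "Workflow is making no progress but has the following unstarted job keys: \nScatterCollectorKey_PortBasedGraphOutputNode_xxx.yyy:-1:1\nConditionalCollectorKey_PortBasedGraphOutputNode_xxx.yyy:-1:1",
      "causedBy": []
    }
  ]
}
'''

# Text that precedes the list of keys in the failure message.
MARKER = "unstarted job keys:"


def unstarted_job_keys(metadata):
    """Return the job keys listed in any 'making no progress' failure message."""
    keys = []
    for failure in metadata.get("failures", []):
        message = failure.get("message", "")
        if MARKER not in message:
            continue
        # Everything after the marker is one key per line.
        _, _, tail = message.partition(MARKER)
        keys.extend(line.strip() for line in tail.splitlines() if line.strip())
    return keys


def split_key(key):
    """Split a key into its parts.

    Assumption: the layout is '<kind>_<node>:<index>:<attempt>', inferred
    from the message above rather than taken from Cromwell documentation.
    """
    name, index, attempt = key.rsplit(":", 2)
    kind, _, node = name.partition("_")
    return {"kind": kind, "node": node, "index": int(index), "attempt": int(attempt)}


keys = unstarted_job_keys(json.loads(metadata_text))
for key in keys:
    print(split_key(key))
```

In this report both stuck keys are collector keys for the same optional output, xxx.yyy, which suggests the scatter and conditional collectors for that output never ran.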

The xxx.yyy output variable is from a task being scattered and defined as follows:

task xxx {
  ...
  output {
    ...
    File? yyy = if defined(zzz) then ... else None
  }
}

With zzz not defined.

Despite the error, the job seemed to have completed successfully. However, the files were not moved into the final_workflow_outputs_dir as they should have been, which is an unwelcome inconvenience.

This problem was also reported about six months ago in the Terra forum.

The job ran with call caching enabled, but the cache held no entries before the job started. The only notable event was that at some point Cromwell crashed due to high memory demand (while trying to retrieve the metadata for the workflow) but, after I restarted it, the workflow proceeded without issues. The workflow is written in WDL version development, as is evident from the use of the None keyword.

freeseek avatar Mar 31 '21 02:03 freeseek

I've also been running into the same issue. Did you ever find a workaround?

timchu90 avatar May 24 '22 14:05 timchu90

@timchu90 I can't speak for OP, but last I heard, Cromwell supports WDL version 1.0 best. Try running a workflow with only version 1.0 syntax.

aofarrel avatar Jul 07 '22 19:07 aofarrel

We've also recently encountered this issue on large workflows using WDL 1.0. Same behavior as Giulio reported above: all of the outputs are present in their respective execution buckets but are never moved to the output bucket as the workflow reports status Failed despite all tasks succeeding.

RCollins13 avatar Aug 03 '22 14:08 RCollins13