Workflow failure: "Workflow is making no progress but has the following unstarted job keys"
I ran a large workflow (~2,500 tasks) on GCS. Instead of transitioning from running to succeeded, the workflow ended with the following error:
"status": "Failed",
"failures": [
{
"message": "Workflow is making no progress but has the following unstarted job keys: \nScatterCollectorKey_PortBasedGraphOutputNode_xxx.yyy:-1:1\nConditionalCollectorKey_PortBasedGraphOutputNode_xxx.yyy:-1:1",
"causedBy": []
}
],
The xxx.yyy output variable comes from a scattered task and is defined as follows:
task xxx {
  ...
  output {
    ...
    File? yyy = if defined(zzz) then ... else None
  }
}
Here, zzz is not defined.
Despite the error, the job seemed to have completed successfully. However, the files were not moved into the final_workflow_outputs_dir as they were supposed to be, which was an unwelcome inconvenience.
This problem was also reported about six months ago in the Terra forum.
The job ran with CallCaching activated, but the cache had no entries before the job started. The only notable event was that at some point Cromwell crashed due to high memory demand (while trying to retrieve the metadata for the workflow); after I restarted it, the workflow proceeded without issues. The workflow is written in WDL version development, as can be seen from the use of the None keyword.
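For anyone triaging the same failure, here is a minimal Python sketch that pulls the unstarted job keys out of a metadata response shaped like the fragment above. The helper name `unstarted_job_keys` is hypothetical (it is not part of Cromwell), and the sample document only reproduces the fields shown in this report, so treat it as an illustration rather than a description of the full metadata format.

```python
import json
from typing import List

# Message prefix copied from the failure reported above.
PREFIX = "Workflow is making no progress but has the following unstarted job keys:"


def unstarted_job_keys(metadata: dict) -> List[str]:
    """Return the job keys listed in any 'making no progress' failure message.

    Hypothetical helper: it assumes the metadata has the 'failures' list of
    objects with a 'message' string, as in the fragment quoted in this report.
    """
    keys: List[str] = []
    for failure in metadata.get("failures", []):
        message = failure.get("message", "")
        if message.startswith(PREFIX):
            # The keys follow the prefix, one per line.
            rest = message[len(PREFIX):]
            keys.extend(line.strip() for line in rest.splitlines() if line.strip())
    return keys


# Sample input rebuilt from the fragment in this report (placeholders xxx/yyy kept as-is).
metadata = json.loads(r'''
{
  "status": "Failed",
  "failures": [
    {
      "message": "Workflow is making no progress but has the following unstarted job keys: \nScatterCollectorKey_PortBasedGraphOutputNode_xxx.yyy:-1:1\nConditionalCollectorKey_PortBasedGraphOutputNode_xxx.yyy:-1:1",
      "causedBy": []
    }
  ]
}
''')

for key in unstarted_job_keys(metadata):
    print(key)
```

Listing the keys this way makes it easier to see which output nodes (here the scatter and conditional collectors for xxx.yyy) never started, which is useful when comparing reports across workflows.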
I've also been running into the same issue. Did you ever find a workaround?
@timchu90 I can't speak for OP, but last I heard, Cromwell supports WDL version 1.0 best. Try running a workflow with only version 1.0 syntax.
We've also recently encountered this issue on large workflows using WDL 1.0. Same behavior as Giulio reported above: all of the outputs are present in their respective execution buckets but are never moved to the output bucket, because the workflow reports status Failed despite all tasks succeeding.