
toil-wdl-runner does not respect resource values below Toil defaults

Open stxue1 opened this issue 1 year ago • 0 comments

If a WDL workflow has

runtime {
    memory: "1 GB"
}

toil-wdl-runner will try to start a worker with 2147483648 bytes (2 GiB, the Toil default) of memory instead of the requested 1 GB. For example, on Mesos/Kubernetes, this can cause the workflow to hang if there aren't enough resources:

[2024-02-09T17:17:57+0000] [Process IO] [D] [toil.batchSystems.mesos.batchSystem] Offer eeb24943-43fd-443b-8acc-1212b0229df1-O6 not suitable to run the tasks with requirements {'wallTime': 0, 'memory': 2147483648, 'cores': 1, 'disk': 10000000000, 'preemptible': False}. Mesos offered 1030750208.0 memory, 1.0 cores and 42561699840.0 of disk on a non-preemptible agent.
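For reference, the requested 2147483648 bytes is exactly 2 GiB (Toil's default memory requirement), not the 1 GB (10^9 bytes) the workflow asked for. A hypothetical parser illustrating the unit math (names and structure are illustrative, not Toil's actual API):

```python
# Illustrative only: shows why "1 GB" should be 10**9 bytes,
# while the observed request equals Toil's 2 GiB default.
UNITS = {
    "B": 1,
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,      # decimal units
    "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40,   # binary units
}

def parse_memory(spec):
    """Parse a WDL-style memory string like '1 GB' into bytes."""
    value, unit = spec.split()
    return int(float(value) * UNITS[unit])

print(parse_memory("1 GB"))    # 1000000000 -- what the workflow requested
print(parse_memory("2 GiB"))   # 2147483648 -- what actually gets requested
```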

Or, when there are enough resources, the job still requests more memory than the workflow asked for:

[2024-02-09T18:30:36+0000] [MainThread] [D] [toil.fileStores.cachingFileStore] Actually running job ('WDLWorkflowNodeJob' call-indexReference 42db2848-3d68-414f-bca6-997fa27a53d4 v5) with ID (42db2848-3d68-414f-bca6-997fa27a53d4) which wants 2147483648 of our 42189520896 bytes.

It seems like toil-wdl-runner only respects values above the defaults: changing the defaults to a lower number results in the right amount of memory being requested.
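That behavior is consistent with the default acting as a floor on the runtime value, e.g. via a max(), rather than as a fallback used only when the workflow specifies nothing. A sketch of the suspected vs. expected logic (not Toil's actual code; function names are made up):

```python
TOIL_DEFAULT_MEMORY = 2 * 1024**3  # 2 GiB, matching the observed 2147483648

def effective_memory_suspected(requested):
    # Suspected behavior: the default acts as a floor, so any
    # request below it is silently raised to the default.
    return max(requested or 0, TOIL_DEFAULT_MEMORY)

def effective_memory_expected(requested):
    # Expected behavior: the default is used only when the
    # workflow does not specify a value at all.
    return requested if requested is not None else TOIL_DEFAULT_MEMORY

one_gb = 10**9
print(effective_memory_suspected(one_gb))  # 2147483648 (observed)
print(effective_memory_expected(one_gb))   # 1000000000 (desired)
```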


Looking at the logs, it seems like toil-wdl-runner first creates a WDLRootJob:

[2024-02-09T10:19:27-0800] [Thread-3 (statsAndLoggingAggregator)] [D] [toil.statsAndLogging] Log from job "WDLRootJob" follows:

which requests the default amount of resources (the disk log line is shown here instead of memory, as there is no equivalent log line for memory):

	[2024-02-09T10:19:22-0800] [MainThread] [D] [toil.fileStores.cachingFileStore] Total job disk requirement size: 2147483648

and then tries to chain to the next job with its requested amount of resources:

[2024-02-09T10:19:22-0800] [MainThread] [I] [toil.worker] Chaining from 'WDLWorkflowNodeJob' call-indexReference kind-WDLRootJob/instance-28lqv2w8 v6 to 'WDLTaskJob' testWorkflow.indexReference kind-WDLTaskJob/instance-ax6us87q v1

and, I think, queues instead of chains if more resources are needed:

	[2024-02-09T17:59:15+0000] [MainThread] [D] [toil.worker] We need more memory for the next job, so finishing

Maybe the issue is similar to this, in the sense that instead of reading the requested resource values up front, toil-wdl-runner always starts an initial job with the default amount of resources and only increases from there as necessary?
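The chain-vs-queue decision in the logs reads as if a worker reuses its current allocation only when the next job fits inside it, which would explain why a below-default request chains (and inherits the default 2 GiB) while a larger one is queued. A sketch of that suspected decision (not toil.worker's actual code):

```python
def can_chain(current_allocation, next_requirements):
    """A worker can chain to the next job only if every requirement
    fits within the allocation it already holds; otherwise it
    finishes and the job goes back to the queue."""
    return all(next_requirements[k] <= current_allocation[k]
               for k in ("memory", "cores", "disk"))

current = {"memory": 2147483648, "cores": 1, "disk": 10**10}
# A 1 GB job fits inside the default 2 GiB allocation, so it chains
# (and effectively runs with 2 GiB)...
print(can_chain(current, {"memory": 10**9, "cores": 1, "disk": 10**10}))        # True
# ...but a 4 GiB job does not fit, so the worker finishes instead.
print(can_chain(current, {"memory": 4 * 2**30, "cores": 1, "disk": 10**10}))    # False
```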

Issue is synchronized with this Jira Story · Issue Number: TOIL-1496

stxue1 · Feb 09 '24 18:02