cylc-flow
High CPU usage
The workflow below is a much simplified version of a real case. Running this workflow results in the scheduler constantly using 100% CPU until stopped. Tested on 7.8.4 & master.
The CPU usage reduces if you reduce the number of DATES or the max active cycle points.
The problem also seems to disappear if you remove the [[[R1/{{date}}+PT36H]]] graph section (which I find surprising).
```
#!jinja2
{% set DATES = ['20170712T0000Z', '20170725T0000Z', '20170729T0000Z', '20170804T0000Z', '20170809T0000Z', '20170812T0000Z', '20170818T0000Z', '20170823T0000Z', '20170902T0000Z', '20170919T0000Z', '20170923T0000Z', '20170930T0000Z', '20171011T0000Z', '20171020T0000Z', '20171028T0000Z', '20171101T0000Z', '20171104T0000Z', '20171115T0000Z', '20171120T0000Z', '20171129T0000Z', '20171207T0000Z', '20171215T0000Z', '20171220T0000Z', '20171224T0000Z', '20171227T0000Z', '20180104T0000Z', '20180109T0000Z', '20180120T0000Z', '20180129T0000Z', '20180206T0000Z', '20180209T0000Z', '20180214T0000Z', '20180217T0000Z', '20180302T0000Z', '20180310T0000Z', '20180313T0000Z', '20180318T0000Z', '20180322T0000Z', '20180325T0000Z', '20180328T0000Z', '20180402T0000Z', '20180408T0000Z', '20180411T0000Z', '20180415T0000Z', '20180418T0000Z', '20180421T0000Z', '20180426T0000Z', '20180429T0000Z', '20180503T0000Z', '20180506T0000Z', '20180510T0000Z', '20180513T0000Z', '20180519T0000Z', '20180522T0000Z', '20180527T0000Z', '20180603T0000Z', '20180607T0000Z', '20180610T0000Z', '20180613T0000Z', '20180619T0000Z', '20180706T0000Z', '20180711T0000Z', '20180714T0000Z', '20180718T0000Z', '20180721T0000Z', '20180726T0000Z', '20180802T0000Z', '20180809T0000Z', '20180813T0000Z', '20180816T0000Z', '20180821T0000Z', '20180829T0000Z', '20180904T0000Z', '20180907T0000Z', '20180910T0000Z', '20181004T0000Z', '20181013T0000Z', '20181019T0000Z', '20181023T0000Z', '20181027T0000Z', '20181030T0000Z', '20181104T0000Z', '20181107T0000Z', '20181110T0000Z', '20181115T0000Z', '20181118T0000Z', '20181122T0000Z', '20181128T0000Z', '20181203T0000Z', '20181206T0000Z', '20181210T0000Z', '20181215T0000Z', '20181219T0000Z', '20181224T0000Z', '20181228T0000Z', '20190102T0000Z', '20190109T0000Z', '20190112T0000Z', '20190119T0000Z', '20190124T0000Z', '20190131T0000Z', '20190208T0000Z', '20190211T0000Z', '20190214T0000Z', '20190217T0000Z', '20190222T0000Z', '20190226T0000Z', '20190301T0000Z', '20190308T0000Z', '20190316T0000Z',
'20190325T0000Z', '20190329T0000Z', '20190404T0000Z', '20190408T0000Z', '20190411T0000Z', '20190419T0000Z', '20190422T0000Z', '20190426T0000Z', '20190429T0000Z', '20190503T0000Z', '20190507T0000Z', '20190510T0000Z', '20190513T0000Z', '20190521T0000Z', '20190524T0000Z', '20190527T0000Z', '20190530T0000Z', '20190602T0000Z', '20190605T0000Z', '20190608T0000Z', '20190612T0000Z', '20190617T0000Z', '20190623T0000Z', '20190626T0000Z', '20190629T0000Z', '20190702T0000Z', '20190709T0000Z'] %}
[cylc]
    UTC mode = True
[scheduling]
    initial cycle point = {{DATES|sort|first}}
    final cycle point = {{DATES|sort|last}}
    max active cycle points = 6
    [[dependencies]]
        [[[ R1 ]]]
            graph = "install_cold"
{%- for date in DATES|sort %}
        [[[ R1/{{date}} ]]]
            graph = "install_cold[^] => fcst => process"
        [[[ R36/{{date}}+PT1H/PT1H ! (T01,T02,T03,T04,T05,T07,T08,T09,T10,T11,T13,T14,T15,T16,T17,T19,T20,T21,T22,T23) ]]]
            graph = "process[-PT1H] => process"
        [[[ R36/{{date}}+PT1H/PT1H ! (T00, T06, T12, T18) ]]]
            graph = "process[-PT1H] => process"
        [[[ R1/{{date}}+PT36H ]]]
            graph = "process => housekeep"
{%- endfor %}
[runtime]
    [[install_cold, fcst, process, housekeep]]
        script = "false"
```
@dpmatthews: note, until we remove the Cylc 7 machinery (i.e. state_summary_mgr), or manage to disable it, we can't gauge the extent of the load at Cylc 8.
@hjoliver I wonder how SoD will handle this workflow.
This suite (with 2 tasks per cycle) also takes a long time to validate: 15s on my laptop, vs 2s for the 1000-task-per-cycle complex suite.
Basically it generates a pathological number of graph recurrences (550?) that presumably require loads of ongoing isodatetime computation. It is odd that the wheels are still turning while the suite is stalled, though, so it is probably worth profiling to see exactly what is going on there, in case we are doing any unnecessary computation.
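For reference, a back-of-envelope count of those recurrences (my own sketch, not anything Cylc computes; the "4 sections per date" figure just reads off the loop body above):

```python
# Back-of-envelope count of the graph sections the Jinja2 loop expands to.
# Each date contributes 4 sections (R1/{{date}}, the two R36/... sections
# with exclusions, and R1/{{date}}+PT36H), plus the single initial R1
# section outside the loop.

def count_sections(n_dates, sections_per_date=4):
    return 1 + sections_per_date * n_dates

# The DATES list above holds 137 entries:
print(count_sections(137))  # -> 549, consistent with the "550?" estimate
```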
@dwsutherland - I'd expect the load to be similar on master if this is all about the isodatetime computations. The old state_summary bits shouldn't make much difference. And I don't think SoD will make much difference here as there are very few tasks in the suite.
Is this still a problem?
It probably is still a problem. We need to do some profiling.
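A quick way to start on that profiling, sketched with the standard library's cProfile against a stand-in workload (the function below is illustrative; for real numbers you would attach the profiler to an actual scheduler or validate run instead):

```python
import cProfile
import io
import pstats
from datetime import datetime, timedelta

def iterate_recurrences(n_recurrences=550, points_per_recurrence=36):
    """Stand-in workload: walk many small datetime sequences, roughly
    mimicking a scheduler iterating hundreds of graph recurrences."""
    total = 0
    for i in range(n_recurrences):
        point = datetime(2017, 7, 12) + timedelta(hours=i)
        for _ in range(points_per_recurrence):
            point += timedelta(hours=1)
            total += 1
    return total

profiler = cProfile.Profile()
profiler.enable()
iterate_recurrences()
profiler.disable()

# Report the 5 most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```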
The workflow results in a lot of different recurrences (note: each TXX exclusion is a separate sequence). These get iterated over a fair bit.
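To illustrate why that multiplies the work, here is a toy model (not Cylc's actual data structures): if every membership test walks the exclusion list, the per-point cost grows with recurrences × exclusions.

```python
from datetime import datetime, timedelta

# Toy recurrence with exclusions. In this sketch each excluded hour-of-day
# acts like its own sequence that must be consulted on every candidate
# point, mirroring how per-exclusion sequences add ongoing work.

class ToyRecurrence:
    def __init__(self, start, step_hours, count, excluded_hours):
        self.start = start
        self.step = timedelta(hours=step_hours)
        self.count = count
        self.exclusions = list(excluded_hours)  # one "sequence" per hour

    def points(self):
        point = self.start
        emitted = 0
        while emitted < self.count:
            # Every candidate point is tested against every exclusion.
            if point.hour not in self.exclusions:
                yield point
                emitted += 1
            point += self.step

# Analogue of R36/...+PT1H/PT1H ! (T00, T06, T12, T18):
r = ToyRecurrence(datetime(2017, 7, 12, 1), 1, 36,
                  excluded_hours=[0, 6, 12, 18])
pts = list(r.points())
print(len(pts), pts[0].isoformat())  # -> 36 2017-07-12T01:00:00
```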
Cylc used to scale very badly with the number of recurrences; here's an old PR which improved the situation, and the tests might be helpful: https://github.com/cylc/cylc-flow/pull/2322