
High CPU usage

Open dpmatthews opened this issue 4 years ago • 5 comments

The workflow below is a much simplified version of a real case. Running this workflow results in the scheduler constantly using 100% CPU until stopped. Tested on 7.8.4 & master.

The CPU usage reduces if you reduce the number of DATES or the max active cylc points. The problem also seems to disappear if you remove the [[[R1/{{date}}+PT36H]]] graph section (which I find surprising).

```
#!jinja2
{% set DATES = ['20170712T0000Z', '20170725T0000Z', '20170729T0000Z', '20170804T0000Z', '20170809T0000Z', '20170812T0000Z', '20170818T0000Z', '20170823T0000Z', '20170902T0000Z', '20170919T0000Z', '20170923T0000Z', '20170930T0000Z', '20171011T0000Z', '20171020T0000Z', '20171028T0000Z', '20171101T0000Z', '20171104T0000Z', '20171115T0000Z', '20171120T0000Z', '20171129T0000Z', '20171207T0000Z', '20171215T0000Z', '20171220T0000Z', '20171224T0000Z', '20171227T0000Z', '20180104T0000Z', '20180109T0000Z', '20180120T0000Z', '20180129T0000Z', '20180206T0000Z', '20180209T0000Z', '20180214T0000Z', '20180217T0000Z', '20180302T0000Z', '20180310T0000Z', '20180313T0000Z', '20180318T0000Z', '20180322T0000Z', '20180325T0000Z', '20180328T0000Z', '20180402T0000Z', '20180408T0000Z', '20180411T0000Z', '20180415T0000Z', '20180418T0000Z', '20180421T0000Z', '20180426T0000Z', '20180429T0000Z', '20180503T0000Z', '20180506T0000Z', '20180510T0000Z', '20180513T0000Z', '20180519T0000Z', '20180522T0000Z', '20180527T0000Z', '20180603T0000Z', '20180607T0000Z', '20180610T0000Z', '20180613T0000Z', '20180619T0000Z', '20180706T0000Z', '20180711T0000Z', '20180714T0000Z', '20180718T0000Z', '20180721T0000Z', '20180726T0000Z', '20180802T0000Z', '20180809T0000Z', '20180813T0000Z', '20180816T0000Z', '20180821T0000Z', '20180829T0000Z', '20180904T0000Z', '20180907T0000Z', '20180910T0000Z', '20181004T0000Z', '20181013T0000Z', '20181019T0000Z', '20181023T0000Z', '20181027T0000Z', '20181030T0000Z', '20181104T0000Z', '20181107T0000Z', '20181110T0000Z', '20181115T0000Z', '20181118T0000Z', '20181122T0000Z', '20181128T0000Z', '20181203T0000Z', '20181206T0000Z', '20181210T0000Z', '20181215T0000Z', '20181219T0000Z', '20181224T0000Z', '20181228T0000Z', '20190102T0000Z', '20190109T0000Z', '20190112T0000Z', '20190119T0000Z', '20190124T0000Z', '20190131T0000Z', '20190208T0000Z', '20190211T0000Z', '20190214T0000Z', '20190217T0000Z', '20190222T0000Z', '20190226T0000Z', '20190301T0000Z', '20190308T0000Z', '20190316T0000Z',
'20190325T0000Z', '20190329T0000Z', '20190404T0000Z', '20190408T0000Z', '20190411T0000Z', '20190419T0000Z', '20190422T0000Z', '20190426T0000Z', '20190429T0000Z', '20190503T0000Z', '20190507T0000Z', '20190510T0000Z', '20190513T0000Z', '20190521T0000Z', '20190524T0000Z', '20190527T0000Z', '20190530T0000Z', '20190602T0000Z', '20190605T0000Z', '20190608T0000Z', '20190612T0000Z', '20190617T0000Z', '20190623T0000Z', '20190626T0000Z', '20190629T0000Z', '20190702T0000Z', '20190709T0000Z'] %}
[cylc]
    UTC mode = True
[scheduling]
    initial cycle point = {{DATES|sort|first}}
    final cycle point = {{DATES|sort|last}}
    max active cycle points = 6
    [[dependencies]]
        [[[ R1 ]]]
            graph = "install_cold"
{%- for date in DATES|sort %}
        [[[ R1/{{date}} ]]]
            graph = "install_cold[^] => fcst => process"
        [[[ R36/{{date}}+PT1H/PT1H ! (T01,T02,T03,T04,T05,T07,T08,T09,T10,T11,T13,T14,T15,T16,T17,T19,T20,T21,T22,T23) ]]]
            graph = "process[-PT1H] => process"
        [[[ R36/{{date}}+PT1H/PT1H ! (T00, T06, T12, T18) ]]]
            graph = "process[-PT1H] => process"
        [[[ R1/{{date}}+PT36H ]]]
            graph = "process => housekeep"
{%- endfor %}
[runtime]
    [[install_cold, fcst, process, housekeep]]
        script = "false"
```

dpmatthews avatar Mar 06 '20 17:03 dpmatthews

@dpmatthews: note, until we remove the Cylc 7 machinery (i.e. state_summary_mgr), or manage to disable it, we can't gauge the extent of the load on Cylc 8.

@hjoliver I wonder how SoD will handle this workflow.

dwsutherland avatar Mar 07 '20 03:03 dwsutherland

This suite (with 2 tasks per cycle) also takes a long time to validate (15s on my laptop, vs 2s for the 1000-tasks-per-cycle complex suite).

Basically it generates a pathological number of graph recurrences (550?) that presumably require loads of ongoing isodatetime computation. It is odd that the wheels are still turning when the suite is stalled, though, so it is probably worth profiling to see exactly what is going on there, in case we are doing any unnecessary computation.

@dwsutherland - I'd expect the load to be similar on master if this is all about the isodatetime computations. The old state_summary bits shouldn't make much difference. And I don't think SoD will make much difference here, as there are very few tasks in the suite.

hjoliver avatar Mar 10 '20 11:03 hjoliver

Is this still a problem?

dwsutherland avatar Mar 10 '21 00:03 dwsutherland

It probably is still a problem. We need to do some profiling.

hjoliver avatar Mar 10 '21 21:03 hjoliver

The workflow results in a lot of different recurrences (note each TXX exclusion is a separate sequence). These get iterated over a fair bit.

Cylc used to scale very badly with the number of recurrences. Here's an old PR which improved the situation; the tests might be helpful: https://github.com/cylc/cylc-flow/pull/2322
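To illustrate why iterating over recurrences is expensive, and one generic mitigation (not necessarily what that PR did): each membership test against a recurrence costs datetime arithmetic, and a naive scheduler pass repeats it for every sequence, so caching the answer per (sequence, point) pair pays off across passes. A stdlib-only toy (these names are hypothetical, not cylc or isodatetime internals):

```python
# Toy model: many hourly sequences, each asked whether a cycle point lies on
# it. lru_cache memoises the answer so repeated scheduler passes over the
# same (sequence, point) pairs skip the datetime arithmetic.
from datetime import datetime, timedelta
from functools import lru_cache

SEQUENCES = [
    (datetime(2017, 7, 12) + timedelta(hours=i), timedelta(hours=1))
    for i in range(549)  # one toy sequence per recurrence in the example
]

@lru_cache(maxsize=None)
def on_sequence(start: datetime, step: timedelta, point: datetime) -> bool:
    """True if `point` lies on the recurrence start, start+step, ..."""
    if point < start:
        return False
    return (point - start) % step == timedelta(0)

point = datetime(2017, 8, 1, 3)
hits = sum(on_sequence(start, step, point) for start, step in SEQUENCES)
print(hits)  # count of sequences containing this point
```

The cost of the uncached version grows with sequences × points checked per pass, which is the scaling behaviour described above.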

oliver-sanders avatar Aug 04 '22 16:08 oliver-sanders