cylc-uiserver icon indicating copy to clipboard operation
cylc-uiserver copied to clipboard

efficiency: investigate bottleneck

Open oliver-sanders opened this issue 1 year ago • 5 comments

See also: https://github.com/cylc/cylc-ui/issues/1614

This workflow has proven to be remarkably difficult for the UIS & UI to handle:

#!Jinja2

{% set members = 10 %}
{% set hours = 100 %}

[scheduler]
    allow implicit tasks = True

[task parameters]
    member = 0..{{members}}
    fcsthr = 0..{{hours}}
  [[templates]]
    member = member%(member)03d
    fcsthr = _fcsthr%(fcsthr)03d

[scheduling]
  initial cycle point = 2000
  runahead limit = P3
  [[xtriggers]]
    start = wall_clock(offset=PT7H15M)
  [[graph]]
    T00,T06,T12,T18 = """
        @start & prune[-PT6H]:finish => prune & purge
        @start => sniffer:ready_<member,fcsthr> => <member,fcsthr>_process? => finish
        <member,fcsthr>_process:fail? => fault
      """

[runtime]
    [[sniffer]]
        [[[outputs]]]
{% for member in range(0, members + 1) %}
    {% for hour in range(0, hours + 1) %}
            ready_member{{ member | pad(3, 0) }}_fcsthr{{ hour | pad(3, 0) }} = {{ member }}{{ hour }}
    {% endfor %}
{% endfor %}

For more information see: https://cylc.discourse.group/t/slow-load-of-cylc-workflows-disconnects/823/19

Investigation so far has confirmed:

  • The scheduler is not a source of delay.
  • The UIS chokes on the update for several seconds.
    • During this time, updates to other workflows are suspended
  • The UI chokes on the deltas of several seconds.
  • The browser takes a couple of seconds to update.

This issue focuses on the UIS side of things.

Suggested remediation (Flow/UIS only, please update with new suggestions):

  • https://github.com/cylc/cylc-uiserver/pull/548
    • Optimize a routine in the graphql-ws library which was causing ~11s of CPU time.
  • https://github.com/cylc/cylc-uiserver/issues/538
    • Allow updates for other workflows to get through whilst the UIS is choking on a big update.
  • https://github.com/cylc/cylc-uiserver/issues/545
    • Switch to a faster async framework.
  • https://github.com/cylc/cylc-uiserver/issues/333
    • Update our graphql dependencies (currently blocked).
    • This would allow us to move to more recent Python versions and take advantage of improvements in graphene/graphql-core.
    • This requires dealing with some trouble dependencies which is getting quite urgent anyway.
  • https://github.com/cylc/cylc-flow/issues/5907
    • A small cylc-flow side change to make the resolvers cylc-uiserver imports slightly more efficient.

oliver-sanders avatar Jan 04 '24 16:01 oliver-sanders

Early investigations using https://github.com/cylc/cylc-uiserver/pull/546:

  • As expected, there is no single source of the problem :(
  • Resolvers
    • ~10% of CPU time for my example was used by cylc.flow.network.graphql.IgnoreFieldMiddleware.resolve and it's callchain.
    • This is the cost of actually computing the response, but it's only ~10% of the CPU time?!
    • iscoroutinefunction called here took a reasonable amount of time, it's possible that upgrading our dependencies may remove the requirement to check for synchronous functions?
  • Asyncio
    • Asycio is processing a very large number of tasks and taking quite a while doing it.
    • The ensure_future call (now deprecated), which calls create_task which took a whopping ~15% of the CPU time in my example.
    • I think these tasks are resolvers for individual fields, hence why there are so many of them.
    • Uvloop has much better throughput characteristics so may help here https://github.com/cylc/cylc-uiserver/issues/545.
  • Promise
    • It's hard to tell, but it looks like the overheads of the Promise library may be reasonable.
    • Promise is an async library used by graphql-ws (and other dependencies?), I think as a stopgap solution in a time when Python async support wasn't what it is now.
    • I expect there may be an improvement to cutting this out (especially in combination with uvloop).
  • to_snake_case
    • This simple method gets hammered (10's of thousands of calls).
    • Ideally we would reduce the number of calls, but caching would also help.
    • The implementation can be improved too as it's compiling a regex for every call (upstream change).
  • Tokens
    • ~3% of runtime was spent tokenising/detokenising IDs.
    • Not the most expensive routines by a long shot, but low hanging fruit.
    • With this PR: https://github.com/cylc/cylc-flow/pull/5769, Tokens objects will be __hash__'able opening the door to caching.

oliver-sanders avatar Jan 04 '24 16:01 oliver-sanders

The promise is gone when I get around to (hopefully soon) upgrading Graphene, and rewriting our own graphql-ws (as this uses promises too) .. It appears, promise was created by the original developer of Graphene who (I assume) also (in addition to async support) found it a useful way of translating graphql libraries of javascript to python equivalent (could be wrong)..

dwsutherland avatar Feb 06 '24 23:02 dwsutherland

The other thing we do is recursively sift through the GraphQL query result (in the middleware) and strip out all the null fields... Not exactly sure how expensive this is.. Think I use the promise library for this.

dwsutherland avatar Feb 07 '24 01:02 dwsutherland

to_snake_case

This simple method gets hammered (10's of thousands of calls).

Sorry if I've missed relevant discussions elsewhere (just back from leave) - but can we just rename all affected variables using JS conventions? It'll look ugly in a Python program, but the payoff sounds big.

hjoliver avatar Feb 09 '24 00:02 hjoliver

We could, but it would be easier to just cache the output of the method or re-write the code so that method doesn't get hammered, there's a separate issue for this in cylc-flow.

oliver-sanders avatar Feb 09 '24 10:02 oliver-sanders