
non-sequential runahead limiting

Open oliver-sanders opened this issue 4 years ago • 10 comments

The old "max active cycle points" mechanism really referred to the cycle point range rather than the number of "active" cycle points. Consequently, this feature was merged into the runahead limit.

We should consider whether there is still a use case for non-sequential runahead limiting that existing functionality cannot cater to.

Somewhat related to the outcome of #4256. See also #3874, #3367.

Pull requests welcome!

oliver-sanders avatar Jul 20 '21 12:07 oliver-sanders

See also https://github.com/cylc/cylc-flow/issues/3667#issuecomment-705533994, in which it looks like we agreed to name the new non-sequential limit config setting "active cycle point limit", if we implement it.

MetRonnie avatar Jul 20 '21 12:07 MetRonnie

Ah, good, I knew there had been an issue about it at some point.

oliver-sanders avatar Jul 20 '21 12:07 oliver-sanders

We should consider whether there is still a use case for non-sequential runahead limiting that existing functionality cannot cater to.

If there is, it's much easier to implement :+1: (just count active cycle points, obviously)

hjoliver avatar Jul 21 '21 11:07 hjoliver

An active point based limit makes more sense from a scheduler activity limiting perspective (which was the original purpose of runahead limiting, according to me :grin: ).

Now, however, we have the ability to do proper cycle-point-independent task pool limiting (via the default queue, although this needs to be documented and maybe tweaked so we can retain global limiting whilst using multiple queues).

hjoliver avatar Oct 08 '21 06:10 hjoliver

Removed the question label as I think we have agreed that this would be sensible, moving to "some day" for now until we encounter a use case that pushes it up in priority.

oliver-sanders avatar Apr 22 '22 10:04 oliver-sanders

Hello,

I'm struggling with a flow that runs a set of fully independent (non-sequential, but dated) tasks. My problem is that I'd like to have N active jobs, even if some of the tasks are retrying for a couple of minutes.

I'd like to download a wide-ranging set of days with a large dispersion in time (sometimes it's 100 days, sometimes one week). Sometimes the files are not downloaded at the first attempt, so we need to retry a couple of times, and sometimes they might not exist. My problem is that I'd like to have N active jobs, let's say N=8. I saw there used to be an option to do this, but unless I'm wrong it was replaced by the runahead limit. However, that computes the distance in days between my first active job and my last one, so it regularly happens that I get only one active job. So I tried using:

[scheduling]
    [[queues]]
        [[[big_jobs_queue]]]
            limit = 8
            members = IBI

But it didn't solve my problem by itself... as soon as I have more than the default 4 runahead dates, the workflow waits, so I also added a runahead specification:

runahead limit = P100

This partially solves my problem, but 100 is just arbitrary... and I'd like to replace this number with a function that adapts to each request. Also, by doing this I get a huge list of scheduled jobs in my scheduler... so I'd say it is quite a nasty solution. Is there a way to do this properly?

Thank you in advance.

cgarciamolina avatar Apr 14 '25 10:04 cgarciamolina

Hi,

  • The runahead limit determines how far Cylc will schedule ahead.
  • The [[queues]] limit determines how many jobs Cylc will submit in parallel.

Combining a high runahead limit with a low queue limit as you have done should work fine.

Note, the runahead limit can be defined as a number of cycles (e.g. P100 - one hundred cycle points) or a time duration (e.g. P1D - a period of one day), which might be more convenient.
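For illustration, here is a minimal sketch of that combination (the queue name and the IBI family/task names follow the snippet above; the daily cycling and download task are assumptions for the example):

```
[scheduling]
    initial cycle point = 2025-01-01
    # schedule up to one hundred cycles ahead of the oldest active cycle
    runahead limit = P100
    [[queues]]
        [[[big_jobs_queue]]]
            # but only ever submit 8 of those scheduled jobs at once
            limit = 8
            members = IBI
    [[graph]]
        P1D = "download"
[runtime]
    [[IBI]]
    [[download]]
        inherit = IBI
```

With this, Cylc keeps up to 100 cycles "in play", while the queue releases at most 8 jobs at a time regardless of which cycle they belong to.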

If this is a "real time" workflow, you might want to use a @wallclock xtrigger. These cause tasks to wait for a specific time before they are submitted. As a side effect of using a @wallclock xtrigger, Cylc will only "schedule" one task beyond the "wallclock" time which may reduce the number of tasks you see in the GUI.
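As a sketch of the clock-trigger approach (assuming Cylc 8, where the built-in wall_clock xtrigger can be declared under [[xtriggers]]; the task name and zero offset are illustrative):

```
[scheduling]
    initial cycle point = 2025-01-01
    [[xtriggers]]
        # hold each cycle's tasks until the real-world time reaches the cycle point
        clock = wall_clock(offset=PT0H)
    [[graph]]
        P1D = "@clock => download"
[runtime]
    [[download]]
```

Tasks in cycles whose wallclock time has not yet arrived will simply wait on the xtrigger rather than cluttering the active window.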

oliver-sanders avatar Apr 14 '25 10:04 oliver-sanders

Hi,

Thanks for your reply. Yes indeed, I'm using the @wallclock trigger, so I don't actually have 100 tasks... but 50 or so. And thanks for the tip about the period of one day; you're right, it's better in this case.

cgarciamolina avatar Apr 14 '25 11:04 cgarciamolina

I'd like to have N active jobs, even if some of the tasks are retrying for a couple of minutes.

A retrying task goes back to the waiting state (prior to resubmitting), so an internal queue with a limit of N should do exactly that: other tasks will submit up to N during that wait time.
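To make that concrete, a minimal sketch of a task with automatic retries behind a queue (the script, delay values, and retry count here are hypothetical):

```
[scheduling]
    initial cycle point = 2025-01-01
    [[queues]]
        [[[one]]]
            limit = 5
            members = downloader
    [[graph]]
        P1D = "downloader"
[runtime]
    [[downloader]]
        script = "my-download-command"  # hypothetical download command
        # on failure, retry after 2 minutes, up to 3 times; while the task is
        # waiting out the delay it releases its queue slot to other tasks
        execution retry delays = 3*PT2M
```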

hjoliver avatar Apr 15 '25 00:04 hjoliver

It might help to play around with a toy workflow like this, to understand the effect of runahead limiting and internal queues:

[scheduler]
    cycle point format = %Y
[scheduling]
    initial cycle point = 2025
    final cycle point = 2500
    runahead limit = P10
    [[queues]]
        [[[one]]]
            limit = 5
            members = downloader
    [[graph]]
        P1Y = "downloader => process"
[runtime]
    [[downloader, process]]
        script = "sleep $((10 + RANDOM % 10))"

Instances of downloader will spawn out to the runahead limit because, with no inter-cycle dependence or clock triggers to constrain them, they are by definition all ready to run at once.

But then the queue limit constrains how many such "ready to run" tasks are allowed to submit their jobs at once, so you end up with a lot of tasks waiting on the queue.

You could closely match the queue limit and the number of runahead cycles to avoid a large number of waiting tasks, but that could sometimes constrain throughput more than you want if some downloaders take much longer than others to run, or retry, because the runahead limit is based only on the oldest active cycle.

Note also the UIs (GUI and Tui) both allow filtering by task state, so you can filter out waiting tasks to get rid of the UI clutter.

hjoliver avatar Apr 15 '25 00:04 hjoliver