Scheduler strict priority option

Open b-bondurant opened this issue 4 years ago • 6 comments

ARTIQ Feature Request

Problem this request addresses

Currently, the scheduler only considers prepared experiments when deciding which one to run() next. This behavior makes sense for maximizing wall-clock use of the core device, but it doesn't guarantee strict priority enforcement across all experiments in the pipeline (i.e. including experiments that are still pending or preparing).

Example scenario: Experiments A and B are scheduled with RIDs 1 and 2 respectively, and with the same priority, let's say priority = 0. Experiment A will prepare, then run, and B will prepare while A is running. Suppose now another experiment, C, is submitted with priority = 1. C will take precedence over B (which I'm assuming is at prepare_done now) and start preparing, but if A finishes running before C finishes preparing, then B will run before C even though it has a lower priority. This example is somewhat of an edge case, but it is the simplest demonstration of this possibly undesired behavior; there are more realistic cases in which this could occur. It has become an issue for us as we've started to create experiments that submit other (higher priority) experiments while they're running.
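The scenario can be reproduced in a small toy model. This is only a sketch: `Run` and `pick_next_to_run` are hypothetical names, not ARTIQ scheduler code, and the model ignores due dates and everything else the real scheduler tracks. It assumes that among prepared runs, higher priority wins and ties go to the lower RID.

```python
from dataclasses import dataclass


@dataclass
class Run:
    rid: int
    priority: int
    status: str  # "pending", "preparing" or "prepare_done"


def pick_next_to_run(runs):
    """Toy model of the behavior described above: only prepare_done runs are candidates."""
    ready = [r for r in runs if r.status == "prepare_done"]
    if not ready:
        return None
    # Higher priority first; ties broken in favor of the lower (earlier) RID.
    return max(ready, key=lambda r: (r.priority, -r.rid))


# A (RID 1) has just finished running. B is prepared, C is still preparing.
pipeline = [
    Run(rid=2, priority=0, status="prepare_done"),  # B
    Run(rid=3, priority=1, status="preparing"),     # C
]
chosen = pick_next_to_run(pipeline)
print(chosen.rid)  # B is chosen even though C has the higher priority
```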

Describe the solution you'd like

IMO the most obvious/intuitive, but also probably the most intrusive solution would be to add an optional flag (set to False by default, of course, so as not to silently change the scheduler behavior) when starting the scheduler for strict_priority or something to that effect. If the flag is True, then when the scheduler decides what to run next, it will look at pending/preparing experiments in addition to prepare_done and, if there is an experiment in the pipeline that would take precedence over any prepare_done experiments, then the scheduler will wait for that experiment to become ready to run.
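As a rough illustration of the proposed selection rule, here is a minimal sketch. It is not a patch against the ARTIQ scheduler: `Run`, `precedence` and `pick_next_to_run` are hypothetical names, and the tie-break by lower RID is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Run:
    rid: int
    priority: int
    status: str  # "pending", "preparing" or "prepare_done"


def precedence(run):
    # Higher priority wins; among equal priorities the lower RID wins.
    return (run.priority, -run.rid)


def pick_next_to_run(runs, strict_priority=False):
    """Return the run to start next, or None if the scheduler should wait."""
    ready = [r for r in runs if r.status == "prepare_done"]
    if not ready:
        return None
    best = max(ready, key=precedence)
    if strict_priority:
        # Also look at runs that have not finished preparing yet.
        upstream = [r for r in runs if r.status in ("pending", "preparing")]
        if any(precedence(r) > precedence(best) for r in upstream):
            return None  # wait for the higher-precedence run to become ready
    return best


b = Run(rid=2, priority=0, status="prepare_done")
c = Run(rid=3, priority=1, status="preparing")

default_choice = pick_next_to_run([b, c])                       # B, as today
strict_choice = pick_next_to_run([b, c], strict_priority=True)  # None: wait for C
c.status = "prepare_done"
strict_after = pick_next_to_run([b, c], strict_priority=True)   # C
```

With the flag left at its default of False, the selection is unchanged, which matches the intent of not silently altering existing behavior.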

Another option would be to modify the behavior of the flush flag. The current behavior actually might be considered a bug - there isn't much documentation on the flush flag so I'm not sure exactly what the intended behavior is. Currently, once an experiment enters the flushing "stage", it prevents any experiments behind it in the pipeline (even experiments with the same priority, but a higher RID) from preparing (and thus from running). That also includes higher priority experiments that are submitted after the first experiment enters the flushing stage. My proposed change would make the flushing stage non-blocking, i.e. stop it from preventing same/higher priority experiments from entering the prepare stage. How this relates to strict priority scheduling: if the user were to set flush=True for all experiments (or at least all experiments they want to guarantee strict scheduling for), then this non-blocking behavior would make it so that experiments which are submitted while another experiment is running would all accumulate in a sort of "queue" of flushing experiments, and then once the first experiment finished running they would prepare, and subsequently run, in strict priority order.

Additional context

While I did say that adding a flag to the scheduler seemed like the most intuitive option to me, I think the best solution in terms of efficacy and minimizing changes to the scheduler would be to change/fix the flushing behavior. It seems unlikely to me that many users (if any) are depending on the current behavior, although if I'm wrong about that then of course I would reconsider my opinion.

b-bondurant avatar Nov 17 '20 22:11 b-bondurant

Thanks for posting this, @b-bondurant. I've seen similar issues locally at UMD. @sbourdeauducq @dnadlinger.

Possibly related: 966ed5d0135cd32f7f4cdbba049cc28a394c6884 by @dnadlinger.

drewrisinger avatar Dec 04 '20 17:12 drewrisinger

My referenced commit shouldn't be related, as it only fixed cases where runs were mistakenly not prepared at all (whereas here, the issue is with the intended priority semantics of the scheduler).

Another, very simple solution would be to add a mode in which the prepare phase is skipped entirely, and prepare() is just called when the experiment runs.
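A toy sketch of such a mode, with no prepare-ahead at all, might look like the following. `SerialScheduler` is a hypothetical class for illustration only and does not reflect the actual scheduler internals.

```python
import heapq


class SerialScheduler:
    """Toy scheduler without pipelining: prepare happens right before run."""

    def __init__(self):
        self._heap = []
        self._next_rid = 1
        self.log = []

    def submit(self, name, priority=0):
        rid = self._next_rid
        self._next_rid += 1
        # heapq is a min-heap, so negate the priority; the RID breaks ties.
        heapq.heappush(self._heap, (-priority, rid, name))
        return rid

    def run_next(self):
        if not self._heap:
            return None
        _, _, name = heapq.heappop(self._heap)
        self.log.append(("prepare", name))  # stands in for prepare()
        self.log.append(("run", name))      # stands in for run()
        return name


sched = SerialScheduler()
sched.submit("A", priority=0)
sched.submit("B", priority=0)
first = sched.run_next()       # A
sched.submit("C", priority=1)  # submitted while A would be running
second = sched.run_next()      # C, since nothing was prepared ahead of time
third = sched.run_next()       # B
```

Because the choice is made only when the core device becomes free, the run order follows priority strictly, at the cost of the prepare time no longer overlapping with the previous run.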

I wonder whether flush is actually in use (perhaps at NIST)? I've been avoiding thinking about changing its behaviour for exactly the reasons you mention: it's badly documented, and we aren't actually using it at all.

dnadlinger avatar Dec 04 '20 18:12 dnadlinger

> Another, very simple solution would be to add a mode in which the prepare phase is skipped entirely, and prepare() is just called when the experiment runs.

Yeah, that sounds very similar to the behavior I was describing, but more explicit than using the flush flag which is nice.

One characteristic that both methods share, though, is the subversion of the pipelining. In the scenario I'm running into, it's really just the run phase that I care about running in strict priority order, so there really isn't any need to prevent the rest of the pipeline from operating the way it currently does. However, for completely strict priority order (i.e. including the prepare phase), something like what you're suggesting seems necessary. And we might even consider including the analyze phase as well, effectively removing all pipelining from the scheduler.

b-bondurant avatar Dec 04 '20 18:12 b-bondurant

In the original design discussions for ARTIQ, the purpose of flush was to ensure that no experiments were prepared during the run of the preceding experiment, for example if you want to guarantee that dataset values modified by the running experiment were fully updated before any subsequent experiments pulled their values in their prepare() stage. This issue of experiments being prepared with old values of datasets before the preceding experiments can finish updating them is perhaps less important now, with the ability to store some values on the core device that persist across kernels, but in general it is handy. There is a time cost for the loss of pipelining, of course. At the time, we were not really considering the case that @b-bondurant is describing, which is certainly valid. But hopefully this sheds some light on the rationale for the current flush behavior.

I think a strict_priority flag for the scheduler (defaulting to False) that considers both experiments awaiting prepare and experiments at prepare_done seems like a reasonable option.

dhslichter avatar Dec 05 '20 00:12 dhslichter

ping @b-bondurant is a strict_priority flag still something that feels important?

dhslichter avatar Sep 14 '21 15:09 dhslichter

@dhslichter oops, sorry for letting this thread die. I developed a workaround that we're pretty happy with - although I think it wouldn't actually be relevant for the specific example scenario I described since it requires an explicit call in order for a lower-priority experiment to give way to higher priority ones.

In general I think a strict_priority flag in the scheduler itself could still be useful, but afaik it's not something we desperately need anymore. For any experiments that we know might need to be superseded, we can just use the above workaround.

b-bondurant avatar Sep 14 '21 17:09 b-bondurant