artiq
Scheduler strict priority option
ARTIQ Feature Request
Problem this request addresses
Currently, the scheduler only looks at prepared experiments when deciding which one to run() next. While this behavior makes sense for maximizing use of the core device in terms of wall-clock time, it doesn't guarantee strict priority enforcement across all experiments in the pipeline (i.e. including experiments that are still pending or preparing).
Example scenario: Experiments A and B are scheduled with RIDs 1 and 2 respectively, and with the same priority, let's say priority = 0. Experiment A will prepare, then run, and B will prepare while A is running. Suppose now another experiment, C, is submitted with priority = 1. C will take precedence over B (which I'm assuming is at prepare_done now) and start preparing, but if A finishes running before C finishes preparing, then B will run before C even though it has a lower priority. This example is somewhat of an edge case, but it is the simplest demonstration of this possibly undesired behavior; there are more realistic cases in which this could occur. It has become an issue for us as we've started to create experiments that submit other (higher-priority) experiments while they're running.
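To make the ordering problem concrete, here is a toy Python model of the "pick what to run next" step as described above. It is not ARTIQ's scheduler code: the `Run` class, the status strings and `next_to_run_current` are hypothetical stand-ins, and the tie-breaking rule (higher priority first, then lower RID) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    rid: int
    priority: int
    status: str  # simplified subset of scheduler states, e.g. "preparing", "prepare_done"


def sort_key(run: Run):
    # Assumption: higher priority wins; ties are broken by the lower (earlier) RID.
    return (-run.priority, run.rid)


def next_to_run_current(runs: List[Run]) -> Optional[Run]:
    """Simplified model of the current rule: only prepare_done runs are candidates."""
    ready = [r for r in runs if r.status == "prepare_done"]
    return min(ready, key=sort_key) if ready else None


# Snapshot of the example at the moment A (RID 1) finishes running:
# B (RID 2, priority 0) is prepare_done, C (RID 3, priority 1) is still preparing.
runs = [Run(rid=2, priority=0, status="prepare_done"),
        Run(rid=3, priority=1, status="preparing")]
chosen = next_to_run_current(runs)
print(chosen.rid)  # 2: B runs before the higher-priority C
```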
Describe the solution you'd like
IMO the most obvious/intuitive, but probably also the most intrusive, solution would be to add an optional flag when starting the scheduler, strict_priority or something to that effect (set to False by default, of course, so as not to silently change the scheduler behavior). If the flag is True, then when the scheduler decides what to run next, it will look at pending/preparing experiments in addition to prepare_done ones and, if there is an experiment in the pipeline that would take precedence over every prepare_done experiment, the scheduler will wait for that experiment to become ready to run.
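A minimal sketch of what the strict_priority selection rule could look like, again as a toy Python model rather than a patch against ARTIQ. The `Run` class, `PIPELINE_STATES` and `next_to_run` are hypothetical, and the ordering (higher priority first, then lower RID) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical subset of states that count as "in the pipeline" for strict mode.
PIPELINE_STATES = ("pending", "preparing", "prepare_done")


@dataclass
class Run:
    rid: int
    priority: int
    status: str


def sort_key(run: Run):
    # Assumption: higher priority first, ties broken by lower RID.
    return (-run.priority, run.rid)


def next_to_run(runs: List[Run], strict_priority: bool = False) -> Optional[Run]:
    """Return the run to start next, or None if the scheduler should wait."""
    ready = [r for r in runs if r.status == "prepare_done"]
    if not ready:
        return None
    best_ready = min(ready, key=sort_key)
    if not strict_priority:
        # Default behaviour: the best prepared run starts immediately.
        return best_ready
    in_pipeline = [r for r in runs if r.status in PIPELINE_STATES]
    best_overall = min(in_pipeline, key=sort_key)
    if sort_key(best_overall) < sort_key(best_ready):
        # A higher-precedence run is still pending/preparing: wait for it.
        return None
    return best_ready


# The example scenario: B (priority 0) is prepared, C (priority 1) is not yet.
runs = [Run(2, 0, "prepare_done"), Run(3, 1, "preparing")]
print(next_to_run(runs).rid)                     # 2
print(next_to_run(runs, strict_priority=True))   # None: wait for C
runs[1].status = "prepare_done"
print(next_to_run(runs, strict_priority=True).rid)  # 3
```

This sketch ignores due dates and paused runs; a real implementation would presumably need to avoid waiting on a pending run that is not yet eligible to prepare.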
Another option would be to modify the behavior of the flush flag. The current behavior actually might be considered a bug - there isn't much documentation on the flush flag so I'm not sure exactly what the intended behavior is. Currently, once an experiment enters the flushing "stage", it prevents any experiments behind it in the pipeline (even experiments with the same priority, but a higher RID) from preparing (and thus from running). That also includes higher priority experiments that are submitted after the first experiment enters the flushing stage. My proposed change would make the flushing stage non-blocking, i.e. stop it from preventing same/higher priority experiments from entering the prepare stage. How this relates to strict priority scheduling: if the user were to set flush=True for all experiments (or at least all experiments they want to guarantee strict scheduling for), then this non-blocking behavior would make it so that experiments which are submitted while another experiment is running would all accumulate in a sort of "queue" of flushing experiments, and then once the first experiment finished running they would prepare, and subsequently run, in strict priority order.
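The difference between the current blocking flush and the proposed non-blocking one can be sketched as a toy Python model. This encodes the behavior as described in this issue, not as read from ARTIQ's source; `Run`, `next_to_prepare` and the `nonblocking_flush` parameter are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    rid: int
    priority: int
    status: str  # e.g. "running", "flushing", "pending"


def sort_key(run: Run):
    # Assumption: higher priority first, ties broken by lower RID.
    return (-run.priority, run.rid)


def next_to_prepare(runs: List[Run], nonblocking_flush: bool = False) -> Optional[Run]:
    """Pick the next pending run allowed to start preparing (simplified model)."""
    flushing = [r for r in runs if r.status == "flushing"]
    pending = [r for r in runs if r.status == "pending"]
    if flushing:
        if not nonblocking_flush:
            # Behaviour described above: a flushing run blocks every other run,
            # including higher-priority runs submitted after it started flushing.
            return None
        # Proposed: flushing runs keep waiting, but same- or higher-priority
        # runs may still enter the prepare stage.
        floor = max(r.priority for r in flushing)
        pending = [r for r in pending if r.priority >= floor]
    return min(pending, key=sort_key) if pending else None


# A (RID 1) is running, B (RID 2, flush=True) is flushing,
# and C (RID 3, priority 1) is submitted afterwards.
runs = [Run(1, 0, "running"), Run(2, 0, "flushing"), Run(3, 1, "pending")]
print(next_to_prepare(runs))                              # None: C is blocked
print(next_to_prepare(runs, nonblocking_flush=True).rid)  # 3
```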
Additional context
While I did say that adding a flag to the scheduler seemed like the most intuitive option to me, I think the best solution in terms of efficacy and minimizing changes to the scheduler would be to change/fix the flushing behavior. It seems unlikely to me that many users (if any) are depending on the current behavior, although if I'm wrong about that then of course I would reconsider my opinion.
Thanks for posting this, @b-bondurant. I've seen similar issues locally at UMD. @sbourdeauducq @dnadlinger.
Possibly related: 966ed5d0135cd32f7f4cdbba049cc28a394c6884 by @dnadlinger.
My referenced commit shouldn't be related, as it only fixed cases where runs were mistakenly not prepared at all (whereas here, the issue is with the intended priority semantics of the scheduler).
Another, very simple solution would be to add a mode in which the prepare phase is skipped entirely, and prepare() is just called when the experiment runs.
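A toy Python model of such a mode, under the assumption that each run occupies one time step and that prepare() is called only once a run has been selected. Because nothing is prepared ahead of time, selection can consider every queued run. The function name and the `(rid, priority, name)` tuple shape are hypothetical.

```python
from typing import Dict, List, Tuple

Submission = Tuple[int, int, str]  # (rid, priority, name); hypothetical shape


def simulate_deferred_prepare(submissions: Dict[int, List[Submission]]) -> List[str]:
    """Toy model of a hypothetical mode with no separate prepare phase.

    submissions maps a time step to the runs submitted at that step.
    Each selected run calls prepare() and then run() within one step.
    """
    log: List[str] = []
    queue: List[Submission] = []
    step = 0
    last_step = max(submissions, default=-1)
    while queue or step <= last_step:
        queue.extend(submissions.get(step, []))
        if queue:
            # Assumption: higher priority first, ties broken by lower RID.
            chosen = min(queue, key=lambda s: (-s[1], s[0]))
            queue.remove(chosen)
            name = chosen[2]
            log.append(f"{name}.prepare")
            log.append(f"{name}.run")
        step += 1
    return log


# The example scenario: A and B submitted at step 0, C (priority 1) at step 1.
print(simulate_deferred_prepare({0: [(1, 0, "A"), (2, 0, "B")], 1: [(3, 1, "C")]}))
# ['A.prepare', 'A.run', 'C.prepare', 'C.run', 'B.prepare', 'B.run']
```

The cost of this approach is that prepare() time is no longer overlapped with the preceding experiment's run().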
I wonder whether flush is actually in use (perhaps at NIST)? I've been avoiding thinking about changing its behaviour for exactly the reasons you mention: it's badly documented, and we aren't actually using it at all.
"Another, very simple solution would be to add a mode in which the prepare phase is skipped entirely, and prepare() is just called when the experiment runs."
Yeah, that sounds very similar to the behavior I was describing, but it's more explicit than using the flush flag, which is nice.
One characteristic that both methods share, though, is that they subvert the pipelining. In the scenario I'm running into, it's really just the run phase that I care about running in strict priority order, so there isn't any need to prevent the rest of the pipeline from operating the way it currently does. However, for completely strict priority order (i.e. including the prepare phase), something like what you're suggesting seems necessary. We might even consider including the analyze phase as well, effectively removing all pipelining from the scheduler.
In the original design discussions for ARTIQ, the purpose of flush was to ensure that no experiments were prepared during the run of the preceding experiment, for example if you want to guarantee that dataset values modified by the running experiment were fully updated before any subsequent experiments pulled their values in their prepare() stage. This issue of experiments being prepared with old values of datasets before the preceding experiments can finish updating them is perhaps less important now, with the ability to store some values on the core device that persist across kernels, but in general it is handy. There is a time cost for the loss of pipelining, of course. At the time, we were not really considering the case that @b-bondurant is describing, which is certainly valid. But hopefully this sheds some light on the rationale for the current flush behavior.
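The stale-dataset hazard that flush guards against can be illustrated with a small Python model. A plain dict stands in for datasets (in a real experiment this would go through ARTIQ's dataset accessors such as get_dataset/set_dataset), and the dataset name, values and function names are hypothetical. The pipelined branch models the worst case, where prepare() reads the value before the preceding run() writes it.

```python
from typing import Dict


def calibration_run(datasets: Dict[str, float]) -> None:
    # Stand-in for the run() of the preceding experiment, which updates a dataset.
    datasets["freq"] = 101.5


def measurement_prepare(datasets: Dict[str, float]) -> float:
    # Stand-in for the prepare() of the next experiment, which reads that dataset.
    return datasets["freq"]


def value_seen_by_next_experiment(flush: bool) -> float:
    datasets = {"freq": 100.0}  # hypothetical dataset name and values
    if flush:
        # flush: the next experiment is only prepared after the preceding run is done.
        calibration_run(datasets)
        return measurement_prepare(datasets)
    # Pipelined (worst case): prepare() reads before the preceding run() updates.
    value = measurement_prepare(datasets)
    calibration_run(datasets)
    return value


print(value_seen_by_next_experiment(flush=False))  # 100.0 (stale)
print(value_seen_by_next_experiment(flush=True))   # 101.5
```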
I think the idea of a strict_priority flag for the scheduler that considers experiments awaiting prepare as well as those that are already prepare_done seems like a reasonable option (defaulting to False).
ping @b-bondurant: is a strict_priority flag still something that feels important?
@dhslichter oops, sorry for letting this thread die. I developed a workaround that we're pretty happy with, although I think it wouldn't actually be relevant for the specific example scenario I described, since it requires an explicit call in order for a lower-priority experiment to give way to higher-priority ones.
In general I think a strict_priority flag in the scheduler itself could still be useful, but afaik it's not something we desperately need anymore. For any experiments that we know might need to be superseded, we can just use the above workaround.