Stageless: a complete scheduling overhaul
This RFC is now ready for final review. Feedback is welcome from users at all levels! Do you think this design is easy to understand, useful, and correct? How could the proposed API be improved?
TL;DR
- Systems: now stored in the `World`
- Exclusive systems: no longer special-cased
- System sets: now just pure metadata, define order and conditions of groups of systems
- Schedules: now the executable instances of system sets, also stored in the `World`
- Run criteria: replaced by combineable, non-mutating, `bool`-returning functions
- States: now powered by new exclusive system + schedule combo instead of run criteria
- Fixed timestep: now powered by new exclusive system + schedule combo instead of run criteria
- Commands: now applied with an exclusive system
- Stages: yeet
- All of the awful bugs with stages, states, and run criteria: replaced with new, yet-to-be-discovered bugs!
Plus, some shiny new toys: running systems in commands and modifying schedules at runtime!
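For a rough sense of what this buys, app configuration under the proposal might look something like the pseudocode-style sketch below (system and condition names are hypothetical, and the exact builder methods are still being debated further down this thread):

```rust
// Sketch only; not final API. Ordering and gating happen with plain functions,
// and command application is just an ordinary (exclusive) system you place yourself.
app.add_system(spawn_enemies)
    .add_system(
        (tick_wave_timer, update_wave_progress)
            .chain()
            .run_if(wave_in_progress), // non-mutating, `bool`-returning condition
    )
    .add_system(flush_commands.after(spawn_enemies).before(ai_targeting))
    .add_system(ai_targeting);
```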
Prototype
iyes_loopless is a Bevy 0.6-compatible third-party implementation of the core ideas of how this RFC handles run criteria and states. Go try it out!
Context
This is the culmination of many months of discussion, work, debate and mind-melting from @maniwani (Joy), @hymm (WrongShoe) and myself; they should be considered equal coauthors on this RFC. If you're curious about the background, check out https://github.com/bevyengine/bevy/discussions/2801 to get a sense of the tangled mess we're trying to clean up.
Special thanks to @Ratysz and the rest of the ECS crew for all of the feedback and guidance throughout the process.
Meta
What is the best way to handle the migration process from an organizational perspective?
- Get an RFC approved, and then merge a massive PR developed on an off-org branch?
- Straightforward, but very hard to review
- Risks divergence and merge conflicts
- Use an official branch with delegated authority?
- Easier to review, requires delegation, ultimately creates a large PR that needs to be reviewed
- Risks divergence and merge conflicts
- Develop `bevy_schedule3` on the `main` branch as a partial fork of `bevy_ecs`?
- Some annoying machinery to set up
- Requires more delegation and trust to the ECS team
- Avoids divergence and merge conflicts
- Clutters the main branch
I worry that this may end up clunky and frustrating to use without automatic management/insertion of the commands and states flushing systems. I'd love to be proven wrong.
> I worry that this may end up clunky and frustrating to use without automatic management/insertion of the commands and states flushing systems.
In all honesty, I agree. I wanted to start with the more straightforward, explicit design for the first draft. I suspect that automatic inference will both be less frustrating and faster for end users though.
> I worry that this may end up clunky and frustrating to use without automatic management/insertion of the flushing systems.
Frustration over stages sacrificing potential concurrency was a big motivator for this effort. IMO explicit flush systems are hands down better than implicit stage flushes.
To clarify, this isn't "manual insertion forever." Automatic insertion can build on top of this. It's just going to require more system annotation (or type system magic), so the schedule builder can know which components any system can affect with commands.
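For instance, a minimal sketch of the explicit style being discussed, with hypothetical system names:

```rust
// `spawn_minions` queues spawns via `Commands`; `assign_targets` must see the results.
// With explicit flushing, the user inserts and orders the flush point themselves:
app.add_system(spawn_minions)
    .add_system(flush_commands.after(spawn_minions).before(assign_targets))
    .add_system(assign_targets);
```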
Maybe every system that is added could have an `IdLabel(SystemId)` added to it. It might simplify the implementation, since only ordering between labels would be required.
I'm very excited by this proposal. It both gives us finer-grained control over the scheduler and simplifies the scheduler and API at the same time.
I think this is simpler. It makes the pitfalls explicit and forces documenting them (as with the `flush_commands` system).
Hmm, I personally don't like the use of the word criterion as singular of criteria. I think here we can find good reasons to use criteria as singular.
A note about `before` and `after`. In order to make this work as expected (`a` before `b` before `c` would imply `a` before `c`), it is not sufficient to simply add a dependency between `a` and `b` and one between `b` and `c`. If `a` accesses X, `b` accesses Y and `c` accesses X, then `a` will never conflict with `b` and `b` will never conflict with `c`, but `a` conflicts with `c`. So a transitive closure would need to be constructed over the `after` and `before` relations.
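As a toy illustration of that closure (not the actual schedule-building code), repeatedly propagating edges until a fixed point is reached would add the missing `a` before `c` edge:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Toy illustration of why the builder needs a transitive closure over `before`/`after`:
// with only the edges a→b and b→c, an executor that looks at data access rather than
// labels could otherwise run `a` and `c` concurrently even though both access X.
fn transitive_closure(mut edges: BTreeMap<char, BTreeSet<char>>) -> BTreeMap<char, BTreeSet<char>> {
    loop {
        let mut added = false;
        let snapshot = edges.clone();
        for (node, successors) in &snapshot {
            for successor in successors {
                for indirect in snapshot.get(successor).into_iter().flatten() {
                    added |= edges.get_mut(node).unwrap().insert(*indirect);
                }
            }
        }
        if !added {
            return edges;
        }
    }
}

fn main() {
    let edges = BTreeMap::from([
        ('a', BTreeSet::from(['b'])),
        ('b', BTreeSet::from(['c'])),
        ('c', BTreeSet::new()),
    ]);
    // After the closure, 'a' also has an edge to 'c', so the a/c conflict stays ordered.
    println!("{:?}", transitive_closure(edges));
}
```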
Another thing to consider is whether the dependencies should be constructed when the systems could conflict or when they do conflict, i.e., should the `FilteredAccessSet<ComponentId>` or the `Access<ArchetypeComponentId>` be compared? If `Access` is compared, then the schedule would need to be rebuilt every time archetypes change.
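To make the distinction concrete with hypothetical systems: the pair below could conflict at the `ComponentId` level (both write `Transform`, and the filters don't guarantee disjointness), yet if no archetype actually contains both `Player` and `Enemy`, their `ArchetypeComponentId` access never overlaps at runtime:

```rust
use bevy::prelude::*;

#[derive(Component)]
struct Player;
#[derive(Component)]
struct Enemy;

// Both systems write `Transform`, so they *could* conflict (ComponentId level)...
fn move_players(mut query: Query<&mut Transform, With<Player>>) {
    for mut transform in query.iter_mut() {
        transform.translation.x += 1.0;
    }
}

// ...but they only *do* conflict (ArchetypeComponentId level) if some archetype
// holds both `Player` and `Enemy` together with `Transform`.
fn move_enemies(mut query: Query<&mut Transform, With<Enemy>>) {
    for mut transform in query.iter_mut() {
        transform.translation.x -= 1.0;
    }
}
```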
> Another thing to consider is whether the dependencies should be constructed when the systems could conflict or when they do conflict, i.e., should the `FilteredAccessSet<ComponentId>` or the `Access<ArchetypeComponentId>` be compared? If `Access` is compared, then the schedule would need to be rebuilt every time archetypes change.
@bilsen Schedule construction does not consider this. It produces a top-sorted list of systems from the DAG we get from the label ordering. `ComponentId` access is only checked when reporting ambiguities (archetypes don't exist yet). However, the executor does look at `ArchetypeComponentId` access when determining if another system task can be started.
> Schedule construction does not consider this. It produces a top-sorted list of systems from the DAG we get from the label ordering. `ComponentId` access is only checked when reporting ambiguities (archetypes don't exist yet). However, the executor does look at `ArchetypeComponentId` access when determining if another system task can be started.
@maniwani Then I'm not sure I understand how `before` should work in the proposal. How is it decided when there is a dependency between two systems without looking at their accesses? The validity of the sorted list will depend on whether systems can/will conflict with each other and also have a `before` dependency.
My understanding is that the list will be sorted based on the `before` unconditionally, but there's no reason the executor can't 'disregard' this sorting, since running the systems in the other order would be valid. That is, `before` only encodes the relative order in the 'queue', and the `ArchetypeComponent`-level collisions encode whether that ordering matters at the time of execution.
> My understanding is that the list will be sorted based on the `before` unconditionally, but there's no reason the executor can't 'disregard' this sorting, since running the systems in the other order would be valid. That is, `before` only encodes the relative order in the 'queue', and the `ArchetypeComponent`-level collisions encode whether that ordering matters at the time of execution.
Exactly this.
The system graph is completely determined by the order of labels (edit: well, ordering targets now). If system `a` is marked as before system `b`, we add `a` to `b`'s dependencies (honestly, "dependency graph" is a bit of a misnomer). We do warn users about pairs of systems that 1) are in the same level in the graph and 2) write the same `ComponentId`, but every topological order is valid.

When the executor is signaling spawned tasks to start (edit: which are spawned in the sorted order), it checks their "dependencies" and their `ArchetypeComponentId` access. If all a system's dependencies have completed and its access doesn't conflict with another system task currently running, it can start.
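A rough sketch of the readiness check just described (placeholder types; not the actual `bevy_ecs` executor):

```rust
use std::collections::HashSet;

// Stand-in for ArchetypeComponentId-level access; only the shape matters for this sketch.
struct Access {
    reads: HashSet<u32>,
    writes: HashSet<u32>,
}

impl Access {
    fn is_compatible(&self, other: &Access) -> bool {
        // Compatible when neither side writes something the other reads or writes.
        self.writes.is_disjoint(&other.reads)
            && self.writes.is_disjoint(&other.writes)
            && other.writes.is_disjoint(&self.reads)
    }
}

struct SystemMeta {
    dependencies: Vec<usize>, // indices of systems ordered before this one
    access: Access,           // runtime ArchetypeComponentId access
}

// A system task may start once its ordered predecessors have finished and its access
// doesn't conflict with any system currently running.
fn can_start(index: usize, systems: &[SystemMeta], finished: &[bool], running: &[usize]) -> bool {
    let sys = &systems[index];
    sys.dependencies.iter().all(|&dep| finished[dep])
        && running
            .iter()
            .all(|&other| sys.access.is_compatible(&systems[other].access))
}
```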
I'm also not 100% clear on the if-needed semantics, but I assume we'd discard the if-needed dependencies once the list is sorted, and just preserve the strict and atomic ones.
> I'm also not 100% clear on the if-needed semantics, but I assume we'd discard the if-needed dependencies once the list is sorted, and just preserve the strict and atomic ones.
Precisely. The if-needed dependencies get converted to strict or discarded.
On Subgraphs
We should be thinking more explicitly about how we create subgraphs. In the current proposal there are three different ways to define subgraphs: running schedules from an exclusive system, giving several systems the same label, and putting systems into a system set. Schedules are nicely contained, since you can't share labels or system sets between schedules. But labeling and system sets interact, and can do so in ways that are hard to reason about.
The issue is that both methods create subgraphs and these subgraphs can overlap with other subgraphs. Consider two subgraphs that would each represent a unit of work, graph A and graph B:
Now A3 and B2 might depend on the same data from system C1 so the user needs to add a dependency to these systems:
This kind of labeling creates three subgraphs that are intertwined and no longer independent. This could easily lead to spaghetti code, since there is no defined hierarchy between these three subgraphs.
Instead this type of relationship should be expressed as the relationship of C1 to the 2 subgraphs rather than to an individual system. This leaves the subgraphs as opaque units to the higher level graph.
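In pseudocode-style sketch form, with hypothetical set/label and system names, the difference is roughly:

```rust
// Reaching across subgraph boundaries to individual systems (breaks encapsulation):
app.add_system(c1.before(a3).before(b2));

// Expressing the same data dependency against the subgraphs as opaque units:
app.add_system(c1.before(GraphA).before(GraphB));
```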
Being able to create connections across vastly different parts of the system graph is actually too powerful and breaks the encapsulation of subgraphs defined by users or plugins. It's important that the abstraction for subgraphs isolates the systems inside its graph from the effects of systems outside its graph. This will lead to well-defined and understandable behavior.
Besides organization, there are other advantages to encapsulating subgraphs:
- Keeps the number of dependencies on individual systems down. This may be especially important as the graphs grow more complicated. Extra dependencies, while light, have a non-zero effect on performance.
- Makes it easier to check ambiguities. The ambiguity detector should only need to check against the full subgraph's data.
- A natural unit on which to apply atomic subgraph behavior.
Schedules as the main way to make subgraphs
I would propose that we merge system sets and schedules into one abstraction and only allow labeling between systems inside the subgraph. Labeling is not allowed between a system inside the subgraph and a system outside of it. This will give schedules the encapsulation properties argued for above.
To make working with smaller schedules less of a performance footgun, it is important that they are not run from an exclusive system. They should instead be run from an asynchronous system. This will allow the parallel systems of schedules running at the same time to run in parallel. If there are exclusive systems in schedules running at the same time, they will be grouped to run in sequence during the exclusive phase of the schedule.
Besides parallelism, this also:
- Allows ordering around the async system that runs the schedule with normal system labeling.
- Allows an easy place to stick atomic subgraph constraints as they can be captured on running the schedule and released when the schedule is finished.
- For convenience, we can implement an async system for a schedule or create a wrapper, so it is easy to insert into a graph and can be ordered around.
Seems like there's already a bunch of feedback from contributors to the Bevy project. As a user of Bevy, this proposal sounds amazing, and I hope I'll eventually be able to rewrite my game in this proposed style.
> I am a bit confused about the system label and system set merge.
The idea is there are systems and there are groups of systems. I started calling groups of systems "sets" and they're pretty much gonna be implemented as labels with configs (see bevyengine/bevy#4391), meaning you will be able to add sets to an `App` the same way you add systems (and before saying which systems belong to them).
I also want to point out that assigning a system to a set/group (i.e. labeling it) is different from giving it a unique name. Those will be separate methods.
> Saying a group of systems has the same label better describes what we are intending with the new design and makes it easier to picture the group intertwined with other labels, in my opinion.
I think the "set" framing will be significantly more intuitive to users. A label is an abstract thing that has no associated visual (at least not one that matters here). Calling the "group of systems" a "set" is more accurate to how the App
thinks about graphs.
As far as the App
is concerned, when you label a system, you are adding that system to a group, and that entire group can be ordered with respect to other stuff. The group is itself a node in some dependency graph if you "zoom out".
If you give a system two labels, `A` and `B`, you can't have `A` before `B` or `B` before `A`. I think this is much easier to explain with the set framing, e.g. "If a system belongs to multiple sets, those sets overlap/intersect, so you can't have edges between them."
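A tiny hypothetical example of that contradiction (the method names are just the spellings floated later in this thread, not final API):

```rust
// `movement` belongs to both sets, so the sets overlap.
app.add_system(movement.in_set(SetA).in_set(SetB));

// Declaring "SetA before SetB" would now require `movement` to run before itself,
// so an edge between overlapping sets has to be rejected.
```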
> For example, my expectation of putting a run criteria on a set would have been for the criteria to run once and decide if the whole set should be skipped or not.
This is indeed how it will work.
> If I think about a system being part of multiple sets, I would expect it to run multiple times (once per set).
This isn't the case, but I think your intuition here comes from how the `SystemSet` container behaves.
> merge `SystemSetConfig` and `SystemConfig` (currently a set config only stores a system config anyways)
IIUC, I've merged these as much as I could in bevyengine/bevy#4391.
> If we stick with sets, I would like to rename `.to()` to `.in()`.
So would I, but `in` is a reserved keyword, so we can't unfortunately. What about `.in_set()`, `.to_set()`, or `.under()`?
> I am a bit confused about the system label and system set merge.

> The idea is there are systems and there are groups of systems. I started calling groups of systems "sets" and they're pretty much gonna be implemented as labels with configs (see bevyengine/bevy#4391), meaning you will be able to add sets to an `App` the same way you add systems (and before saying which systems belong to them).
Why not go with "group" then? I agree that I might be a bit biased here because I keep associating sets with `SystemSet`, but for me set also has more meaning than just a group of something (e.g. uniqueness, which doesn't fit here at all).
> I also want to point out that assigning a system to a set/group (i.e. labeling it) is different from giving it a unique name. Those will be separate methods.
Could I not just make a set with only one system in it? I think I would prefer that over more API.
> For example, my expectation of putting a run criteria on a set would have been for the criteria to run once and decide if the whole set should be skipped or not.
> This is indeed how it will work.
The RFC currently states the opposite:
> if a run criterion is attached to a system set, a run-criterion-system will be generated for each contained system
> this is essential to ensure that run criteria are checking fresh state without creating very difficult-to-satisfy ordering constraints
> if you need to ensure that all systems behave the same way during a single pass of the schedule or avoid expensive recomputation, precompute the value and store the result in a resource, then read from that in your run criteria instead
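A minimal sketch of that precompute-into-a-resource pattern from the quoted text (resource, component, and system names are hypothetical; Bevy 0.7-era APIs assumed):

```rust
use bevy::prelude::*;

#[derive(Component)]
struct Enemy;

// Hypothetical resource caching a potentially expensive check once per schedule pass.
#[derive(Default)]
struct AnyEnemiesAlive(bool);

// Ordinary system: does the work once and stores the result.
fn update_any_enemies_alive(mut cache: ResMut<AnyEnemiesAlive>, enemies: Query<(), With<Enemy>>) {
    cache.0 = enemies.iter().next().is_some();
}

// Cheap, non-mutating, `bool`-returning criterion that every gated system can share.
fn any_enemies_alive(cache: Res<AnyEnemiesAlive>) -> bool {
    cache.0
}
```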
> So would I, but `in` is a reserved keyword, so we can't unfortunately. What about `.in_set()`, `.to_set()`, or `.under()`?
I would prefer `.in_set()` then :+1:
I have a question regarding the run criteria example:
```rust
.add_system(
    (tick_construction_timer, update_construction_progress)
        .chain()
        .run_if(construction_timer_finished)
)
```
Doesn't this mean "the timer is only ticked and the construction progress is only updated if the timer has finished"? Then it would follow: the timer is not finished -> the timer will not tick -> the timer will never finish -> the construction progress will never be updated.
So to me it would make more sense if the run criterion also handled the timer update.
However, I'm new to Bevy, so maybe I misunderstood the concepts :)
@TimJentzsch you're totally right. Good eye; I just pushed a fix for that.
> @TimJentzsch you're totally right. Good eye; I just pushed a fix for that.
Nice!
I think the comment is outdated now though:
```rust
// This function can be used as a run criterion,
// because it does not mutate data and returns `bool`
```
The comment also makes it sound like mutating data inside runtime criteria is not allowed, does the example still work now?
> The comment also makes it sound like mutating data inside runtime criteria is not allowed, does the example still work now?
Oh right 🤔 Yeah, this wouldn't compile. Let me try again!
@alice-i-cecile Would it maybe work if we keep the separate timer update system, but change the app logic as follows?
```rust
.add_system(update_construction_timer)
.add_system(
    update_construction_progress
        .after(update_construction_timer)
        .run_if(construction_timer_finished)
)
```
I think the timer would then update on every frame; the run criterion would then check if the timer has finished and execute the update if applicable.
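For completeness, a sketch of what those two pieces might look like, assuming a hypothetical `ConstructionTimer` resource wrapping a Bevy `Timer` (Bevy 0.7-era APIs):

```rust
use bevy::prelude::*;

// Hypothetical resource wrapping a Bevy `Timer`.
struct ConstructionTimer(Timer);

// Ordinary system: ticks the timer every frame, whether or not it has finished.
fn update_construction_timer(time: Res<Time>, mut timer: ResMut<ConstructionTimer>) {
    timer.0.tick(time.delta());
}

// Usable as a run criterion under this proposal: immutable access, returns `bool`.
fn construction_timer_finished(timer: Res<ConstructionTimer>) -> bool {
    timer.0.finished()
}
```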
I don't like that the resource for controlling state transitions is called `Fsm`.
It makes Bevy harder to learn, as it makes beginners deal with technical jargon they might not be familiar with, while learning a very important piece of functionality.
For experienced users, abbreviated/acronym names like that are not easily searchable/findable in docs and feel jarringly different from the clean and readable APIs we have elsewhere.
I propose calling it something like `StateControl`. Has "state" in the name, easy to find, easy to understand, and no overlap with the `State` trait.
> I don't like that the resource for controlling state transitions is called `Fsm`. I propose calling it something like `StateControl`. Has "state" in the name, easy to find, easy to understand, and no overlap with the `State` trait.
@inodentry I think you did even better in your awesome `iyes_loopless` by representing states using 2 separate resources: `CurrentState(T)` and `NextState(T)`.
I like your approach more because the current state is queried rarely. Most of the time it is used when registering systems. But state transitioning is used often, and the mentioned separation allows you to avoid querying the current state: you just insert `NextState(T)` using `Commands`. This reduces the number of arguments in systems and should be a little better for performance.
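For reference, the `iyes_loopless` pattern being described looks roughly like this (the state enum is hypothetical):

```rust
use bevy::prelude::*;
use iyes_loopless::prelude::*;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum GameState {
    MainMenu,
    InGame,
}

// Queue a transition by inserting `NextState` via `Commands`,
// without ever reading `CurrentState<GameState>`:
fn start_game(mut commands: Commands) {
    commands.insert_resource(NextState(GameState::InGame));
}
```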
@Nilirad
Thanks for taking the time to review and write all this feedback.
> The implementation strategy and the appendix seem to focus much on the language features while leaving out the bigger picture. I also don't think that just a big nested list is the best thing for readability and understanding.
Can you clarify what you mean by "bigger picture"?
The big picture is how the design reduces the number of things you can't do (resolves the issues listed in the motivation section).
I wrote the impl list because cart asked us to "break [this design] up into conceptual pieces to be merged in a particular order, with a bias toward the smallest changes possible and a desire to merge small pieces into main as early as possible."
> The relationship between a system set and a schedule is unclear sometimes. It looks like there is only a single schedule in the world, but when you check out a system set from the `Systems` resource, there is a 1:1 correspondence.
Can you describe what was confusing? Are you asking like, "Why have two names if they're the 'same' thing?"
It's written that a schedule is the executable form of a system set, so they do have a 1:1 relationship. A schedule is a container that can hold and properly run the systems and conditions of a system set you described using the app builder API.
Idk how you got "single schedule in the world" though. Most systems might fall under the app update schedule, but there's no schedule limit. The doc does mention other schedules an ordinary app might have (state transitions, fixed timestep, etc.)
> Could a graph-based API solve the problems of graph unsolvability and order ambiguities at the root?
I'm not sure what you're asking. We can only try to build from what we're given. Errors and ambiguities exist because users can give invalid and incomplete properties. I don't believe there's an API that can avoid that.
> How are conditions applied once the schedule gets flattened? If I understood correctly, information about sets gets lost. So it looks to me that a condition that guards a set is cloned and applied to all the systems under its hierarchy.
The step-by-step process of how conditions would be evaluated is written in the appendix. You can also review the code from the linked prototype.
I think you misinterpreted flattening as permanently throwing information away. We wouldn't do that. We can take a dependency graph and just make another "flattened copy" of it.
Likewise, a condition attached to a system set wouldn't be duplicated to all the systems underneath. It's just the one instance you created. The executor will know to evaluate it before the relevant systems run and skip them all if it returns `false`. Duplicating it actually wouldn't behave correctly.
That said, I don't think it's necessary for this document to explain minute impl details like that or how properties are transformed into graphs, checked for errors, and then used to produce schedules you can run.
I'm happy to improve clarity, but imo this RFC is ready for cart's review.
@maniwani
> Idk how you got "single schedule in the world" though.
Hmm, probably residual knowledge from old versions of this RFC. I remember once finding some contradictions about the number of schedules, and that may have reinforced the assumption to the point where I didn't correctly understand the definition at the top.
> Can you clarify what you mean by "bigger picture"?
By bigger picture I meant a higher-level view of the implementation structure. That would primarily include a cohesive overview of what happens under the hood on the scheduling front after the user defines the `App` configuration.
> I think you misinterpreted flattening as permanently throwing information away.
Yep, I definitely thought that...
imo this RFC is ready for cart's review.
At this point I agree, the design is final and he definitely won't have problems understanding it.
@cart, I couldn't reply to one of your comments, so I'll just answer it here.
(Edit: moved comment to the original thread here.)
@maniwani
> @cart, I couldn't reply to one of your comments,
When comments on existing threads are a part of reviews, they cannot be directly replied to in the review. You can still reply in the original thread. Rather than moving this conversation to the "top level", for continuity can you paste your reply here?