
Raw table iteration to improve query iteration speed by bypassing change ticks

Open chengts95 opened this issue 1 month ago • 8 comments

What problem does this solve or what need does it fill?

Bevy’s query iteration does not benefit from AVX2 auto-vectorization by default, and the change-tick writes can consume significant memory bandwidth. While this may not matter much for a 60 fps graphics program, it severely affects performance in hot loops and in physics/game-logic simulations that rely on iterative algorithms or time acceleration (e.g., KSP, HoI4, Minecraft, and RimWorld; even though RimWorld is 2D, it still faces significant performance challenges, unlike typical 3D games).

In these games and applications, even at a 60 fps graphical frame rate, the computational workload per frame can be much higher because of simulation steps, time acceleration, and time compression. These applications, which rely on a high-performance ECS and iterative calculations, are the ones most in need of optimization. Unfortunately, the current Bevy architecture prevents the ECS from being exploited fully, making these demanding use cases difficult to optimize.

Notably, C and C++ ECS libraries such as Flecs and EnTT do not have this problem. An ECS should not only serve 60 fps games, so we should consider fixing these issues to support a more general data-oriented programming paradigm.

Currently, bevy_ecs exposes the necessary unsafe API through QueryState, and it should be straightforward to surface it as a safe/unsafe API on the standard Query. The change-tick problem, however, needs to be addressed carefully.

What solution would you like?

Add a Flecs-style table iteration API to Query. Change ticks could also be moved to the table level. In my view, the best way to detect changes is still an observer or a message bus. It may also be possible to enable change detection for specific components from a runtime plugin.
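
As a rough sketch, the user-facing side could look something like this (for_each_table is a hypothetical name and signature, not an existing bevy_ecs API; the point is that the callback receives whole columns as contiguous slices rather than one entity at a time):

// Hypothetical API sketch: the closure is invoked once per matched table and
// receives the matched columns as contiguous slices, so a hand-tuned SIMD
// kernel can run on them directly and no per-entity change ticks are written.
pub fn move_system(mut q: Query<(&mut Position, &Velocity)>) {
    q.for_each_table(|positions: &mut [Position], velocities: &[Velocity]| {
        for (p, v) in positions.iter_mut().zip(velocities) {
            p.x += v.x;
            p.y += v.y;
        }
    });
}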

What alternative(s) have you considered?

Use QueryState and manually iterate with unsafe code, manage change ticks ourselves, or fork bevy_ecs for these different demands.

Additional context

A SIMD parallel performance test is shown here to demonstrate the necessity.

chengts95 avatar Nov 16 '25 18:11 chengts95

Can you provide more context on what you're actually proposing here? What does "Add the Flecs style table iteration API to Query" concretely mean?

Is the problem specifically about change detection?

cBournhonesque avatar Nov 16 '25 20:11 cBournhonesque

Now, a move_system in Bevy can be implemented like this:

pub fn move_system_iter(mut q: Query<(&mut Position, &Velocity)>) {
    q.iter_mut().for_each(|(mut pos, v)| {
        pos.x += v.x;
        pos.y += v.y;
    })
}

This code cannot use AVX SIMD by default, and it writes change ticks, which consumes the CPU's limited memory bandwidth. The table iteration is currently implemented like this: it lets a SIMD kernel operate on the columns directly, and the change ticks are bypassed.

// Hand-tuned kernel: with AVX2 enabled for this function, the loop auto-vectorizes.
#[target_feature(enable = "avx2")]
fn move_execution(positions: &mut [Position], velocities: &[Velocity]) {
    for i in 0..positions.len() {
        positions[i].x += velocities[i].x;
        positions[i].y += velocities[i].y;
    }
}
// ----- Serial table-based (AVX2) system -----
#[allow(mutable_transmutes)]
pub fn move_system_tabled(world: &mut World, q: &mut QueryState<(&mut Position, &Velocity)>) {
    unsafe {
        let storages = world.storages();
        q.update_archetypes(world);
        let component_id = world.component_id::<Position>().unwrap();
        let component_id2 = world.component_id::<Velocity>().unwrap();
        for table_id in q.matched_tables() {
            let table = storages.tables.get(table_id).unwrap_unchecked();
            let positions = table
                .get_data_slice_for::<Position>(component_id)
                .unwrap_unchecked();
            let velocities = table
                .get_data_slice_for::<Velocity>(component_id2)
                .unwrap_unchecked();

            // Safety: the query requests &mut Position, so we have exclusive access
            // and the transmute to a mutable slice is sound.
            let positions = std::mem::transmute::<_, &mut [Position]>(positions);
            let velocities = std::mem::transmute::<_, &[Velocity]>(velocities);
            move_execution(positions, velocities);
        }
    }
}

The table-based version is 2x faster than the default version.

I hope this can be implemented in Query's API. However, because the change ticks are bypassed, a table-level change tick might be needed instead. I also hope there will be a way to disable change ticks entirely, since they are not needed for every component type.
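
For the tick-write part specifically, a partial workaround exists today: DetectChangesMut::bypass_change_detection on Mut<T> skips marking the component as changed, though the query still fetches the ticks and still iterates per entity. A minimal sketch, redefining the example components locally so it is self-contained:

use bevy_ecs::prelude::*;

#[derive(Component)]
struct Position { x: f32, y: f32 }
#[derive(Component)]
struct Velocity { x: f32, y: f32 }

// Skips only the change-tick write per mutable access; the ticks are still
// fetched by the query, so this does not recover the table-level speedup.
fn move_system_bypass(mut q: Query<(&mut Position, &Velocity)>) {
    q.iter_mut().for_each(|(mut pos, vel)| {
        let pos = pos.bypass_change_detection();
        pos.x += vel.x;
        pos.y += vel.y;
    });
}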

chengts95 avatar Nov 16 '25 21:11 chengts95

Can you provide more context on what you're actually proposing here? What does "Add the Flecs style table iteration API to Query" concretely mean?

Is the problem specifically about change detection?

Flecs uses:

void Move(ecs_iter_t *it) {
  Position *p = ecs_field(it, Position, 0);
  Velocity *v = ecs_field(it, Velocity, 1);

  for (int i = 0; i < it->count; i++) {
    p[i].x += v[i].x;
    p[i].y += v[i].y;
  }
}

This function is called once for each matched table.

chengts95 avatar Nov 16 '25 21:11 chengts95

This is a goal that I support, and it is closely related to #4882. I don't think either this issue or that one is sufficient on its own: query iteration shouldn't be slowed down by change ticks when they don't matter, and the memory footprint of pointless change ticks is still a concern even if they don't slow down query iteration.

That said, I am a bit nervous about the proposed implementation strategy. It's a sensible stop-gap, but it proposes a "slow-by-default" path, and more knobs for users to worry about when trying to optimize their queries. Instead, we should be looking for a solution that tries to analyze the user's intent / program structure, and then does this by default.

My initial thought is that we should be able to move where change ticks are stored, and then omit them completely for standard read-only queries (i.e. &C). That would require benchmarking, but shouldn't be too controversial. For mutable queries, I think it mostly comes down to solving #4882, but a type-level BypassChangeDetection<C> that does not fetch change ticks would also be relatively uncontroversial.
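
For illustration, usage of such a wrapper might look like this (BypassChangeDetection is purely hypothetical here and does not exist in bevy_ecs today):

// Hypothetical: BypassChangeDetection<C> would yield &mut C in the query item,
// and its fetch would never read or write the component's change ticks.
fn integrate(mut q: Query<(BypassChangeDetection<Position>, &Velocity)>) {
    for (pos, vel) in &mut q {
        pos.x += vel.x;
        pos.y += vel.y;
    }
}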

alice-i-cecile avatar Nov 16 '25 22:11 alice-i-cecile

I believe most frameworks do not have change ticks by default. The problem is that many existing systems now rely on the feature. If it is not easy to address, I would prefer some way to disable them for a separate world context. But I must point out that for_each is not enough for SIMD optimization; we must access the table columns directly to do that.

chengts95 avatar Nov 16 '25 22:11 chengts95

But I must point out that for_each is not enough for SIMD optimization, we must access the table column directly to do that.

That's interesting: there's no way to achieve this in the context of iterators? That lines up with my many-years-old experience with R, but I haven't devoted a lot of time to SIMD in Rust. I would be very interested in some reading material if you have some at hand.

@Jondolf may have useful insight here, I know that he's poked at this in the context of physics.

alice-i-cecile avatar Nov 16 '25 22:11 alice-i-cecile

But I must point out that for_each is not enough for SIMD optimization, we must access the table column directly to do that.

That's interesting: there's no way to achieve this in the context of iterators? That lines up with my many-years-old experience with R, but I haven't devoted a lot of time to SIMD in Rust. I would be very interested in some reading material if you have some at hand.

@Jondolf may have useful insight here, I know that he's poked at this in the context of physics.

I have posted the test repo and code that does the table iteration above. This feature is easy to implement with the current bevy_ecs API, since QueryState exposes matched_tables, but it is not available on Query (and bypassing change ticks may cause trouble). The problem is that for_each cannot emit AVX2 SIMD by default (that requires a global compile flag). Real-world optimized SIMD code needs manually tuned kernels, but for_each does not expose the underlying memory layout. Moreover, with the default for_each, #[target_feature(enable = "avx2")] cannot be placed on the IntoSystem impl or on the closure inside the for_each.
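
To make that constraint concrete, here is a standalone sketch (plain Rust, redefining the example components locally) of the usual pattern: the AVX2 path lives in its own #[target_feature] function that operates on slices, behind a runtime-detected dispatcher, which is exactly what a per-entity closure passed to for_each cannot express:

#[derive(Clone, Copy)]
struct Position { x: f32, y: f32 }
#[derive(Clone, Copy)]
struct Velocity { x: f32, y: f32 }

// The AVX2 kernel must be a standalone function: #[target_feature] cannot be
// applied to a closure, so the per-entity closure used by for_each cannot carry it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn move_kernel_avx2(positions: &mut [Position], velocities: &[Velocity]) {
    for (p, v) in positions.iter_mut().zip(velocities) {
        p.x += v.x;
        p.y += v.y;
    }
}

fn move_kernel_scalar(positions: &mut [Position], velocities: &[Velocity]) {
    for (p, v) in positions.iter_mut().zip(velocities) {
        p.x += v.x;
        p.y += v.y;
    }
}

// Runtime dispatch avoids relying on a global compile flag such as
// RUSTFLAGS="-C target-cpu=native".
fn move_kernel(positions: &mut [Position], velocities: &[Velocity]) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // Safety: AVX2 support was just detected at runtime.
            unsafe { move_kernel_avx2(positions, velocities) };
            return;
        }
    }
    move_kernel_scalar(positions, velocities)
}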

chengts95 avatar Nov 16 '25 22:11 chengts95

@Jondolf may have useful insight here, I know that he's poked at this in the context of physics.

Right, whoops I forgot to actually respond 😅

I don't have too much experience with auto-vectorization, since for the more performance-critical things I primarily need to use manual SIMD instructions and/or portable SIMD like core::simd (nightly) or the wide crate (stable). Though naturally, the memory layout and access patterns needed for this often work well with auto-vectorization too; for example, on unsupported SIMD targets the fallback of a plain [f32; 4] instead of f32x4 (i.e. __m128 on x86) tends to work pretty well.
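
As a minimal sketch of that fallback layout (hypothetical batch types, not from any particular crate), processing four lanes per element with plain [f32; 4] fields tends to auto-vectorize well even without explicit intrinsics:

// Hypothetical AoSoA-style batches: four entities per element, stored as
// plain arrays so the inner loop maps cleanly onto 128-bit SIMD lanes.
#[derive(Clone, Copy)]
struct PositionX4 { x: [f32; 4], y: [f32; 4] }
#[derive(Clone, Copy)]
struct VelocityX4 { x: [f32; 4], y: [f32; 4] }

fn integrate_batches(positions: &mut [PositionX4], velocities: &[VelocityX4]) {
    for (p, v) in positions.iter_mut().zip(velocities) {
        for lane in 0..4 {
            p.x[lane] += v.x[lane];
            p.y[lane] += v.y[lane];
        }
    }
}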

But yeah, that's largely unrelated to this issue. I can definitely see improved query iteration speed being useful for a few physics things, like velocity integration and position integration, as shown in the example earlier in this issue. These are mostly trivial and highly parallelizable loops over a ton of entities, and for that use case you also don't really want change ticks, since these loops run for all active rigid bodies every frame anyway and other systems also change the values afterwards. So I'm definitely interested in this work!

Jondolf avatar Nov 30 '25 20:11 Jondolf