bevy icon indicating copy to clipboard operation
bevy copied to clipboard

Split `GenericTypeCell::get_or_insert` into smaller pieces

Open bushrat011899 opened this issue 1 year ago • 5 comments

Objective

Based on the discussion in #14864, I wanted to experiment with the core GenericTypeCell type, whose get_or_insert method accounted for 2% of the final binary size of the 3d_scene example. The reason for this large percentage is likely because the type is fundamental to the rest of Bevy while having 3 generic parameters (the type stored T, the type to retrieve G, and the function used to insert a new value F).

  • Acts on #14864

Solution

  • Split get_or_insert into smaller functions with minimised parameterisation. These new functions are private as to preserve the public facing API, but could be exposed if desired.

Testing

  • Ran CI locally.
  • Used cargo bloat --release --example 3d_scene -n 100000 --message-format json > out.json and @cart's bloat analyzer to measure a 428KiB reduction in binary size when compiling on Windows 10.
  • ~I have not benchmarked to determine if this improves/hurts performance.~ See below.

Notes

In my opinion this seems like a good test-case for the concept of debloating generics within the Bevy codebase. I believe the performance impact here is negligible in either direction (at runtime and compile time), but the binary reduction is measurable and quite significant for a relatively minor change in code.

bushrat011899 avatar Aug 22 '24 07:08 bushrat011899

Happy to merge once there's a benchmark testing the effect :)

alice-i-cecile avatar Aug 22 '24 16:08 alice-i-cecile

(@MrGVSV's comments should be resolved before merging)

cart avatar Aug 22 '24 19:08 cart

Given that these are private functions without explicit inlining directives, I believe the compiler is free to choose if/how to inline these? Clearly its not inlining everything given the size reduction, but I wonder if explicit #[inline(never)] annotations would have an additional effect on size.

I experimented with adding inline(never) and it made zero difference to the final binary size, so I suspect the compiler is already choosing to not inline these functions. Intuitively that makes sense to me, since to the compiler it looks like these functions are being called thousands of times across the code base (due to all the different generic arguments), so inlining would represent increased code duplication.

bushrat011899 avatar Aug 22 '24 23:08 bushrat011899

Some further data:

Cargo Bloat Before: Total Size: 34326KiB
Total Size: 34326KiB
3.27%    1121KiB:        core::ops::function::FnOnce::call_once
2.70%    926KiB:         hashbrown::raw::inner::RawTable<T,A>::reserve_rehash
2.26%    776KiB:         bevy_reflect::utility::GenericTypeCell<T>::get_or_insert
1.74%    596KiB:         std::sync::once::Once::call_once_force::{{closure}}
1.08%    371KiB:         bevy_ecs::query::state::QueryState<D,F>::new_with_access
1.07%    368KiB:         core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut
0.91%    312KiB:         <dyn bevy_reflect::reflect::Reflect>::take
0.90%    310KiB:         <bevy_ecs::change_detection::ResMut<T> as bevy_ecs::system::system_param::SystemParam>::init_state
0.86%    295KiB:         bevy_ecs::system::system::System::run
0.80%    273KiB:         <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_struct
0.71%    243KiB:         bevy_asset::server::impl$0::load_internal::async_fn$0
0.71%    242KiB:         bevy_ecs::reflect::from_reflect_with_fallback
0.70%    241KiB:         <bevy_ecs::change_detection::Res<T> as bevy_ecs::system::system_param::SystemParam>::init_state
0.69%    237KiB:         bevy_reflect::type_registry::TypeRegistry::register
0.68%    234KiB:         <bevy_ecs::system::function_system::FunctionSystem<Marker,F> as bevy_ecs::system::system::System>::run_unsafe
0.67%    231KiB:         bevy_reflect::struct_trait::StructInfo::new
0.64%    220KiB:         bevy_reflect::enums::enum_trait::EnumInfo::new
0.56%    193KiB:         <bevy_ecs::system::query::Query<D,F> as bevy_ecs::system::system_param::SystemParam>::init_state
0.56%    190KiB:         <&T as core::fmt::Debug>::fmt
0.52%    178KiB:         <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
0.50%    171KiB:         bevy_reflect::reflect::PartialReflect::apply
0.50%    170KiB:         <bevy_ecs::system::function_system::FunctionSystem<Marker,F> as bevy_ecs::system::system::System>::initialize
0.49%    168KiB:         alloc::sync::Arc<T,A>::drop_slow
0.46%    158KiB:         futures_lite::future::block_on
0.46%    157KiB:         bevy_ecs::bundle::Bundles::init_info
Cargo Bloat After: Total Size: 33961KiB
Total Size: 33961KiB
3.25%    1102KiB:        core::ops::function::FnOnce::call_once
2.72%    922KiB:         hashbrown::raw::inner::RawTable<T,A>::reserve_rehash
1.67%    567KiB:         std::sync::once::Once::call_once_force::{{closure}}
1.10%    374KiB:         core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut
1.10%    371KiB:         bevy_ecs::query::state::QueryState<D,F>::new_with_access
1.00%    339KiB:         bevy_reflect::utility::GenericTypeCell<T>::get_or_insert_by_type_id
0.92%    313KiB:         <dyn bevy_reflect::reflect::Reflect>::take
0.91%    308KiB:         <bevy_ecs::change_detection::ResMut<T> as bevy_ecs::system::system_param::SystemParam>::init_state
0.86%    290KiB:         bevy_ecs::system::system::System::run
0.81%    273KiB:         <&mut serde_json::de::Deserializer<R> as serde::de::Deserializer>::deserialize_struct
0.76%    259KiB:         bevy_reflect::struct_trait::StructInfo::new
0.72%    243KiB:         bevy_ecs::reflect::from_reflect_with_fallback
0.71%    242KiB:         <bevy_ecs::change_detection::Res<T> as bevy_ecs::system::system_param::SystemParam>::init_state
0.69%    235KiB:         bevy_reflect::type_registry::TypeRegistry::register
0.68%    231KiB:         <bevy_ecs::system::function_system::FunctionSystem<Marker,F> as bevy_ecs::system::system::System>::run_unsafe
0.60%    202KiB:         bevy_reflect::enums::enum_trait::EnumInfo::new
0.56%    191KiB:         <bevy_ecs::system::query::Query<D,F> as bevy_ecs::system::system_param::SystemParam>::init_state
0.55%    188KiB:         <&T as core::fmt::Debug>::fmt
0.53%    180KiB:         <bevy_ecs::system::function_system::FunctionSystem<Marker,F> as bevy_ecs::system::system::System>::initialize
0.53%    178KiB:         <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
0.51%    172KiB:         bevy_reflect::reflect::PartialReflect::apply
0.50%    169KiB:         alloc::sync::Arc<T,A>::drop_slow
0.48%    164KiB:         futures_lite::future::block_on
0.47%    160KiB:         bevy_ecs::bundle::Bundles::init_info

Final size difference for the 3d_scene example was 365KiB in the .text section, and we can see that where get_or_insert was the 3rd largest contributor to binary bloat before, get_or_insert_by_type_id drops down to 6th place. I'm going to try running cargo bench before and after the change as well just to verify everyone's suspicions that this has negligible performance impact.

bushrat011899 avatar Aug 22 '24 23:08 bushrat011899

Ok so here's what I did:

  1. Switched to current main branch
  2. Ran cargo bench from the /benches directory
  3. Switched to this PR branch
  4. Ran cargo bench again

I did this on a Windows 10 desktop with an AMD 5950X CPU, 64GB of RAM, and an AMD 5700 GPU.

I have the Criterion reports, but they're a lot of data (150MB for the folder) so I don't know how easy it will be to share, so I'll do my best to summarise what I can gleam below. I have also uploaded a copy of the reports only (not the raw data or JSON) to Google Drive if anyone wants to see my results.

  • 391 benchmarks showed a performance improvement, while 375 showed a performance regression, of any size.
  • The vast majority of benchmarks reported changes in the sub-5% range, which I think are negligible. Criterion will confirm these are statistically significant results but they could be caused by anything (e.g., binary layout changes, etc.)
  • A minority of benchmarks (44 in the improvement group and 37 in the regression group) report a 10% to 75% change. I'm not confident these can be ignored, but I'm suspicious this PR could have any measurable performance impact.

I'm not an expert in benchmarking by any stretch of the imagination so anyone more confident than me is welcome to provide a better interpretation (please!), but to my eyes this looks like a noisy wash. If this had a meaningful impact on performance I would expect the benchmarks to skew in one direction or the other, but instead they seem split down the middle.

bushrat011899 avatar Aug 23 '24 02:08 bushrat011899