
Specialize write_into for trivially memcpy-able types

james7132 opened this issue · 4 comments

Right now, encase recursively calls write_into when serializing types into GPU memory. This typically follows the flow: copy a field/value -> advance the writer by its padding -> repeat until done.

This unfortunately breaks vectorization when copying the data. Larger structs with heavily recursive types, like 4x4 matrices, end up with tens or hundreds of these steps when they could instead be directly memcpy-ed into the target buffer.

Any type that has a fixed size at runtime and no additional padding is trivially memcpy-able into and out of GPU buffers. Similarly, arrays, slices, and Vecs of these types can also be batch memcpy-ed where applicable.

This information is statically available at compile time in a type's METADATA, and `if` statements on constant expressions will have their unused branch optimized out. This should be doable even without compiler support for specialization.
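As a rough sketch of the idea (a hypothetical trait and const, not encase's actual API):

```rust
// A minimal sketch, assuming a hypothetical trait: a metadata-derived
// associated const selects the fast path, and because the condition is
// constant the compiler drops the unused branch entirely.
trait GpuWrite: Sized {
    /// True only when the type has a compile-time fixed size and no
    /// internal padding, so its host layout matches the GPU layout
    /// byte for byte (this is what makes the raw copy below sound).
    const TRIVIALLY_COPYABLE: bool;

    /// The existing recursive, field-by-field path.
    fn write_fields(&self, out: &mut Vec<u8>);

    fn write_into(&self, out: &mut Vec<u8>) {
        if Self::TRIVIALLY_COPYABLE {
            // One bulk copy of the whole value instead of N field writes.
            let bytes = unsafe {
                core::slice::from_raw_parts(
                    (self as *const Self).cast::<u8>(),
                    core::mem::size_of::<Self>(),
                )
            };
            out.extend_from_slice(bytes);
        } else {
            self.write_fields(out);
        }
    }
}

// An unpadded #[repr(C)] vector opts into the memcpy path.
#[repr(C)]
struct Vec4 { x: f32, y: f32, z: f32, w: f32 }

impl GpuWrite for Vec4 {
    const TRIVIALLY_COPYABLE: bool = true;
    fn write_fields(&self, out: &mut Vec<u8>) {
        for c in [self.x, self.y, self.z, self.w] {
            out.extend_from_slice(&c.to_ne_bytes());
        }
    }
}
```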

james7132 avatar Sep 21 '23 11:09 james7132

This was discussed on the Bevy Discord (link). A summary of the codegen analysis:

  • encase already does a capacity check only once per write: https://github.com/teoxoy/encase/blob/308bb7248f052a16604e8bf77a7118bea05bf9ea/src/core/rw.rs#L26
  • ArrayMetadata's accessors are not inlined, even with the most aggressive compilation settings.
    • Experimental change: Commit
    • Change in codegen: diff
    • Observed result: Metadata not being accessed causes unsupported use cases (i.e. certain structs in uniform buffers) to collapse into the panic, resulting in a lot more codegen than necessary, though this is probably not actively impacting hot-path performance.
    • Additional note: this probably applies to the other metadata types as well, even though they're all const.
  • SliceExt::array and SliceExt::array_mut both use checked conversions into &mut [T] on the target subslice, which results in a lot of extra branching. I attempted to replace this with an unchecked conversion instead (see the sketch after this list).
    • Experimental change: commit
    • Change in codegen: diff
    • Observed result: All of the branches disappeared. The copy still isn't vectorized, though it should be, but there are no unnecessary branches anymore.
    • Additional note: This implementation is unsound, as there is no capacity check when converting to the array. Either the unsafe needs to be lifted out and documented as an invariant (basically treating the slice as a raw pointer), or we need to find another way to avoid the branch.
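For reference, this is roughly the shape of the checked vs. unchecked conversion (hypothetical free functions, not the actual SliceExt implementation):

```rust
// Illustrative only: hypothetical free functions, not encase's SliceExt.
fn array_checked<T, const N: usize>(slice: &[T], offset: usize) -> &[T; N] {
    // Two branch points: the range index can panic, and `try_into`
    // re-checks the length before producing the array reference.
    slice[offset..offset + N].try_into().unwrap()
}

/// # Safety
/// The caller must guarantee `offset + N <= slice.len()`.
unsafe fn array_unchecked<T, const N: usize>(slice: &[T], offset: usize) -> &[T; N] {
    // No checks emitted; soundness rests entirely on the invariant above.
    unsafe { &*(slice.as_ptr().add(offset) as *const [T; N]) }
}
```

The unchecked variant trades the per-call branches for an invariant the caller must uphold, which is exactly why it's unsound without a capacity check somewhere above it.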

TODO: Actually benchmark the changes here to see if the gains are significant enough to warrant these kinds of changes.

james7132 avatar Sep 22 '23 10:09 james7132

One other potential middle ground is to change the vector and matrix implementations to directly copy their bytes instead of relying on the underlying components' implementations. This would eliminate the vast majority of the branches being produced, potentially allow for vectorized copies, and avoid the need for infectious use of unsafe.
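Something along these lines (stand-in types, not encase's actual impls), with the unsafe confined to the leaf type instead of spreading through the writer:

```rust
// A rough sketch with a stand-in type: the matrix copies its 64 bytes
// in one shot instead of recursing into 16 per-component writes.
#[repr(C)]
struct Mat4 {
    cols: [[f32; 4]; 4],
}

impl Mat4 {
    fn write_into(&self, out: &mut Vec<u8>) {
        // f32 has no padding and the struct is #[repr(C)], so the whole
        // value is one contiguous, copyable block.
        let bytes = unsafe {
            core::slice::from_raw_parts(
                (self as *const Self).cast::<u8>(),
                core::mem::size_of::<Self>(),
            )
        };
        out.extend_from_slice(bytes);
    }
}
```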

james7132 avatar Sep 25 '23 21:09 james7132

This is great stuff, thanks for looking into it!

I think the most promising optimization here would be using unchecked versions of SliceExt::array and SliceExt::array_mut, making all read and write methods unsafe, and making sure we always check the bounds in the buffer wrappers, which are the API most users will interact with anyway. A PR doing this would be welcome, but I'm curious how much perf we'll get from this; we should benchmark it.
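Roughly the shape I have in mind (hypothetical, simplified names rather than the current API):

```rust
// Sketch: the inner write is unsafe and unchecked, and the public
// wrapper does the single bounds check users actually rely on.
struct Writer<'a> {
    buf: &'a mut [u8],
    offset: usize,
}

impl Writer<'_> {
    /// # Safety
    /// Caller must ensure `self.offset + bytes.len() <= self.buf.len()`.
    unsafe fn write_unchecked(&mut self, bytes: &[u8]) {
        let end = self.offset + bytes.len();
        unsafe {
            self.buf
                .get_unchecked_mut(self.offset..end)
                .copy_from_slice(bytes);
        }
        self.offset = end;
    }
}

struct UniformBuffer<'a> {
    writer: Writer<'a>,
}

impl UniformBuffer<'_> {
    /// The public entry point checks capacity once up front, making the
    /// unchecked fast path sound for everything written through it.
    fn write(&mut self, bytes: &[u8]) -> Result<(), &'static str> {
        let end = self
            .writer
            .offset
            .checked_add(bytes.len())
            .ok_or("offset overflow")?;
        if end > self.writer.buf.len() {
            return Err("buffer too small");
        }
        unsafe { self.writer.write_unchecked(bytes) };
        Ok(())
    }
}
```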

How did you generate the codegen in the diffs above? With the unchecked change, I think those copies should have been vectorized.

Regarding const fns not being inlined, that's odd to me; I thought all const fns would be evaluated at compile time and their results inlined. If you could open a PR that adds the #[inline] attribute to all const fns, that would be great!

teoxoy avatar Nov 26 '23 13:11 teoxoy

How did you generate the codegen in the diffs above? With the unchecked change, I think those copies should have been vectorized.

It's honestly sort of jank. I compile with cargo, using rustc flags to make it output the *.asm files, then run rustfilt to pretty them up a bit, and use git to track the diff in a nice interface. I'd use something like Godbolt, but setting up my own server with the Rust crates I want tends to be a lot more effort.

Regarding const fns not being inlined, that's odd to me; I thought all const fns would be evaluated at compile time and their results inlined. If you could open a PR that adds the #[inline] attribute to all const fns, that would be great!

If the entire expression is constant (inputs, outputs, static dependencies), the result will be computed at compile time; but when called in a non-const context, a const fn is treated as a normal function, including not being inlined if the function is considered big enough.
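A tiny illustration of the difference (not encase code):

```rust
// A const fn used in both a const and a non-const context.
#[inline]
const fn stride() -> usize {
    core::mem::size_of::<[f32; 4]>()
}

// Const context: evaluated during compilation; no call exists at runtime.
const STRIDE: usize = stride();

// Non-const context: an ordinary call that the optimizer may or may not
// inline; a sufficiently large const fn can be left as a real call, which
// is why the #[inline] hint still matters here.
fn offset_of(index: usize) -> usize {
    index * stride()
}

fn main() {
    assert_eq!(offset_of(2), 2 * STRIDE);
}
```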

james7132 avatar Feb 09 '24 04:02 james7132