wgpu icon indicating copy to clipboard operation
wgpu copied to clipboard

Array Stride Mismatch in vec3-padded Struct

Open froody opened this issue 1 year ago • 10 comments

Description Running a compute shader with an input of an array of struct of 20 bytes will only see the first 20/32 entries.

Repro steps

  1. Clone https://github.com/froody/wgpu/tree/compute-bug
  2. cd examples
  3. cargo run hello_compute

Expected vs observed behavior Observed:

result: 189
result with vertex 20: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 0, 0, 0, 0, 0, 0, 0]
result with vertex 32: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Expected:

result: 189
result with vertex 20: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
result with vertex 32: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Extra materials Screenshots to help explain your problem. Validation logs can be attached in case there are warnings and errors. Zip-compressed API traces and GPU captures can also land here.

Platform MacOS 14.3 (23D56), MacBookPro18,2

froody avatar Feb 18 '24 04:02 froody

you have too much padding? you only need to pad to 16 bytes, not 20 bytes

struct Vertex20 {
    values: vec3<u32>,
    padding: f32,
    padding2: u32,
};

cybersoulK avatar Feb 18 '24 05:02 cybersoulK

@cybersoulK this is just a proof of the bug, In my actual use case I need 20 bytes of data per vertex

froody avatar Feb 18 '24 05:02 froody

then just rename them to fields: [

struct Vertex20 {
    values: [f32; 3],
    field1: f32,
    field2: u32,
    _padding: [f32; 3]
};

you need to add actual padding to be alligned to 16

cybersoulK avatar Feb 18 '24 05:02 cybersoulK

the previous is the cpu side. in wgsl shader you just need:

struct Vertex20 {
    values: vec3<u32>,
    field1: f32,
    field2: u32,
};

(they automatically pad it)

cybersoulK avatar Feb 18 '24 05:02 cybersoulK

to be clear, only Uniform / Storage buffer require alignment to 16 bytes

there is: https://github.com/teoxoy/encase that replaces bytemuck and creates buffers that matches exactly to WGSL, without needing manual padding. This works by adding a derive macro behind your structs.

cybersoulK avatar Feb 18 '24 05:02 cybersoulK

@cybersoulK can you prove there's automatic padding injected for (vec3<u32>,f32, u32)? In the example I provided it's very clear that the shader is striding 20 bytes per index into v20. If the Vertex20 struct in the shader was being padded then I would expect garbage values at non-aligned locations, e.g:

result with vertex 20: [0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0]
result with vertex 32: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

It seems like the decision for how much buffer to map is being made earlier on and isn't actually related to how the GPU is interpreting the buffer (at least on Metal).

froody avatar Feb 18 '24 06:02 froody

https://gist.github.com/froody/f1d4ec656a2110191ea1618187806ba1

froody avatar Feb 18 '24 06:02 froody

The issue here is as follows:

struct Inner { 
    vec: vec3<u32>,
    scalar1: u32,
    scalar2: u32
}

This struct is supposed to be size 32, alignment 16. However, when accessed through an array in metal, the stride shows itself to be 24. I suspect packed_uint3 doesn't have the alignment requirements?

struct Inner {
    metal::packed_uint3 vec;
    float scalar1;
    uint scalar2;
};

Unmark the test as failing on metal in #5264 to see the behavior.

cwfitzgerald avatar Feb 18 '24 07:02 cwfitzgerald

Yeah, https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf#page=31&zoom=auto,-462,785

cwfitzgerald avatar Feb 19 '24 05:02 cwfitzgerald

Probably just need to insert an alignas(16) when we're using packed_uint3

cwfitzgerald avatar Feb 19 '24 05:02 cwfitzgerald