wgpu-py icon indicating copy to clipboard operation
wgpu-py copied to clipboard

create_buffer_with_data requires data to be a multiple of COPY_BUFFER_ALIGNMENT

Open fyellin opened this issue 1 year ago • 11 comments

The helper function create_buffer_with_data gives an error if the data it passes is not a multiple of 4.

Here is a reasonable use case for why I would want to do that:

I have an external model that includes vertices, normals, and indices of triangles. These models are often written by others

     vertices_array = np.array(model.vertices, dtype=np.float32)
     vertices_buffer = device.create_buffer_with_data(vertices_array, usage=BufferUsage.VERTEX)
      . . .
     indices_array = np.array(model.indices, dtype=uint16)
     indices_buffer = device.create_buffer_with_data(indices_array, usage=BufferUsage.INDEX)
      . . . 
     encoder.set_index_buffer(indices_buffer, "uint16")
     encoder.draw_index(len(indices_array))

This works fine, unless the model happens to have an odd number of indexes. In that case, the byte size of indices_array will not be a multiple of 4, and the code fails.

There are solutions around the problem, but they are not always pretty:

  1. I can test the length of model.indices and add a junk element to the end if it is odd. But note that this information belongs to someone else, and may be readonly.
  2. I can just avoid using create_buffer_with_data, but perform the mapping myself.

But since create_buffer_with_data is a convenience function and not subject to a WebGPU specification, I think it's reasonable for it to always round up the size to a multiple of 4. This could either (1) always be done, since it is invisible to the user or (2) be done if the user specifies allows_padding or something like that, indicating that they are aware of the problem.

fyellin avatar Jun 03 '24 20:06 fyellin

I believe this should not be handled by wgpu-py. It is the responsibility of the user program to manage the byte data written to the buffer, including byte alignment, padding, and other considerations. Since this is a very low-level API, it should not make any assumptions about the content written to the buffer nor be responsible for it. For example, vertex attributes need to be aligned to 4 bytes, while storage buffers and uniform buffers require 16-byte alignment (and their structure members also needing alignment according to the WebGPU specification). These responsibilities should be managed by the user program or downstream engines (such as libraries like pygfx).

Let's see what Almar thinks. :)

panxinmiao avatar Jun 04 '24 03:06 panxinmiao

May I suggest a different fix that's more in the spirit of the API? Add a bytes argument. If omitted, it defaults to the size of the memory view. It must always be larger.

This would let me use create_buffer_with_data in cases where I have gotten the info from elsewhere and can't grow it.

fyellin avatar Jun 04 '24 05:06 fyellin

I realize that this is a different implementation, but on my browser I tried:

> x = device.createBuffer({size: 6, usage: GPUBufferUsage.VERTEX, mappedAtCreation: true})

and it returned me a buffer, without complaint.

fyellin avatar Jun 04 '24 05:06 fyellin

I realize that this is a different implementation, but on my browser I tried:

> x = device.createBuffer({size: 6, usage: GPUBufferUsage.VERTEX, mappedAtCreation: true})

and it returned me a buffer, without complaint.

Interesting, but the WebGPU specification explicitly states that this is incorrect.

See: https://www.w3.org/TR/webgpu/#dom-gpudevice-createbuffer

image

Moreover, when calling writeBuffer(), it is also required that the number of bytes written must be a multiple of 4.

panxinmiao avatar Jun 04 '24 07:06 panxinmiao

I did a quick test on Chrome browser. image

While the buffer object can be created, continuing to call the getMappedRange() method throws an error. I suspect that before calling getMappedRange(), it is only a JavaScript object and has not actually created a GPU object. However, regardless, I feel that Chrome's implementation of this API may not fully comply with the specification. 😅

panxinmiao avatar Jun 04 '24 07:06 panxinmiao

I agree with @panxinmiao that alignment is ultimately the caller's responsibility. But I can also see that in a method like create_buffer_with_data() we can provide some convenience to make it easier for the user to manage alignment. My proposal would be a align_bytes arg that can be set to e.g. 4 or 16. Does that make sense @fyellin ?

almarklein avatar Jun 04 '24 09:06 almarklein

Sure. That sounds reasonable. My original proposal, I suppose, was just to make align_bytes=4 be the default. But I like the idea of making the user be responsible for it.

I can implement this, if you want.

fyellin avatar Jun 04 '24 18:06 fyellin

I can implement this, if you want.

That would be great! 🙏

to make align_bytes=4 be the default.

Mmm, it's tempting, but I think better that the user must explicitly set it to use alignment. That way it's explicit that something related to alignment is happening. And easier for the user to spot why this works, and manual mapping does not.

almarklein avatar Jun 05 '24 09:06 almarklein

I still have doubts about whether this approach brings more benefits or more confusion.

In WebGPU, almost all user-provided data needs to be laid out in memory to match the definitions in the shaders. It’s not just a simple matter of padding to 4-byte alignment at the end.

Our API should not make any assumptions about the data provided by the user, nor should it be responsible for doing so—in fact, it can't be. The caller must ensure that the data is correctly laid out in memory; otherwise, even if no exception is thrown, the results will not be correct.

In reality, if the number of bytes of the data provided by the user is not a multiple of four, it’s evident that the user has not handled the data layout correctly, or there is an error in the data layout or format. In such cases, throwing an error early might actually be a better approach.

In the example you provided, for vertex data in the float32 format, it just so happens that no memory layout adjustments are needed. This is because when creating the rendering pipeline, there is a specific vertex data memory layout description (through properties like array_stride, offset, etc.). Typically, for float32x3 vertex position data, you might configure the vertex layout description like this:

{
    "array_stride": 3 * 4,
    "step_mode": "vertex",
    "attributes": [
        {"shader_location": 0, "offset": 0, "format": "float32x3"}  // position
    ]
}

This coincidentally matches the memory layout of an Nx3 float32 numpy array, and that’s all.

However, if you want to place Nx3 float32 data into a uniform buffer or storage buffer (corresponding to array<vec3f> in WSGL), Or transmit any other structure, you won’t be as fortunate. You must handle the alignment of each array element or structure member and fill the gaps according to W3C requirements. It requires 16 byte alignment, so the buffer you passed should roughly as follows:

expanded_array = np.concatenate((positions, np.zeros((N, 1),dtype=np.float32)), axis=1)

In fact, I feel that the only useful scenario for adding the (align=4) parameter is when creating an index buffer in uint16 format. In this single case, if the original buffer size is not a multiple of four, the API automatically pads it to 4-byte alignment, resulting in the correct outcome. In all other cases, if your original buffer is not a multiple of four bytes, it’s clear that there is an issue with your original data layout, and the result is most likely incorrect.

panxinmiao avatar Jun 06 '24 09:06 panxinmiao

@panxinmiao. You are correct. The only use case I have is an index buffer. Programs that generate wire-frame models don't necessarily know that they are producing them for a GPU, and may not pad them appropriately. This is the one case where the user is using external data rather than arrays that they are creating themselves.

fyellin avatar Jun 06 '24 17:06 fyellin

Good point @panxinmiao !

almarklein avatar Jun 07 '24 11:06 almarklein

I think I'm going to re-open this. Over the weekend I looked at the C WGPU API and they handle this situation. The documentation is that it gets rounded up to a multiple of 4. Turns out to be easy to do.

fyellin avatar Oct 21 '24 06:10 fyellin