cl_khr_extended_async_copies cannot deal with arbitrary strides.

Open MeyeChris opened this issue 3 years ago • 1 comments

Both async_work_group_copy_2D2D and async_work_group_copy_3D3D measure both num_elements_per_line and [src/dst]_total_line_length in elements.

This means the data length and the strides when measured in bytes must be a multiple of the num_bytes_per_element.

For the data length this is natural, and it is difficult to see how it could not be true. But for the stride it can easily be false.

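Take for example an element size of 3 bytes, where each row needs to be aligned on an 8-byte boundary. If the data length is 3 elements (9 bytes), then a stride of 16 bytes (16 - 9 = 7 bytes of padding) cannot be expressed in elements, because 16 is not a multiple of 3. One would need to make the stride 24 bytes (8 elements), which leaves 24 - 9 = 15 bytes (5 elements) of padding.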
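The arithmetic in this example can be checked with a small host-side C sketch. The helper names (`gcd_size`, `round_up`, `stride_is_expressible`, `smallest_element_stride_bytes`) are hypothetical and are not part of the extension; the sketch only assumes that a usable stride must be a multiple of both the row alignment and num_bytes_per_element.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor (Euclid's algorithm). */
static size_t gcd_size(size_t a, size_t b) {
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Round value up to the next multiple of step. */
static size_t round_up(size_t value, size_t step) {
    assert(step > 0);
    return (value + step - 1) / step * step;
}

/* Returns 1 if a stride given in bytes is a whole number of elements. */
static int stride_is_expressible(size_t stride_bytes, size_t bytes_per_element) {
    assert(bytes_per_element > 0);
    return stride_bytes % bytes_per_element == 0;
}

/* Smallest stride (in bytes) that holds row_bytes, respects the alignment,
 * and is a whole number of elements: a multiple of lcm(element, alignment). */
static size_t smallest_element_stride_bytes(size_t row_bytes,
                                            size_t bytes_per_element,
                                            size_t alignment) {
    assert(bytes_per_element > 0 && alignment > 0);
    size_t lcm = bytes_per_element / gcd_size(bytes_per_element, alignment) * alignment;
    return round_up(row_bytes, lcm);
}
```

With 3-byte elements, 9-byte rows and 8-byte alignment, `round_up(9, 8)` gives the natural 16-byte stride, `stride_is_expressible(16, 3)` reports that it cannot be written in elements, and `smallest_element_stride_bytes(9, 3, 8)` gives 24 bytes, i.e. 8 elements.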

So it can be expressed - wastefully - but often the input data is already laid out with a fixed stride.

The net effect is that people just convert everything into bytes and think in bytes (even for the power-of-2 cases).

Stride would much more naturally be expressed in bytes - it is simply the distance between row starts.

Everything relating to line length must apply to plane area as well.

Nov 29 '22 17:11 MeyeChris

Hi @MeyeChris,

I randomly passed by this and saw your message. I was the one who motivated and suggested the spec change (from gentype-based to element-based) at the time.

Regarding your concern "Why were strides expressed in elements and not in bytes?": please recall that the strides in the async_work_group_strided_copy() function were also expressed in gentype elements and not in bytes. In this new 2D/3D spec, the expression {src_total_line_length * num_bytes_per_element} is simply a generalized (and more generic) form of gentype-based strides. For example, we can copy and skip elements of any type, e.g. float19 (not a supported gentype), by setting num_bytes_per_element = 19*sizeof(float).

In your use case, where there is padding between rows, thinking in bytes would be a solution:

size_t num_bytes_per_element = 1;
size_t num_elements_per_line = 9;
size_t src_total_line_length = 16;       // 16 x 1-byte-elements between row starts
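As a rough illustration of why these settings work, here is a host-side C sketch that models an element-based 2D copy with a plain memcpy loop. `copy_2d_reference` and `demo_byte_scheme` are hypothetical names; this is not the extension's implementation, it is synchronous, and it leaves out the offset and event parameters of async_work_group_copy_2D2D. It assumes only that consecutive line starts are total_line_length * num_bytes_per_element bytes apart.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reference model: copies num_lines lines of
 * num_elements_per_line elements each; line starts are
 * total_line_length elements apart on each side. */
static void copy_2d_reference(void *dst, const void *src,
                              size_t num_bytes_per_element,
                              size_t num_elements_per_line,
                              size_t num_lines,
                              size_t src_total_line_length,
                              size_t dst_total_line_length) {
    assert(src_total_line_length >= num_elements_per_line);
    assert(dst_total_line_length >= num_elements_per_line);
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t line_bytes = num_elements_per_line * num_bytes_per_element;
    for (size_t line = 0; line < num_lines; ++line) {
        memcpy(d + line * dst_total_line_length * num_bytes_per_element,
               s + line * src_total_line_length * num_bytes_per_element,
               line_bytes);
    }
}

/* Byte scheme: 1-byte elements, 9 payload bytes per row, source rows
 * 16 bytes apart, destination rows packed. Returns 1 if the payload
 * of both rows arrives intact. */
static int demo_byte_scheme(void) {
    unsigned char src[32];
    unsigned char dst[18];
    for (size_t i = 0; i < sizeof src; ++i) {
        src[i] = (unsigned char)i;
    }
    memset(dst, 0xFF, sizeof dst);
    copy_2d_reference(dst, src, 1, 9, 2, 16, 9);
    for (size_t i = 0; i < 9; ++i) {
        if (dst[i] != (unsigned char)i) return 0;             /* row 0 */
        if (dst[9 + i] != (unsigned char)(16 + i)) return 0;  /* row 1 */
    }
    return 1;
}
```

With num_bytes_per_element = 1 the 16-byte stride is simply 16 elements, so no stride is out of reach; the cost is that the call no longer records the 3-byte element type.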

Also, you raised a good point on "Why don't we just convert everything into bytes?". The cudaMemcpy2D() family of functions does that, and relies on developers to provide correct information about their 2D grid partitioning, in bytes. There are two camps here, and the better one may depend on people's preference. I personally think that the element scheme gives a good tradeoff to application developers, while it can still be reduced to the byte scheme quite easily if necessary.

Aug 22 '24 23:08 hominhquan