ggml Support for convolution row sizes undivisible by 32 in `ggml_conv_2d_sk

Rounding up the size of the convolution row in ggml_compute_forward_conv_2d_sk_p0_f16_f32 yields NaN values with kernel_size = 14, which is first reported for large CLIP models in monatis/clip.cpp#9. I debugged it in monatis/clip.cpp#11 and found out that the first NaN value comes into being when the size of the convolution row is rounded up to the multiples of 32, and then it propogates through all the calculations, leading to a all-NaN tensor.

https://github.com/ggerganov/ggml/blob/0a63fc0f6cb1915d1fa5c62c8f0f018d072253c9/src/ggml.c#L13181-L13184

With kernel sizes 32 and 16, this is not an issue because ew0 % 32 = 0 in these cases. However, it's 608 (rounded up from 588) with kernel size 14, and a NaN value comes into being in the index 598 of x passed as an argument to ggml_vec_dot_f16. I couldn't discover the exact mechanism behind it, but disabling rounding up as in this PR eliminated the NaN issue in expense for multiplications of the leftover being done elementwise in ggml_vec_dot_f16.

So I'm not sure that this is the most elegant solution, but I wanted to bring this into the attention with a working solution.

Jun 21 '23 12:06 monatis

The failing test is test-grad0, which is failing also in master due to a timeout.

Jun 21 '23 12:06 monatis

I suppose that the calculation for the work size buffer here is wrong:

https://github.com/ggerganov/ggml/blob/80be90862935862e27ef790f38b8913ae8cff8b9/src/ggml.c#L16525-L16541

I think it should be:

cur = sizeof(ggml_fp16_t)*(ne10*ne11*ne12)*ggml_up32(ne00*ne01*ne02);

But not sure. Let me know if you give it a try - it's better to try to understand why the 32 rounding does not work

Jun 24 '23 15:06 ggerganov

Unfortunately it didn't work. It first increased the memory requirement for the computation buffer, and when I allocated the required memory the NaN issue kicked back. But I believe that you pointed out the right place that I should look into. I'll revisit it with a fresh mind and let you know.

Jun 24 '23 19:06 monatis

I think it's more related to the kernel data (src0) not prepared in wdata unlike src1 --trying to understand the memory layout there.

Jun 25 '23 11:06 monatis

Ah yes - that might be the case. Overall, the conv operators are quite cryptic - wonder if there is a better way to implement those.

So in that case, I'll merge the current PR for now

Jun 25 '23 12:06 ggerganov

Thanks, I'll dig deeper into it later on.

Now that this is merged, I'll raise a PR to add a link to clip.cpp shortly.

Jun 25 '23 13:06 monatis

ggml
ggml copied to clipboard

Support for convolution row sizes undivisible by 32 in `ggml_conv_2d_sk_p0`

ggml ggml copied to clipboard

Support for convolution row sizes undivisible by 32 in `ggml_conv_2d_sk_p0`

ggml
ggml copied to clipboard