llama.cpp
Add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures)
This adds two extra functions:
ggml_pad_circular
ggml_pad_ext_circular
They have the same signatures as the non-circular versions (I considered modifying the existing functions, but didn't want to break existing code). Instead of padding with zeros, they act "on a torus": x and y wrap around, so the padded border is filled with values from the opposite edge.
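To make the "on a torus" behaviour concrete, here is a minimal plain-C sketch of 2D circular padding. It is not the ggml implementation: `wrap_index` and `pad_circular_2d` are hypothetical helpers written for illustration, and the `lp0`/`rp0`/`lp1`/`rp1` parameter names are only borrowed from the left/right padding convention of `ggml_pad_ext`, on the assumption that dimension 0 is x and dimension 1 is y.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of ggml): wrap index i into [0, n). */
static size_t wrap_index(long i, long n) {
    long r = i % n;                 /* in C, % can be negative for negative i */
    return (size_t)(r < 0 ? r + n : r);
}

/*
 * Hypothetical reference for 2D circular padding of a row-major float image.
 * src is w x h; dst must hold (lp0 + w + rp0) * (lp1 + h + rp1) floats.
 * lp0/rp0 pad the left/right of x, lp1/rp1 pad the top/bottom of y.
 */
static void pad_circular_2d(const float *src, int w, int h,
                            int lp0, int rp0, int lp1, int rp1,
                            float *dst) {
    assert(w > 0 && h > 0);
    const int ow = lp0 + w + rp0;
    const int oh = lp1 + h + rp1;
    for (int y = 0; y < oh; ++y) {
        /* Source row for this output row, wrapped around the torus. */
        const size_t sy = wrap_index((long)y - lp1, h);
        for (int x = 0; x < ow; ++x) {
            const size_t sx = wrap_index((long)x - lp0, w);
            dst[(size_t)y * ow + x] = src[sy * (size_t)w + sx];
        }
    }
}
```

For example, padding the 3x2 image `{1, 2, 3, 4, 5, 6}` by one pixel on every side gives a 5x4 result whose first row is `6 4 5 6 4`: the row comes from the bottom of the source, and its two end values come from the opposite horizontal edge. A zero pad would fill that entire border with zeros instead.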
I implemented this for CUDA, CPU, and Vulkan, as those are the primary backends people use in KoboldCpp/Stable Diffusion Cpp to generate images. For other backends, it'll fall back to non-circular.
This can be used to make seamless textures; see https://github.com/leejet/stable-diffusion.cpp/pull/914 for an example and for the changes needed on the image generation side. For some models (Stable Diffusion), simply calling the circular functions is sufficient. For other models (Qwen Image), you also need to modify the RoPE embeddings slightly so that they loop cleanly.
I ran the CI tests and added tests for these functions, and I'm happy to answer any questions or modify things as needed.
(Edit note: a previous version of this PR also added circular padding to the conv ops, but we've decided that only circular pad is needed.)
I am wondering, is it possible to add only a variant of ggml_pad with circular padding, use that as separate operation before the convolutions, then do the convolution without padding? How much slower is that?
Adding circular padding natively to all convolutions on all/most backends is a lot of investment. I'm not sure how common it is, so it would be interesting to know the trade-off.
Huh, yes that's a very good suggestion and seems to work well.
For Qwen Image, using Vulkan on a 3090, I get 1.28s/it when padding ahead of time vs. 1.27s/it with circular convs. That difference is within rounding error, so there is very little performance penalty. I'll update the PR to only do circular padding, since that's all we need.
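The reason padding ahead of time can replace a circular convolution is that the two produce the same output: circularly pad by half the kernel size, then run an unpadded ("valid") convolution. The plain-C sketch below illustrates this on a single-channel image. All function names are hypothetical, none of this is ggml code, and it assumes an odd kernel size, stride 1, and cross-correlation (unflipped kernel), as is usual in ML frameworks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: wrap i into [0, n), also for negative i. */
static int wrap(int i, int n) {
    int r = i % n;
    return r < 0 ? r + n : r;
}

/* Circularly pad a w x h row-major image by p pixels on every side. */
static void pad_circular(const float *src, int w, int h, int p, float *dst) {
    const int ow = w + 2 * p;
    const int oh = h + 2 * p;
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x)
            dst[y * ow + x] = src[wrap(y - p, h) * w + wrap(x - p, w)];
}

/* Unpadded ("valid") k x k convolution; output is (w-k+1) x (h-k+1). */
static void conv2d_valid(const float *src, int w, int h,
                         const float *ker, int k, float *dst) {
    const int ow = w - k + 1;
    const int oh = h - k + 1;
    assert(ow > 0 && oh > 0);
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x) {
            float acc = 0.0f;
            for (int j = 0; j < k; ++j)
                for (int i = 0; i < k; ++i)
                    acc += ker[j * k + i] * src[(y + j) * w + (x + i)];
            dst[y * ow + x] = acc;
        }
}

/* Fused reference: same-size convolution that wraps source indices itself. */
static void conv2d_circular(const float *src, int w, int h,
                            const float *ker, int k, float *dst) {
    const int p = k / 2; /* assumes odd k */
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int j = 0; j < k; ++j)
                for (int i = 0; i < k; ++i)
                    acc += ker[j * k + i] *
                           src[wrap(y + j - p, h) * w + wrap(x + i - p, w)];
            dst[y * w + x] = acc;
        }
}
```

Calling `pad_circular` with `p = k / 2` and then `conv2d_valid` on the padded buffer yields the same w x h output as `conv2d_circular` on the original image. The separate pad costs one extra pass over the data and a slightly larger intermediate buffer, which is consistent with the small difference measured above, while avoiding a circular mode in every convolution kernel on every backend.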
OK, it should be ready now.