bssrdf

Results: 75 comments by bssrdf

> Why are you adding a new ggml op?

Because of https://github.com/ggml-org/llama.cpp/pull/15669#discussion_r2311865862

> I think the implementation of implicit gemm can directly use `ggml_conv2d_direct`. There's really no need to provide so many conv2d functions.

I can reuse `ggml_conv2d_direct`. TBH, it is not...

> Took it for a small test drive in sd.cpp for VAE decoding:
>
> ### 768x1024 sd1 fp16 vae
>
> | method | time | memory |
> | --- | --- | --- |
> | imcol+mul | ~1.68s | 4992.19 MB |
>
> ...

> Corrected test drive in sd.cpp for VAE decoding:
>
> ### 768x1024 sd1 fp16 vae
>
> | method | time | memory |
> | --- | --- | --- |
> | CUDA imcol+mul | ~1.68s | 4992.19 MB |
> | CUDA direct | ~35.35s | ... |

> > > Corrected test drive in sd.cpp for VAE decoding:
> > >
> > > ### 768x1024 sd1 fp16 vae
> > >
> > > | method | time | memory |
> > > | --- | --- | --- |
> > > | CUDA imcol+mul | ... | |

Now https://github.com/ggml-org/llama.cpp/pull/16088 is a better implementation.

I am reopening this PR. While working on it, I noticed the current `cpy_flt` kernel has significantly uncoalesced global memory access. This is particularly bad if one tries to make a...
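To illustrate what coalescing means here, the following is a host-side sketch with illustrative names, not ggml's actual `cpy_flt` kernel: treating `tid` as a simulated `threadIdx.x + blockIdx.x * blockDim.x`, an access pattern is coalesced when consecutive thread IDs touch consecutive addresses within one step.

```cpp
#include <cassert>
#include <vector>

// Coalesced mapping (illustrative, not ggml code): thread tid copies
// element tid, so a "warp" of neighboring threads reads one contiguous
// segment of memory per step.
void copy_coalesced(const float* src, float* dst, int n) {
    for (int tid = 0; tid < n; ++tid) {
        dst[tid] = src[tid];
    }
}

// Uncoalesced mapping: each thread owns a contiguous chunk, so at any
// given step neighboring threads are `chunk` floats apart in memory --
// on a GPU each warp transaction becomes a scattered set of loads.
void copy_chunked(const float* src, float* dst, int n, int chunk) {
    const int nthreads = n / chunk;              // assume n % chunk == 0
    for (int step = 0; step < chunk; ++step) {   // one "instruction" per step
        for (int tid = 0; tid < nthreads; ++tid) {
            const int idx = tid * chunk + step;  // stride `chunk` between tids
            dst[idx] = src[idx];
        }
    }
}
```

Both versions produce an identical copy; only the access pattern differs, which is why an uncoalesced copy kernel is correct but slow.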

> I'm not seeing a copy kernel in this PR; do you have it somewhere else? In any case, I think that optimizing the copy kernel would best be spun...

Please review this PR and ignore the changes related to cpy (they will be reverted; I have another PR addressing them). The updated numbers on a 4090:

```
CONV_2D_IMPLICIT(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0): 249...
```
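For reference, a test shape like the one above maps onto a single GEMM in an implicit-gemm conv2d. A hedged sketch of that mapping follows: ggml tensor shapes are [W,H,C,N], and the helper below is illustrative, not part of ggml.

```cpp
#include <cassert>

// Dimensions of the GEMM that an implicit-gemm conv2d computes
// (sketch; names are illustrative, not ggml's).
struct GemmDims { long long M, N, K; };

GemmDims implicit_gemm_dims(int IW, int IH, int Cin, int batch,
                            int KW, int KH, int Cout,
                            int stride, int pad, int dil) {
    // Standard conv2d output-size formula with stride/padding/dilation.
    const int OW = (IW + 2 * pad - dil * (KW - 1) - 1) / stride + 1;
    const int OH = (IH + 2 * pad - dil * (KH - 1) - 1) / stride + 1;
    // Filters form the M dimension, output pixels the N dimension,
    // and the reduction (K) runs over the KW*KH*Cin patch elements --
    // the im2col matrix is never materialized, its indices are computed
    // on the fly inside the GEMM.
    return { Cout,
             (long long)OW * OH * batch,
             (long long)KW * KH * Cin };
}
```

For ne_input=[19,19,256,16] and ne_kernel=[4,4,256,4096] with stride 1, padding 0, dilation 1, this gives M = 4096, N = 16*16*16 = 4096, K = 4*4*256 = 4096.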

> Redid the sd.cpp testing with all new commits (sd+ggml+pr), also heat-soaked, with **_more tests_** and sampling performance.
>
> ### sd1 fp16 512x768
>
> method time sample...