Steward Garcia comments

Results: 92 comments of Steward Garcia

Custom attention bias

@b-albar Could this work with infinite negatives (a custom attention mask)? It seems like I have to reshape the array to (batch_size, num_heads, seq_len, seq_len). It would be good if broadcasting...
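To illustrate the broadcasting idea in the comment above, here is a minimal pure-Python sketch, not the API of the library under discussion. It assumes a single (seq_len, seq_len) bias containing `-inf` entries that is shared by every batch item and head, so it never has to be expanded to (batch_size, num_heads, seq_len, seq_len). The names `softmax`, `attention_weights`, and `causal_bias` are hypothetical.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    # Subtract the row maximum for stability; math.exp(-inf) evaluates to 0.0,
    # so masked positions receive exactly zero weight.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, bias):
    """scores: nested list [batch][heads][seq][seq].
    bias: nested list [seq][seq], reused ("broadcast") for every batch and head."""
    return [
        [
            [
                softmax([s + b for s, b in zip(score_row, bias_row)])
                for score_row, bias_row in zip(head, bias)
            ]
            for head in batch
        ]
        for batch in scores
    ]

seq_len = 3
# Causal mask as an additive bias: position i may only attend to j <= i.
causal_bias = [
    [0.0 if j <= i else NEG_INF for j in range(seq_len)]
    for i in range(seq_len)
]
# Hypothetical uniform scores with batch_size=2 and num_heads=2.
scores = [
    [[[0.0] * seq_len for _ in range(seq_len)] for _ in range(2)]
    for _ in range(2)
]
weights = attention_weights(scores, causal_bias)
```

Only one copy of the bias is stored here, which is the memory saving broadcasting would provide compared with materializing the full four-dimensional array.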

Much higher RAM usage (2-3 times) compared to FastSDCPU when using the exact same models/settings

Currently, im2col is being used for convolutions, which consumes a very high amount of RAM during the VAE phase. I have been working on a kernel that merges im2col and...
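As a rough illustration of why im2col is memory-hungry, the sketch below counts the elements of the im2col buffer for a single convolution. It relies on the standard im2col layout, (channels × kernel × kernel) rows by (out_h × out_w) columns; the layer dimensions are hypothetical and are not taken from any particular VAE.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    # Standard output-size formula for a convolution along one dimension.
    return (size + 2 * padding - kernel) // stride + 1

def im2col_elements(channels, height, width, kernel, stride=1, padding=0):
    # Each output position gets its own copy of a (channels * kernel * kernel) patch.
    out_h = conv_out_size(height, kernel, stride, padding)
    out_w = conv_out_size(width, kernel, stride, padding)
    return channels * kernel * kernel * out_h * out_w

# Hypothetical layer: 128 channels at 512x512, 3x3 kernel, stride 1, padding 1.
channels, height, width = 128, 512, 512
input_elements = channels * height * width
buffer_elements = im2col_elements(channels, height, width, kernel=3, padding=1)

ratio = buffer_elements / input_elements        # 9.0: one copy per kernel tap
buffer_mib_f16 = buffer_elements * 2 / 2**20    # 576.0 MiB at 2 bytes per element
```

With a 3×3 kernel and "same" padding, the temporary buffer is nine times the size of the input tensor, which is the overhead a fused kernel would avoid.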

Inference bottleneck

The truth is that, yes, the CPU backend isn't as optimized as it could be; perhaps it's the im2col kernel since it overuses memory accesses. In all ML software, the...

Support Inpainting

@leejet I believe that is done by adding noise only to the white part of the latent image, and in the decoder, keeping the pixels of the black part unchanged....
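The masking idea described above can be sketched as a per-element blend. This is only an illustration under stated assumptions: the mask is 1.0 for the white (regenerated) region and 0.0 for the black (preserved) region, and `blend`, `original`, and `generated` are hypothetical names operating on small 2D lists rather than real latents or images.

```python
def blend(original, generated, mask):
    # Keep the original value where mask == 0 (black),
    # take the generated value where mask == 1 (white).
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]

original = [[0.1, 0.2], [0.3, 0.4]]
generated = [[0.9, 0.8], [0.7, 0.6]]
mask = [[1.0, 0.0], [0.0, 1.0]]

result = blend(original, generated, mask)
```

The same blend could be applied to the noised latent at each step or to the decoded pixels; which of the two the comment intends is not recoverable from the truncated text.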

Support Inpainting

@leejet I think we should first solve that problem before considering adding the inpainting feature. Inpainting models require a latent image with 9 input channels, 4 for the usual channels,...

SDXL-Lightning support

@mzwing ~~I'll try to implement the missing scheduler, but I'm not exactly sure which of the models you've uploaded to Hugging Face I should try to see if I get...

WIP - Web server + conv2d fused + k-quants + dynamic gpu offloading

@bssrdf > Is im2col going to be skipped? Or done on the fly? I am going to do something similar to flash attention. I am going to divide the blocks...

WIP - Web server + conv2d fused + k-quants + dynamic gpu offloading

@leejet I'm not sure if this could lead to memory leaks, since it needs to be created for each model, and 10 MB is a lot to store just the metadata...

WIP - Web server + conv2d fused + k-quants + dynamic gpu offloading

For now, the kernel I created to avoid the overhead of im2col results in a 50% reduction in performance, even though it's only applied to the operation that generates a...

WIP - Web server + conv2d fused + k-quants + dynamic gpu offloading

@Green-Sky I'll try to run tests on an RTX 3060, mainly with CUDA Toolkit 11.8. The truth is that there isn't a standard API for Stable Diffusion. For example, the ComfyUI...