stable-diffusion.cpp SDXL on Snapdragon X Elite Adreno

Hello, thanks for this repository, it's extremely useful. I am trying to run SDXL on a Snapdragon X Elite, on the Adreno GPU. The full diffusion process finishes in under 10 seconds, but the resulting image is blank (grey). I tried --type f16, but that did not help. Then I tried --clip-on-cpu, which worked but was extremely slow (50 sec/iter). Any ideas on how to fix it?

PS C:\Users\sborse\dev\llm\stable-diffusion.cpp\build> ./bin/sd -m ..\weights\sd_xl_base_1.0.safetensors --vae ..\weights\sdxl.vae.safetensors -H 1024 -W 1024 -p "a lovely cat" -v --vae-tiling
Option:
    n_threads: 6
    mode: img_gen
    model_path: ..\weights\sd_xl_base_1.0.safetensors
    wtype: unspecified
    clip_l_path:
    clip_g_path:
    clip_vision_path:
    t5xxl_path:
    diffusion_model_path:
    high_noise_diffusion_model_path:
    vae_path: ..\weights\sdxl.vae.safetensors
    taesd_path:
    esrgan_path:
    control_net_path:
    embedding_dir:
    photo_maker_path:
    pm_id_images_dir:
    pm_id_embed_path:
    pm_style_strength: 20.00
    output_path: output.png
    init_image_path:
    end_image_path:
    mask_image_path:
    control_image_path:
    ref_images_paths:
    control_video_path:
    increase_ref_index: false
    offload_params_to_cpu: false
    clip_on_cpu: false
    control_net_cpu: false
    vae_on_cpu: false
    diffusion flash attention: false
    diffusion Conv2d direct: false
    vae_conv_direct: false
    control_strength: 0.90
    prompt: a lovely cat
    negative_prompt:
    clip_skip: -1
    width: 1024
    height: 1024
    sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: default, sample_steps: 20, eta: 0.00, shifted_timestep: 0)
    high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: default, sample_steps: -1, eta: 0.00, shifted_timestep: 0)
    moe_boundary: 0.875
    flow_shift: inf
    strength(img2img): 0.75
    rng: cuda
    seed: 42
    batch_count: 1
    vae_tiling: true
    upscale_repeats: 1
    chroma_use_dit_mask: true
    chroma_use_t5_mask: false
    chroma_t5_mask_pad: 1
    video_frames: 1
    vace_strength: 1.00
    fps: 16
System Info:
    SSE3 = 0
    AVX = 0
    AVX2 = 0
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 0
    NEON = 1
    ARM_FMA = 1
    F16C = 0
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:161 - Using OpenCL backend
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: device: 'Qualcomm(R) Adreno(TM) X1-85 GPU (OpenCL 3.0 Qualcomm(R) Adreno(TM) X1-85 GPU)'
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: 832.0 Compiler DX.18.12.00
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: vector subgroup broadcast support: true
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: device FP16 support: true
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: mem base addr align: 128
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: max mem alloc size: 2048 MB
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: device max workgroup size: 1024
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: SVM coarse grain buffer support: false
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: SVM fine grain buffer support: false
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: SVM fine grain system support: false
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: SVM atomics support: false
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: loading OpenCL kernels
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - .
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . 
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - . [DEBUG] stable-diffusion.cpp\ggml_extend.hpp:74 - [INFO ] stable-diffusion.cpp\ggml_extend.hpp:65 - ggml_opencl: default device: 'Qualcomm(R) Adreno(TM) X1-85 GPU (OpenCL 3.0 Qualcomm(R) Adreno(TM) X1-85 GPU)' [INFO ] stable-diffusion.cpp:201 - loading model from '..\weights\sd_xl_base_1.0.safetensors' [INFO ] model.cpp:1044 - load ..\weights\sd_xl_base_1.0.safetensors using safetensors format [DEBUG] model.cpp:1151 - init from '..\weights\sd_xl_base_1.0.safetensors', prefix = '' [INFO ] stable-diffusion.cpp:255 - loading vae from '..\weights\sdxl.vae.safetensors' [INFO ] model.cpp:1044 - load ..\weights\sdxl.vae.safetensors using safetensors format [DEBUG] model.cpp:1151 - init from '..\weights\sdxl.vae.safetensors', prefix = 'vae.' 
[INFO ] stable-diffusion.cpp:267 - Version: SDXL
[INFO ] stable-diffusion.cpp:298 - Weight type: f16
[INFO ] stable-diffusion.cpp:299 - Conditioner weight type: f16
[INFO ] stable-diffusion.cpp:300 - Diffusion model weight type: f16
[INFO ] stable-diffusion.cpp:301 - VAE weight type: f16
[DEBUG] stable-diffusion.cpp:303 - ggml tensor size = 400 bytes
[DEBUG] stable-diffusion.cpp\clip.hpp:171 - vocab size: 49408
[DEBUG] stable-diffusion.cpp\clip.hpp:182 - trigger word img already in vocab
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1729 - clip params backend buffer size = 235.06 MB(VRAM) (196 tensors)
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1729 - clip params backend buffer size = 1329.29 MB(VRAM) (517 tensors)
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1729 - unet params backend buffer size = 4900.07 MB(VRAM) (1680 tensors)
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1729 - vae params backend buffer size = 94.47 MB(VRAM) (140 tensors)
[DEBUG] stable-diffusion.cpp:565 - loading weights
[DEBUG] model.cpp:1961 - using 6 threads for model loading
[DEBUG] model.cpp:2044 - loading tensors from ..\weights\sd_xl_base_1.0.safetensors
|=============================================> | 2393/2641 - 627.92it/s
[DEBUG] model.cpp:2044 - loading tensors from ..\weights\sdxl.vae.safetensors
|==================================================| 2641/2641 - 656.15it/s
[INFO ] model.cpp:2288 - loading tensors completed, taking 4.04s (process: 0.01s, read: 3.52s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.35s)
[INFO ] stable-diffusion.cpp:661 - total params memory size = 6558.89MB (VRAM 6558.89MB, RAM 0.00MB): text_encoders 1564.36MB(VRAM), diffusion_model 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:714 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:725 - finished loaded file
[DEBUG] stable-diffusion.cpp:2262 - generate_image 1024x1024
[INFO ] stable-diffusion.cpp:2383 - TXT2IMG
[INFO ] stable-diffusion.cpp:874 - attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:894 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:895 - prompt after extract and remove lora: "a lovely cat"
[DEBUG] stable-diffusion.cpp\conditioner.hpp:345 - parse 'a lovely cat' to [['a lovely cat', 1], ]
[DEBUG] stable-diffusion.cpp\clip.hpp:311 - token length: 77
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1553 - clip compute buffer size: 1.40 MB(VRAM)
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1553 - clip compute buffer size: 2.33 MB(VRAM)
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1553 - clip compute buffer size: 2.33 MB(VRAM)
[DEBUG] stable-diffusion.cpp\conditioner.hpp:479 - computing condition graph completed, taking 3724 ms
[DEBUG] stable-diffusion.cpp\conditioner.hpp:345 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp\clip.hpp:311 - token length: 77
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1553 - clip compute buffer size: 1.40 MB(VRAM)
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1553 - clip compute buffer size: 2.33 MB(VRAM)
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1553 - clip compute buffer size: 2.33 MB(VRAM)
[DEBUG] stable-diffusion.cpp\conditioner.hpp:479 - computing condition graph completed, taking 74 ms
[INFO ] stable-diffusion.cpp:2049 - get_learned_condition completed, taking 3802 ms
[INFO ] stable-diffusion.cpp:2072 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:2121 - generating image: 1/1 - seed 42
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1553 - unet compute buffer size: 830.86 MB(VRAM)
|==================================================| 20/20 - 3.61it/s
[INFO ] stable-diffusion.cpp:2158 - sampling completed, taking 5.54s
[INFO ] stable-diffusion.cpp:2166 - generating 1 latent images completed, taking 6.14s
[INFO ] stable-diffusion.cpp:2169 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:1521 - VAE Tile size: 32x32
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:817 - num tiles : 7, 7
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:818 - optimal overlap : 0.500000, 0.500000 (targeting 0.500000)
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:851 - tile work buffer size: 0.77 MB
[INFO ] stable-diffusion.cpp\ggml_extend.hpp:864 - processing 49 tiles
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1553 - vae compute buffer size: 416.02 MB(VRAM)
|==================================================| 49/49 - 76.92it/s
[DEBUG] stable-diffusion.cpp:1547 - computing vae decode graph completed, taking 0.56s
[INFO ] stable-diffusion.cpp:2179 - latent 1 decoded, taking 0.56s
[INFO ] stable-diffusion.cpp:2183 - decode_first_stage completed, taking 0.56s
[INFO ] stable-diffusion.cpp:2475 - generate_image completed in 10.51s
save result PNG image to 'output.png'

Oct 10 '25 04:10 sborse3

Since you say that generation appears to be working with --clip-on-cpu, your Snapdragon's Adreno GPU (I'm assuming this is on a phone... please let me know if not) can't process the text encoders merged into the model you're using and needs to run them on the CPU instead.

NB: From the "Using OpenCL (for Adreno GPU)" portion of the README:

Currently, it supports only Adreno GPUs and is primarily optimized for Q4_0 type

Try using a Q4_0 quant of SDXL, such as the stable-diffusion-xl-base-1.0-Q4_0.gguf from here. If that still needs --clip-on-cpu to generate an image, try downloading the clip_l.safetensors and clip_g.safetensors files from hum-ma's repository as well, and using those specifically in your sd parameters.
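For illustration, here is a rough Python sketch that assembles an sd argument list along the lines suggested above: a Q4_0 GGUF as the model, plus separate clip_l and clip_g files. It only builds the list and does not run anything. The `build_args` helper is hypothetical (not part of stable-diffusion.cpp), and placing the downloaded files under `..\weights` is an assumption based on the original command.

```python
from pathlib import PureWindowsPath

# Directory taken from the original command; assumed to hold the new files too.
WEIGHTS = PureWindowsPath(r"..\weights")


def build_args(model, clip_l=None, clip_g=None, vae=None, clip_on_cpu=False):
    """Build an sd argument list (hypothetical helper, for illustration only)."""
    args = ["./bin/sd", "-m", str(WEIGHTS / model)]
    if clip_l:
        args += ["--clip_l", str(WEIGHTS / clip_l)]
    if clip_g:
        args += ["--clip_g", str(WEIGHTS / clip_g)]
    if vae:
        args += ["--vae", str(WEIGHTS / vae)]
    if clip_on_cpu:
        args.append("--clip-on-cpu")
    # Remaining options copied from the original command line.
    args += ["-H", "1024", "-W", "1024", "-p", "a lovely cat", "-v", "--vae-tiling"]
    return args


args = build_args(
    "stable-diffusion-xl-base-1.0-Q4_0.gguf",
    clip_l="clip_l.safetensors",
    clip_g="clip_g.safetensors",
    vae="sdxl.vae.safetensors",
)
print(args)
# To actually run it: import subprocess; subprocess.run(args, check=True)
```

Passing `clip_on_cpu=True` adds the fallback flag again, which makes it easy to compare the two runs with everything else held constant.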

Can't guarantee this is going to solve your problem, but it's worth a shot. If it still won't work without --clip-on-cpu after trying this, it could be a general OpenCL limitation or one specific to the Snapdragon SoC you have.

Oct 10 '25 05:10 MrSnichovitch

Also, IIRC OpenCL benefits from both --diffusion-conv-direct and --vae-conv-direct (faster inference, reduced memory usage).
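A minimal sketch of what that would look like, written as a Python argument list so the two flags are easy to toggle. The base command is the one from the original post; whether the flags actually help on this Adreno driver is untested here.

```python
# Base command copied from the original post.
base = [
    "./bin/sd",
    "-m", r"..\weights\sd_xl_base_1.0.safetensors",
    "--vae", r"..\weights\sdxl.vae.safetensors",
    "-H", "1024", "-W", "1024",
    "-p", "a lovely cat",
    "-v", "--vae-tiling",
]

# The two flags mentioned above; drop either one to test them separately.
conv_direct = ["--diffusion-conv-direct", "--vae-conv-direct"]

args = base + conv_direct
print(args)
# To actually run it: import subprocess; subprocess.run(args, check=True)
```

With -v enabled, the "diffusion Conv2d direct" and "vae_conv_direct" entries in the printed option list are the ones to check to confirm the flags were picked up.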

Oct 10 '25 10:10 wbruna

@MrSnichovitch, it is not on mobile; I am running this on a Dell laptop with the Snapdragon X Elite.

Btw, when I try with the Q4_0 GGUF, I get this error:

[ERROR] stable-diffusion.cpp:263 - get sd version from file failed: '..\weights\stable-diffusion-xl-base-1.0-Q4_0.gguf'
new_sd_ctx_t failed

Oct 10 '25 18:10 sborse3

@sborse3 Well... that's not good. Sorry for steering you astray on that.

I've tried this and a few other SDXL Q4_0 GGUFs from Hugging Face myself, and I get the same error using both Vulkan and ROCm. If I use specific VAE, Clip_L and Clip_G files, I get other errors that tank generation, such as...

[INFO ] model.cpp:2268 - loading tensors completed, taking 1.02s (process: 0.01s, read: 0.32s, memcpy: 0.00s, convert: 0.04s, copy_to_backend: 0.13s)
[ERROR] model.cpp:2341 - tensor 'cond_stage_model.1.transformer.text_model.text_projection' not in model file
[ERROR] model.cpp:2341 - tensor 'model.diffusion_model.output_blocks.2.2.conv.bias' not in model file
[ERROR] model.cpp:2341 - tensor 'model.diffusion_model.output_blocks.2.2.conv.weight' not in model file
[ERROR] stable-diffusion.cpp:581  - load tensors from model loader failed
new_sd_ctx_t failed

Haven't really used any SDXL base stuff since the dawn of Illustrious models and then Chroma, so these errors are new to me. Tried using sd to convert full Illustrious models to Q4_0 locally, and they work fine. Seems like this is more a problem with pre-fab quants rather than with sd, but I'm not qualified to say. Could also mean I have the wrong CLIP/VAE files downloaded for the models I tried, but I stuck with what was available in each repo the models were pulled from, so I don't think so.
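For anyone wanting to try the local conversion route on the SDXL base file instead of a pre-made quant, a sketch of the argument list is below. It assumes the convert mode described in the stable-diffusion.cpp README (`-M convert` with `-o` and `--type`); the output file name is made up for the example, and `convert_args` is a hypothetical helper.

```python
def convert_args(src, dst, wtype="q4_0"):
    """Build an sd argument list for converting a model to a quantized GGUF.

    Assumes the README's convert mode; hypothetical helper for illustration.
    """
    return ["./bin/sd", "-M", "convert", "-m", src, "-o", dst, "-v", "--type", wtype]


args = convert_args(
    r"..\weights\sd_xl_base_1.0.safetensors",   # file from the original post
    r"..\weights\sd_xl_base_1.0.q4_0.gguf",     # hypothetical output name
)
print(args)
# To actually run it: import subprocess; subprocess.run(args, check=True)
```

The resulting GGUF would then be passed to -m in place of the downloaded one.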

Regardless, if the original model you were using only works with --clip-on-cpu, even when specifying the --clip_l and --clip_g files, try @wbruna's advice with --diffusion-conv-direct and/or --vae-conv-direct, and also consider dropping the output resolution to speed things up (e.g., 768x768).
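To put a number on the resolution suggestion, here is a small back-of-the-envelope calculation. It assumes the usual 8x downscale per side between image and latent for the SDXL VAE; the real speedup depends on the backend and is not guaranteed to follow the pixel ratio.

```python
VAE_SCALE = 8  # assumption: latent side = image side / 8


def latent_side(pixels):
    """Latent width/height for a square image of the given side length."""
    return pixels // VAE_SCALE


for side in (1024, 768):
    n = latent_side(side)
    print(f"{side}x{side} image -> {n}x{n} latent ({n * n} positions)")

# Fraction of latent positions left after dropping from 1024 to 768.
ratio = (latent_side(768) ** 2) / (latent_side(1024) ** 2)
print(f"768x768 has {ratio:.4f} of the positions of 1024x1024")
```

So 768x768 leaves a bit over half the latent positions to process per step, which is where most of the expected time saving would come from.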

Oct 11 '25 05:10 MrSnichovitch
