
ROCm broken in recent versions

Open · Stefan-Olt opened this issue 3 months ago • 12 comments

I just wanted to compile a more recent version of stable-diffusion.cpp with the ROCm backend (gfx1030). It compiles just fine (it needs PIC enabled, otherwise it won't link for me), but I get this error at runtime, with the latest ROCm installed (6.4.3):

ggml_cuda_compute_forward: GET_ROWS failed
ROCm error: invalid device function
  current device: 0, in function ggml_cuda_compute_forward at /[sourcedir]/stable-diffusion.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2522
  err
/[sourcedir]/stable-diffusion.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:87: ROCm error
[New LWP 110749]
[New LWP 110743]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fb7477107e3 in __GI___wait4 (pid=110751, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0  0x00007fb7477107e3 in __GI___wait4 (pid=110751, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x000055e8452d8cd6 in ggml_print_backtrace ()
#2  0x000055e8452d8f29 in ggml_abort ()
#3  0x000055e844f48252 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
#4  0x000055e844f4f229 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
#5  0x000055e8452f08ac in ggml_backend_graph_compute ()
#6  0x000055e844daf44c in GGMLRunner::compute(std::function<ggml_cgraph* ()>, int, bool, ggml_tensor**, ggml_context*) ()
#7  0x000055e844daf0e3 in CLIPTextModelRunner::compute(int, ggml_tensor*, int, void*, unsigned long, bool, ggml_tensor**, ggml_context*) ()
#8  0x000055e844dec8cb in FrozenCLIPEmbedderWithCustomWords::get_learned_condition_common(ggml_context*, int, std::vector<int, std::allocator<int> >&, std::vector<float, std::allocator<float> >&, int, int, int, int, bool) ()
#9  0x000055e844deb7b1 in FrozenCLIPEmbedderWithCustomWords::get_learned_condition(ggml_context*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int, int, int, bool) ()
#10 0x000055e844d53d20 in generate_image_internal(sd_ctx_t*, ggml_context*, ggml_tensor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, sd_guidance_params_t, float, int, int, sample_method_t, std::vector<float, std::allocator<float> > const&, long, int, sd_image_t, float, float, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<ggml_tensor*, std::allocator<ggml_tensor*> >, bool, ggml_tensor*, ggml_tensor*) ()
#11 0x000055e844d58df0 in generate_image ()
#12 0x000055e844ce2ed2 in main ()
[Inferior 1 (process 110742) detached]

My previous build from March this year works just fine. I also tried building for Vulkan; that works as well, but it is significantly slower (6.3 s per round vs. 3.2 s with ROCm) and needs more VRAM (it needs --vae-on-cpu or it will crash).
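
For reference, the Vulkan build was configured along these lines (a sketch, assuming the SD_VULKAN switch from the README):

cmake .. -G "Ninja" -DSD_VULKAN=ON -DCMAKE_BUILD_TYPE=Release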

Is it possible to restore ROCm support?

Best regards, Stefan

Stefan-Olt · Sep 13 '25 22:09

master-fce6afc is working for me (gfx1102, ROCm 6.4.1). Which model are you running? And with which command line parameters?

wbruna · Sep 14 '25 00:09

I'm running SDXL, for example with this command line:

sd -m '/[modeldir]/sd_xl_base_1.0.safetensors' --vae '/[modeldir]/sdxl_vae.safetensors' --t5xxl '/[modeldir]/t5xxl_fp16.safetensors' -H 1024 -W 1024 -p 'a lovely cat holding a sign says \"Stable diffusion XL\"' -v --cfg-scale 4.5

This is the output at the start:

Option: 
    n_threads:                         12
    mode:                              img_gen
    model_path:                        /[modeldir]/sd_xl_base_1.0.safetensors
    wtype:                             unspecified
    clip_l_path:                       
    clip_g_path:                       
    clip_vision_path:                  
    t5xxl_path:                        /[modeldir]/t5xxl_fp16.safetensors
    diffusion_model_path:              
    high_noise_diffusion_model_path:   
    vae_path:                          /[modeldir]/sdxl_vae.safetensors
    taesd_path:                        
    esrgan_path:                       
    control_net_path:                  
    embedding_dir:                     
    stacked_id_embed_dir:              
    input_id_images_path:              
    style ratio:                       20.00
    normalize input image:             false
    output_path:                       output.png
    init_image_path:                   
    end_image_path:                    
    mask_image_path:                   
    control_image_path:                
    ref_images_paths:
    increase_ref_index:                false
    offload_params_to_cpu:             false
    clip_on_cpu:                       false
    control_net_cpu:                   false
    vae_on_cpu:                        false
    diffusion flash attention:         false
    diffusion Conv2d direct:           false
    vae_conv_direct:                   false
    control_strength:                  0.90
    prompt:                            a lovely cat holding a sign says \"Stable diffusion XL\"
    negative_prompt:                   
    clip_skip:                         -1
    width:                             1024
    height:                            1024
    sample_params:                     (txt_cfg: 4.50, img_cfg: 4.50, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: euler_a, sample_steps: 20, eta: 0.00)
    high_noise_sample_params:          (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: euler_a, sample_steps: -1, eta: 0.00)
    moe_boundary:                      0.875
    flow_shift:                        inf
    strength(img2img):                 0.75
    rng:                               cuda
    seed:                              42
    batch_count:                       1
    vae_tiling:                        false
    upscale_repeats:                   1
    chroma_use_dit_mask:               true
    chroma_use_t5_mask:                false
    chroma_t5_mask_pad:                1
    video_frames:                      1
    fps:                               16
System Info: 
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:143  - Using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6600, gfx1030 (0x1030), VMM: no, Wave Size: 32

I was running master-b017918; I just retested with master-0ebe6fe, and it's the very same issue.

Stefan-Olt · Sep 15 '25 01:09

sd -m '/[modeldir]/sd_xl_base_1.0.safetensors' --vae '/[modeldir]/sdxl_vae.safetensors' --t5xxl '/[modeldir]/t5xxl_fp16.safetensors' -H 1024 -W 1024 -p 'a lovely cat holding a sign says \"Stable diffusion XL\"' -v --cfg-scale 4.5

Not likely to be the issue here, but you don't need T5XXL for SDXL models.

What about the way you are building the binaries? How exactly are you enabling PIC? My command line:

cmake -B build_dir stable-diffusion.cpp -DSD_BUILD_SHARED_LIBS=ON -DSD_HIPBLAS=ON -DGPU_TARGETS=gfx1102 && cmake --build build_dir

wbruna · Sep 15 '25 01:09

This was my build command, just as given in README.md, with the exception that I added -DCMAKE_POSITION_INDEPENDENT_CODE=ON:

cmake .. -G "Ninja" -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DSD_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release -DGPU_TARGETS=gfx1030 -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON -DCMAKE_POSITION_INDEPENDENT_CODE=ON

I just tested your build command (for gfx1030, of course); it also compiles fine (since you build a shared library, you don't need to enable position-independent code explicitly, as shared libraries are usually built with it). Unfortunately it gives exactly the same error (I also tried removing t5xxl from my command line, again no difference).

If it works for you with RDNA3/gfx1102, I assume ggml doesn't build correctly for RDNA2/gfx1030 at the moment.
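
A quick way to check this, assuming the binary is at build/bin/sd (adjust the path as needed): roc-obj-ls, which ships with ROCm, lists the code objects embedded in a binary, and grepping strings is a cruder fallback:

roc-obj-ls build/bin/sd

strings build/bin/sd | grep -o 'gfx[0-9a-f]*' | sort -u

If gfx1030 doesn't show up, the kernels were never compiled for this GPU, which would match the "invalid device function" error.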

Stefan-Olt · Sep 15 '25 02:09

Yeah, I agree it should work. I'd suggest asking around on the llama.cpp project; you'd have much more visibility there (there is a separate ggml repository, but most ggml development happens in llama.cpp).

By the way, you mentioned a build from March that is working; so it was built with an older ROCm version. I'd check if a new build of that old sd.cpp version also works for you.
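
Roughly (a sketch; <march-commit> is a placeholder for whatever revision that build was, and ggml is a submodule, so it has to be synced to match):

git -C stable-diffusion.cpp checkout <march-commit>
git -C stable-diffusion.cpp submodule update --init --recursive

Then rebuild with the same cmake invocation as before. That would tell whether the regression comes from sd.cpp/ggml changes or from the newer ROCm.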

wbruna · Sep 15 '25 11:09

I have now tested with ROCm 7.0.1: same issue. I compiled and ran llama.cpp with ROCm: works with no issue at all. I replaced the stable-diffusion.cpp ggml source with the one from llama.cpp: more warnings during compilation, but it builds, and the resulting binary shows exactly the same error.
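
(Roughly what I did for the swap, assuming both repos are checked out side by side:)

rm -rf stable-diffusion.cpp/ggml
cp -r llama.cpp/ggml stable-diffusion.cpp/ggml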

So I assume that ggml is either not built correctly or is used differently than in llama.cpp.

Is there any model/quantization etc. I could test llama.cpp with that closely resembles how stable-diffusion.cpp uses ggml?

Stefan-Olt · Sep 18 '25 22:09

There is a known issue that could cause a GET_ROWS error (#837), but not for f16 models... maybe that sd_xl_base_1.0.safetensors of yours has an uncommon quant?

Does it work with --emb-dir ., --type f16 or --clip-on-cpu?
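
For instance, spliced one at a time into your command from above:

sd -m '/[modeldir]/sd_xl_base_1.0.safetensors' --vae '/[modeldir]/sdxl_vae.safetensors' -H 1024 -W 1024 -p 'a lovely cat holding a sign says \"Stable diffusion XL\"' -v --cfg-scale 4.5 --clip-on-cpu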

wbruna · Sep 18 '25 22:09

Unfortunately those options did not change anything. I also tried the Flux model: it works with an old build, but gives a very similar error with a new build:

[ERROR] ggml_extend.hpp:71   - ggml_cuda_compute_forward: ADD failed
[ERROR] ggml_extend.hpp:71   - ROCm error: invalid device function
[ERROR] ggml_extend.hpp:71   -   current device: 0, in function ggml_cuda_compute_forward at /[sourcedir]/stable-diffusion.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2522
[ERROR] ggml_extend.hpp:71   -   err
/[sourcedir]/stable-diffusion.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:87: ROCm error

I tried with q3_k and q4_0 quantization.

There is no mention of GET_ROWS here, and I would assume that ADD is not an invalid device function. Could it be that the build process for GGML with ROCm has changed somehow (AMD now lists llama.cpp as officially supported by ROCm; maybe there was some cooperation with AMD devs to improve the GGML build process for ROCm) and this change is not yet implemented in stable-diffusion.cpp? Or maybe there were some other changes in the API? I tried with the GGML that is currently in stable-diffusion.cpp and with the one contained in llama.cpp; both give the same result, but llama.cpp itself works just fine, so I think it's not a GGML fault.

Stefan-Olt · Sep 19 '25 15:09

That's the same error I get when I run it with my RX 5700 XT (unsupported by ROCm) instead of my RX 6800.

stduhpf · Sep 19 '25 16:09

That's why I think this is a build issue: GGML is not built correctly for my GPU architecture. But in general there is support, otherwise llama.cpp wouldn't work (llama.cpp also tells me it is using ROCm at runtime, and since the GPU fans also spin up, I assume this is correct).

Btw, I think you can build ROCm with RDNA1 support yourself (I got it working with an RX 5500 some time ago).

Stefan-Olt · Sep 19 '25 16:09

Lol, it was a build error, but not because of incorrect code or CMake files: the documentation was wrong. The variable was renamed from AMDGPU_TARGETS to GPU_TARGETS in the documentation, while the bundled ggml still expects AMDGPU_TARGETS, so the gfx1030 target was silently ignored and no kernels were built for my architecture, which is exactly what "invalid device function" means.
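
For anyone hitting the same thing, a sketch of the corrected configure line for my setup, assuming the only change needed is the variable name (passing both names should be harmless):

cmake .. -G "Ninja" -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DSD_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS=gfx1030 -DGPU_TARGETS=gfx1030 -DCMAKE_POSITION_INDEPENDENT_CODE=ON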

I opened a pull request for a documentation fix.

Stefan-Olt · Sep 19 '25 17:09

The fix has been merged; maybe close this issue?

lastrosade · Nov 06 '25 08:11