ROCm broken in recent versions
I just wanted to compile a more recent version of stable-diffusion.cpp with the ROCm backend (gfx1030). It compiles just fine (it needs PIC enabled, otherwise it won't link for me), but I get this error at runtime. Latest ROCm installed (6.4.3):
ggml_cuda_compute_forward: GET_ROWS failed
ROCm error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at /[sourcedir]/stable-diffusion.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2522
err
/[sourcedir]/stable-diffusion.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:87: ROCm error
[New LWP 110749]
[New LWP 110743]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fb7477107e3 in __GI___wait4 (pid=110751, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
Warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0 0x00007fb7477107e3 in __GI___wait4 (pid=110751, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x000055e8452d8cd6 in ggml_print_backtrace ()
#2 0x000055e8452d8f29 in ggml_abort ()
#3 0x000055e844f48252 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
#4 0x000055e844f4f229 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
#5 0x000055e8452f08ac in ggml_backend_graph_compute ()
#6 0x000055e844daf44c in GGMLRunner::compute(std::function<ggml_cgraph* ()>, int, bool, ggml_tensor**, ggml_context*) ()
#7 0x000055e844daf0e3 in CLIPTextModelRunner::compute(int, ggml_tensor*, int, void*, unsigned long, bool, ggml_tensor**, ggml_context*) ()
#8 0x000055e844dec8cb in FrozenCLIPEmbedderWithCustomWords::get_learned_condition_common(ggml_context*, int, std::vector<int, std::allocator<int> >&, std::vector<float, std::allocator<float> >&, int, int, int, int, bool) ()
#9 0x000055e844deb7b1 in FrozenCLIPEmbedderWithCustomWords::get_learned_condition(ggml_context*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int, int, int, bool) ()
#10 0x000055e844d53d20 in generate_image_internal(sd_ctx_t*, ggml_context*, ggml_tensor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, sd_guidance_params_t, float, int, int, sample_method_t, std::vector<float, std::allocator<float> > const&, long, int, sd_image_t, float, float, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<ggml_tensor*, std::allocator<ggml_tensor*> >, bool, ggml_tensor*, ggml_tensor*) ()
#11 0x000055e844d58df0 in generate_image ()
#12 0x000055e844ce2ed2 in main ()
[Inferior 1 (process 110742) detached]
My previous build from March this year works just fine. I also tried building for Vulkan; that works as well, but it is significantly slower (3.2s per round vs. 6.3s) and needs more VRAM (it requires --vae-on-cpu or it crashes).
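For reference, the Vulkan build was configured along these lines (just a sketch; it assumes the SD_VULKAN CMake option, with otherwise the same setup as my ROCm build further down):
cmake .. -G "Ninja" -DSD_VULKAN=ON -DCMAKE_BUILD_TYPE=Release && cmake --build .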
Is it possible to restore ROCm support?
Best regards Stefan
master-fce6afc is working for me (gfx1102, ROCm 6.4.1). Which model are you running? And with which command line parameters?
I'm running SDXL, for example with this command line: sd -m '/[modeldir]/sd_xl_base_1.0.safetensors' --vae '/[modeldir]/sdxl_vae.safetensors' --t5xxl '/[modeldir]/t5xxl_fp16.safetensors' -H 1024 -W 1024 -p 'a lovely cat holding a sign says \"Stable diffusion XL\"' -v --cfg-scale 4.5
This is output at the start:
Option:
n_threads: 12
mode: img_gen
model_path: /[modeldir]/sd_xl_base_1.0.safetensors
wtype: unspecified
clip_l_path:
clip_g_path:
clip_vision_path:
t5xxl_path: /[modeldir]/t5xxl_fp16.safetensors
diffusion_model_path:
high_noise_diffusion_model_path:
vae_path: /[modeldir]/sdxl_vae.safetensors
taesd_path:
esrgan_path:
control_net_path:
embedding_dir:
stacked_id_embed_dir:
input_id_images_path:
style ratio: 20.00
normalize input image: false
output_path: output.png
init_image_path:
end_image_path:
mask_image_path:
control_image_path:
ref_images_paths:
increase_ref_index: false
offload_params_to_cpu: false
clip_on_cpu: false
control_net_cpu: false
vae_on_cpu: false
diffusion flash attention: false
diffusion Conv2d direct: false
vae_conv_direct: false
control_strength: 0.90
prompt: a lovely cat holding a sign says \"Stable diffusion XL\"
negative_prompt:
clip_skip: -1
width: 1024
height: 1024
sample_params: (txt_cfg: 4.50, img_cfg: 4.50, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: euler_a, sample_steps: 20, eta: 0.00)
high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: default, sample_method: euler_a, sample_steps: -1, eta: 0.00)
moe_boundary: 0.875
flow_shift: inf
strength(img2img): 0.75
rng: cuda
seed: 42
batch_count: 1
vae_tiling: false
upscale_repeats: 1
chroma_use_dit_mask: true
chroma_use_t5_mask: false
chroma_t5_mask_pad: 1
video_frames: 1
fps: 16
System Info:
SSE3 = 1
AVX = 1
AVX2 = 1
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
FMA = 1
NEON = 0
ARM_FMA = 0
F16C = 1
FP16_VA = 0
WASM_SIMD = 0
VSX = 0
[DEBUG] stable-diffusion.cpp:143 - Using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6600, gfx1030 (0x1030), VMM: no, Wave Size: 32
I was running master-b017918; I just retested with master-0ebe6fe, and it's the very same issue.
sd -m '/[modeldir]/sd_xl_base_1.0.safetensors' --vae '/[modeldir]/sdxl_vae.safetensors' --t5xxl '/[modeldir]/t5xxl_fp16.safetensors' -H 1024 -W 1024 -p 'a lovely cat holding a sign says \"Stable diffusion XL\"' -v --cfg-scale 4.5
Not likely to be the issue here, but you don't need T5XXL for SDXL models.
What about the way you are building the binaries? How exactly are you enabling PIC? My command line:
cmake -B build_dir stable-diffusion.cpp -DSD_BUILD_SHARED_LIBS=ON -DSD_HIPBLAS=ON -DGPU_TARGETS=gfx1102 && cmake --build build_dir
This was my build command, just as given in README.md, with the exception that I added -DCMAKE_POSITION_INDEPENDENT_CODE=ON:
cmake .. -G "Ninja" -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DSD_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release -DGPU_TARGETS=gfx1030 -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON -DCMAKE_POSITION_INDEPENDENT_CODE=ON
I just tested your build command (for gfx1030, of course); it also compiles fine (since you build a shared library you don't need to enable position-independent code explicitly, as shared libraries are usually built with it). Unfortunately it gives exactly the same error (I also tried removing t5xxl from my command line, again no difference).
If it works for you with RDNA3/gfx1102, I assume ggml doesn't build correctly for RDNA2/gfx1030 at the moment.
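One quick way to sanity-check that suspicion (a rough sketch; the binary path depends on the build directory, and it assumes the ISA names end up as plain strings in the HIP fat binary, which they usually do) is to grep the resulting binary for gfx identifiers and see whether gfx1030 shows up at all:
strings build/bin/sd | grep -Eo 'gfx[0-9a-f]+' | sort -u
If ROCm's roc-obj-ls tool is installed, it should give a more authoritative listing of the embedded code objects.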
Yeah, I agree it should work. I'd suggest asking around on the llama.cpp project; you'd have much more visibility there (there is a separate ggml repository, but most ggml development happens in llama.cpp).
By the way, you mentioned a build from March that is working; so it was built with an older ROCm version. I'd check if a new build of that old sd.cpp version also works for you.
I have now tested with ROCm 7.0.1: same issue.
I compiled and ran llama.cpp with ROCm: works with no issue at all.
I replaced the stable-diffusion.cpp ggml source with the one from llama.cpp: more warnings during compilation, but it builds, and the resulting binary shows exactly the same error.
So I assume that ggml is either not built correctly or is used differently than in llama.cpp.
Is there any model/quantization etc. I could test llama.cpp with that closely resembles the ggml use of stable-diffusion.cpp?
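(For reference, my llama.cpp test was roughly of this form, with full GPU offload; the model path is just a placeholder: llama-cli -m /[modeldir]/some-model.gguf -ngl 99 -p 'hello')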
There is a known issue that could cause a GET_ROWS error (#837), but not for f16 models... maybe that sd_xl_base_1.0.safetensors of yours has an uncommon quant?
Does it work with --emb-dir ., --type f16 or --clip-on-cpu?
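For example (just a sketch: your earlier command line with one of those options appended; try them one at a time):
sd -m '/[modeldir]/sd_xl_base_1.0.safetensors' --vae '/[modeldir]/sdxl_vae.safetensors' --t5xxl '/[modeldir]/t5xxl_fp16.safetensors' -H 1024 -W 1024 -p 'a lovely cat holding a sign says \"Stable diffusion XL\"' -v --cfg-scale 4.5 --clip-on-cpu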
Unfortunately those options did not change anything. I also tried the Flux model; it works with an old build, but gives a very similar error with a new build:
[ERROR] ggml_extend.hpp:71 - ggml_cuda_compute_forward: ADD failed
[ERROR] ggml_extend.hpp:71 - ROCm error: invalid device function
[ERROR] ggml_extend.hpp:71 - current device: 0, in function ggml_cuda_compute_forward at /[sourcedir]/stable-diffusion.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2522
[ERROR] ggml_extend.hpp:71 - err
/[sourcedir]/stable-diffusion.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:87: ROCm error
I tried with q3_k and q4_0 quantization.
There is no mention of GET_ROWS here, and I would assume that ADD is not an invalid device function. Could it be that the build process for GGML with ROCm has changed somehow (AMD now lists llama.cpp as officially supported by ROCm, so maybe there was some cooperation with AMD devs to improve the GGML build process for ROCm) and this change is not implemented in stable-diffusion.cpp yet? Or maybe some other change in the API? I tried with the GGML that is currently in stable-diffusion.cpp and the one contained in llama.cpp; both give the same result, but llama.cpp itself works just fine, so I think it's not a GGML fault.
That's the same error I get when I run it with my RX 5700 XT (unsupported by ROCm) instead of my RX 6800.
That's why I think this is a build issue: GGML is not built correctly for my GPU architecture. But in general there is support, otherwise llama.cpp wouldn't work (llama.cpp also tells me it is using ROCm at runtime, and since the GPU fans spin up I assume this is correct).
Btw, I think you can build ROCm with RDNA1 support yourself (I got it working with an RX 5500 some time ago).
Lol, it was a build error, but not because of incorrect code or CMake files: the documentation was wrong. The variable was renamed from AMDGPU_TARGETS to GPU_TARGETS in the documentation, resulting in the incorrect build.
I opened a pull request for a documentation fix.
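In case anyone else hits this before the corrected documentation is everywhere: a defensive workaround (just a sketch, using only the two variable names discussed above) is to pass both spellings on the configure line; whichever one the bundled ggml actually reads takes effect, and CMake simply reports the other as a manually-specified variable that was not used:
cmake .. -G "Ninja" -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DSD_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS=gfx1030 -DGPU_TARGETS=gfx1030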
The fix has been merged, maybe close this issue?