
[Bug]: Segmentation fault running on docker with Radeon 5700

Open badarg1 opened this issue 2 years ago • 30 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

I'm trying to run this in a docker container on an Ubuntu 22.04.1 machine with a Radeon 5700 ITX GPU (8 GB), a Ryzen 5 3600 CPU, and 16 GB of RAM.

I followed the instructions from the wiki: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#running-inside-docker
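For reference, the docker invocation from that wiki page is roughly the following (quoting from memory, so the exact flags may differ slightly from the current wiki):

docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx rocm/pytorch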

When I try to start the UI, I get a segmentation fault:

(venv) root@borg:/dockerx/stable-diffusion-webui# TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' python launch.py --precision full --no-half
Python 3.9.5 (default, Nov 23 2021, 15:27:38) 
[GCC 9.3.0]
Commit hash: c9bded39ee05bd0507ccd27d2b674d86d6c0c8e8
Installing requirements for Web UI
Launching Web UI with arguments: --precision full --no-half
No module 'xformers'. Proceeding without it.
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [81761151] from /dockerx/stable-diffusion-webui/models/Stable-diffusion/model.ckpt
Applying cross attention optimization (Doggettx).
Segmentation fault (core dumped)

I upgraded to Python 3.9 following the instructions in the wiki, with the same result.

I suspect it might be related to this other issue, but I created a new one as I'm not sure: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6403

Steps to reproduce the problem

  1. Set up the docker container as instructed in the wiki
  2. Start the UI with the command provided in the wiki

What should have happened?

UI should start up.

Commit where the problem happens

c9bded39ee05bd0507ccd27d2b674d86d6c0c8e8

What platforms do you use to access the UI?

Linux

What browsers do you use to access the UI?

No response

Command Line Arguments

I use `--precision full` and `--no-half` as instructed in the wiki.

I also tried removing them in every combination, with the same result.

Additional information, context and logs

The image id of the docker image I'm using is 614789dfdb38.

Find the dumped core here (2.8 GB): https://drive.google.com/file/d/1n-ulnrYZ1pjkF9xUJgYasCY3qk8rr5vW/view?usp=share_link
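In case it helps with triage: a backtrace can likely be pulled from the core with gdb, pointed at the same interpreter that crashed (a sketch; the core file name on your system may differ):

gdb /dockerx/stable-diffusion-webui/venv/bin/python core
(gdb) bt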

badarg1 avatar Jan 06 '23 11:01 badarg1

Seeing possibly the same thing on a 7900XTX, running directly on Arch Linux (without Docker):

LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [81761151] from /home/wsippel/Applications/stable-diffusion-webui/models/Stable-diffusion/model.ckpt
fish: Job 1, './webui.sh' terminated by signal SIGSEGV (Address boundary error)

wsippel avatar Jan 06 '23 13:01 wsippel

I have a similar issue on Pop!_OS 22.04 with an RX 570 8 GB, an Intel Xeon 1650 v2, and 16 GB of RAM. I used this command to get the docker image: docker pull rocm/pytorch:rocm5.4.1_ubuntu20.04_py3.7_pytorch_1.12.1

(venv) root@pop-os:/dockerx/stable-diffusion-webui# TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' REQS_FILE='requirements.txt' python launch.py
Python 3.7.13 (default, Mar 29 2022, 02:18:16) 
[GCC 7.5.0]
Commit hash: 874b975bf8438b2b5ee6d8540d63b2e2da6b8dbd
Installing requirements for Web UI
Launching Web UI with arguments: 
No module 'xformers'. Proceeding without it.
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [7460a6fa] from /dockerx/stable-diffusion-webui/models/Stable-diffusion/sd-v1-4.ckpt
Applying cross attention optimization (Doggettx).
Segmentation fault (core dumped)

RocketRide9 avatar Jan 07 '23 07:01 RocketRide9

Same issue, except I'm using a Radeon Pro Duo (Polaris) and no docker.

sALTaccount avatar Jan 07 '23 13:01 sALTaccount

Same issue here. No docker. Spotted this in the dmesg output:

[13066.414044] python3[140319]: segfault at 20 ip 00007fbd318d71d2 sp 00007fff7bc3fcd0 error 4 in libamdhip64.so.5.4.50401[7fbd3181f000+351000]
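For anyone checking their own machine, the same signature can be pulled from the kernel log with:

sudo dmesg | grep -i segfault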

irusensei avatar Jan 07 '23 15:01 irusensei

Same issue. RX 6600

JilekJosef avatar Jan 07 '23 20:01 JilekJosef

Oh, actually I just found a solution, at least for myself. I used the HSA_OVERRIDE_GFX_VERSION=10.3.0 fix and ran the command below, and the segmentation fault disappeared:

TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' REQS_FILE='requirements.txt' HSA_OVERRIDE_GFX_VERSION=10.3.0 python launch.py
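If you launch through webui.sh instead, the override can presumably be made persistent by adding a line to webui-user.sh (a sketch, assuming webui.sh sources that file as usual):

export HSA_OVERRIDE_GFX_VERSION=10.3.0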

JilekJosef avatar Jan 07 '23 20:01 JilekJosef

I tried to use HSA_OVERRIDE_GFX_VERSION=10.3.0, but it results in a "Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check" error:

(venv) root@pop-os:/dockerx/stable-diffusion-webui# TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' HSA_OVERRIDE_GFX_VERSION=10.3.0 python launch.py
Python 3.9.5 (default, Nov 23 2021, 15:27:38) 
[GCC 9.3.0]
Commit hash: 874b975bf8438b2b5ee6d8540d63b2e2da6b8dbd
Traceback (most recent call last):
  File "/dockerx/stable-diffusion-webui/launch.py", line 306, in <module>
    prepare_environment()
  File "/dockerx/stable-diffusion-webui/launch.py", line 221, in prepare_environment
    run_python("import torch; assert torch.cuda.is_available(), 'Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check'")
  File "/dockerx/stable-diffusion-webui/launch.py", line 88, in run_python
    return run(f'"{python}" -c "{code}"', desc, errdesc)
  File "/dockerx/stable-diffusion-webui/launch.py", line 64, in run
    raise RuntimeError(message)
RuntimeError: Error running command.
Command: "/dockerx/stable-diffusion-webui/venv/bin/python" -c "import torch; assert torch.cuda.is_available(), 'Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check'"
Error code: 139
stdout: <empty>
stderr: Segmentation fault (core dumped)



Adding --skip-torch-cuda-test to COMMANDLINE_ARGS didn't help.

RocketRide9 avatar Jan 08 '23 09:01 RocketRide9

I was still getting errors on my 6800M even with HSA_OVERRIDE_GFX_VERSION=10.3.0. It basically ends up with:

 terminate called after throwing an instance of 'miopen::Exception'
  what():  /MIOpen/src/hipoc/hipoc_program.cpp:300: Code object build failed. Source: naive_conv.cpp
Aborted (core dumped)

After a bit of googling I found some settings that mitigate the problem, but I have no idea whether these (which I assume disable naive_conv) affect performance or results.

export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_WRW=0

It's working now, but I have to manage memory even on a 12 GB GPU; for example, I can't do more than 3-4 batches.
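For a one-off launch, the same overrides can also be set inline rather than exported (a sketch combining the variables above with the launch command used elsewhere in this thread):

MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0 \
MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD=0 \
MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_WRW=0 \
HSA_OVERRIDE_GFX_VERSION=10.3.0 python launch.py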

irusensei avatar Jan 08 '23 10:01 irusensei

Same issue here with an RX 580 and an RX 480. Neither

export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_FWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_BWD=0
export MIOPEN_DEBUG_CONV_DIRECT_NAIVE_CONV_WRW=0

nor adding HSA_OVERRIDE_GFX_VERSION=10.3.0 helps.

flaep avatar Jan 08 '23 16:01 flaep

This worked for me on Fedora with a 5600G and a 6600 XT, 16 GB of RAM:

export AMDGPU_TARGETS="gfx1010"
export HSA_OVERRIDE_GFX_VERSION=10.3.0
TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' REQS_FILE='requirements.txt' python launch.py --precision full --no-half

I found this here (link), but did not have to do step 2.

connor-corso avatar Jan 10 '23 18:01 connor-corso

Setting HSA_OVERRIDE_GFX_VERSION=10.3.0 seems to help and the web UI now loads, but I still can't make it work.

When the web UI loads it prints this on the console:

(venv) root@borg:/dockerx/stable-diffusion-webui# TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.1.1' python launch.py --precision full --no-half
Python 3.9.5 (default, Nov 23 2021, 15:27:38) 
[GCC 9.3.0]
Commit hash: c9bded39ee05bd0507ccd27d2b674d86d6c0c8e8
Installing requirements for Web UI
Launching Web UI with arguments: --precision full --no-half
No module 'xformers'. Proceeding without it.
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [81761151] from /dockerx/stable-diffusion-webui/models/Stable-diffusion/model.ckpt
Applying cross attention optimization (Doggettx).
Textual inversion embeddings loaded(0): 
Model loaded.
Running on local URL:  http://127.0.0.1:7860

After I try to generate a txt2img using the default settings and a simple sentence as the prompt, it prints this:

To create a public link, set `share=True` in `launch()`.
  0%|                                                    | 0/20 [00:00<?, ?it/s]

That seems to be a progress bar, but it does not progress. The progress bar in the web UI also does not move. If I hide the progress bar in the web UI (by adding the hidden attribute to it), I find it says 4/75 underneath, but that does not advance either. I left the process running for over 30 minutes and still nothing. How long should generation take with the hardware described in the first post?

BTW, there seems to be activity on the CPU, but only on one core:

$ docker stats wonderful_greider --no-stream
CONTAINER ID   NAME                CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O         PIDS
a34014c5911b   wonderful_greider   101.73%   3.772GiB / 15.57GiB   24.23%    0B / 0B   7.28GB / 2.14MB   22

The model I'm using is 4 GB. Maybe it's running on the CPU instead of the GPU?
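One way to check whether torch actually sees the GPU (a quick sketch; ROCm builds of PyTorch report the device through the cuda API) is to run this inside the venv:

python -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'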

There is also activity on the GPU, although the cool temperature and low power consumption make me suspicious:

# rocm-smi


======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU[0]		: sclk current clock frequency not found
================================================================================
GPU  Temp (DieEdge)  AvgPwr  SCLK  MCLK    Fan     Perf  PwrCap  VRAM%  GPU%  
0    56.0c           50.0W   None  500Mhz  34.12%  auto  150.0W   60%   99%   
================================================================================
============================= End of ROCm SMI Log ==============================

The VRAM% oscillates between 59% and 63%.

badarg1 avatar Jan 12 '23 16:01 badarg1

@badarg1 Try using a different docker image. For whatever reason, rocm/pytorch:latest hasn't worked for me on the RX 5700 since rocm5.3; it gets stuck at 0%. Try rocm/pytorch:rocm5.2_ubuntu20.04_py3.7_pytorch_1.11.0_navi21 or any of the official rocm5.2.3 images from rocm/pytorch. Those work perfectly.
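For example, to grab the navi21 image mentioned above:

docker pull rocm/pytorch:rocm5.2_ubuntu20.04_py3.7_pytorch_1.11.0_navi21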

49RpK5dY avatar Jan 12 '23 21:01 49RpK5dY

@49RpK5dY I tried with rocm/pytorch:rocm5.2.3_ubuntu20.04_py3.7_pytorch_1.12.1 and it worked. Thank you.

Maybe these workarounds should be documented in the wiki?

badarg1 avatar Jan 12 '23 22:01 badarg1

Maybe these workarounds should be documented in the wiki?

I opened issue #2655 about this but later closed it, as it seemed I was the only one having this problem. It might be specific to certain hardware. But yeah, adding this to the wiki could be useful.

49RpK5dY avatar Jan 12 '23 23:01 49RpK5dY

I had to use https://github.com/xuhuisheng/rocm-gfx803 for my RX 570.

RocketRide9 avatar Jan 21 '23 19:01 RocketRide9

@VBBr could you explain in more detail what you did to make it work? I have a very similar issue on my RX 570.

Dolidodzik avatar Feb 12 '23 20:02 Dolidodzik

@VBBr could you explain in more detail what you did to make it work? I have a very similar issue on my RX 570.

Update the Python version to 3.8 using this guide (just change 3.9 to 3.8 where needed): https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#updating-python-version-inside-docker and follow it until the webui launch step. Then install the rocblas, pytorch and torchvision packages provided here: https://github.com/xuhuisheng/rocm-gfx803, and start the webui. If a lib error appears, go to the directory where the needed library should be and look for another one with a slightly different version. I don't have access to my machine with the webui installed and don't remember which libs were missing. For example: if the webui asks for some_lib.so.1.2.3 or some_lib.so.1.2 but you only have some_lib.so.1.2.2, rename it to what the webui asks for (probably not the prettiest solution, but It Works™).
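A symlink is a slightly cleaner version of the same renaming trick (the filenames below are placeholders; use whatever the error message actually asks for):

cd /path/to/the/directory/with/the/libs
ln -s some_lib.so.1.2.2 some_lib.so.1.2.3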

RocketRide9 avatar Feb 14 '23 12:02 RocketRide9

Got the same error on Manjaro with an AMD CPU and GPU. https://stackoverflow.com/questions/75591043/got-segmentation-fault-while-launching-stable-diffusion-webui-webui-sh

deba33 avatar Feb 28 '23 11:02 deba33

I was getting the segmentation fault error with the Automatic Installation guide as well, with an RX 6800 on Artix Linux. Since then, I have found another installation method for Arch-based distributions, which involves using PyTorch and Torchvision built with ROCm from the Arch repos.

You can find the written guide here for now, hoping it will be included in this repo too (pointing to #8170)

linwownil avatar Feb 28 '23 11:02 linwownil

Got the same error on a Mac M1 (CPU). It was fine, but suddenly I got this error and it cannot start anymore.

zhouhao27 avatar Feb 28 '23 12:02 zhouhao27

I'm also running into the segmentation fault issue exactly as described, except I am not using docker. Running on Fedora 37. I manually created a Python 3.10 virtual environment because Python 3.11 was the default installation.

5950X CPU + 7900 XTX GPU, 32 GB RAM

achhabra2 avatar Mar 03 '23 17:03 achhabra2

Same error here. Has anyone come up with a solution, or at least a workaround?

Ryzen 5950X, RX 7900 XTX, 64 GB RAM

chirvo avatar Mar 21 '23 07:03 chirvo

@achhabra2 @bigchirv the 7900 segfaults are a separate issue: PyTorch on RDNA3 simply isn't supported in ROCm 5.4. It'll hopefully be fixed with the upcoming 5.5 release.

wsippel avatar Mar 21 '23 09:03 wsippel

Thanks for the heads up!

chirvo avatar Mar 21 '23 10:03 chirvo

Maybe these workarounds should be documented in the wiki?

I opened issue #2655 about this but later closed it, as it seemed I was the only one having this problem. It might be specific to certain hardware. But yeah, adding this to the wiki could be useful.

Hey, is your 5700 working OK now? I have exactly the same problem. Can you tell me the details of your current setup? amdgpu version / docker image / Python version, etc.

echoidcf avatar Apr 14 '23 01:04 echoidcf

The docker image is pytorch:rocm5.2_ubuntu20.04_py3.7_pytorch_1.11.0_navi21, but any older image with rocm5.2 should work. I also updated Python; the wiki instructions for that still work. https://download.pytorch.org/whl/rocm5.1.1 no longer works as it is no longer available. You can install pytorch in the venv with this instead:

pip install torch==1.13.0+rocm5.2 torchvision==0.14.0+rocm5.2 --extra-index-url https://download.pytorch.org/whl/rocm5.2

and launch with:

TORCH_COMMAND='pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.2' python launch.py

As for other launch parameters, I'm using --medvram --no-half --no-half-vae --opt-sub-quad-attention. It generates gray squares without --no-half, and --medvram --opt-sub-quad-attention saves a lot of VRAM.

49RpK5dY avatar Apr 14 '23 09:04 49RpK5dY

Same error here.

RX 590 + Ubuntu 22.04 + amdgpu-install 5.4.5

segfault at 20 ip 00007fd9a88b40a7 sp 00007fff1ed96d20 error 4 in libamdhip64.so[7fd9a8800000+3f3000]

zeze0556 avatar May 24 '23 09:05 zeze0556

Same error here.

RX 590 + Ubuntu 22.04 + amdgpu-install 5.4.5

segfault at 20 ip 00007fd9a88b40a7 sp 00007fff1ed96d20 error 4 in libamdhip64.so[7fd9a8800000+3f3000]

OK, OK, let me end this issue. This is because libamdhip64.so calls an AVX2 instruction, which causes this problem if you are using an old CPU that does not support AVX2. There is NO workaround for this problem EXCEPT replacing your CPU. I tried to recompile libamdhip64.so, but without luck. You can have a try if you insist.
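You can check whether your CPU advertises AVX2 with, e.g.:

grep -m1 -o avx2 /proc/cpuinfo || echo 'no AVX2 support'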

echoidcf avatar May 24 '23 11:05 echoidcf

Same error here. RX 590 + Ubuntu 22.04 + amdgpu-install 5.4.5. segfault at 20 ip 00007fd9a88b40a7 sp 00007fff1ed96d20 error 4 in libamdhip64.so[7fd9a8800000+3f3000]

OK, OK, let me end this issue. This is because libamdhip64.so calls an AVX2 instruction, which causes this problem if you are using an old CPU that does not support AVX2. There is NO workaround for this problem EXCEPT replacing your CPU. I tried to recompile libamdhip64.so, but without luck. You can have a try if you insist.

The original bug was reported on a Ryzen 5 3600, which supports AVX2 (verified using cat /proc/cpuinfo on my own 3600).

I'm getting the same issue on a clean install of Ubuntu Server 22.04.2 with ROCm 5.5.1, building pytorch / torchvision from source.

shelbydavis avatar May 25 '23 14:05 shelbydavis

Same error here. RX 590 + Ubuntu 22.04 + amdgpu-install 5.4.5. segfault at 20 ip 00007fd9a88b40a7 sp 00007fff1ed96d20 error 4 in libamdhip64.so[7fd9a8800000+3f3000]

OK, OK, let me end this issue. This is because libamdhip64.so calls an AVX2 instruction, which causes this problem if you are using an old CPU that does not support AVX2. There is NO workaround for this problem EXCEPT replacing your CPU. I tried to recompile libamdhip64.so, but without luck. You can have a try if you insist.

My CPU (Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz) supports AVX2. I have tested ROCm 5.4.x and 5.5, and both give the same error.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
stepping        : 3
microcode       : 0x28
cpu MHz         : 800.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags       : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_unknown
bogomips        : 7995.18
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

zeze0556 avatar May 26 '23 02:05 zeze0556