
[Bug]: Cannot choose GPU for running

Open timothy-WangS opened this issue 3 years ago • 9 comments

What happened?

Summary

I have 4 GPUs. Although I select a GPU in SETTINGS (e.g. GPU 3), the GPU actually used is still GPU 0.

There are 4 GPUs, i.e. GPU 0, GPU 1, GPU 2, GPU 3. All are Nvidia 3080 10 GB cards. GPUs 0 through 2 have almost no memory left, while GPU 3 has 10 GB available.

  1. Go to 'SETTINGS'-'General'-'GPU', switch to GPU 3, then click 'Save'
  2. Go to 'STABLE DIFFUSION'
  3. Click 'Generate'. The terminal shows 'CUDA error: out of memory', and in nvidia-smi GPU 3 shows no memory usage and zero gpu-util
  4. Alternatively, instead of step 3, return to 'SETTINGS'-'General'-'GPU' after step 2, and you will find the setting has automatically changed back to the default 'GPU 0'

Expected behavior: The program should run smoothly on GPU 3 if I set it so in 'SETTINGS', and GPU 3 should show memory usage and non-zero gpu-util during generation.

Actual behavior: GPU 3 has no memory usage and zero gpu-util, and a CUDA error occurs.
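A possible workaround while the in-app GPU setting is not honored (editor's note, not confirmed by the project docs): `CUDA_VISIBLE_DEVICES` is a standard CUDA mechanism that restricts which physical devices a process can see, so the selected card then appears to PyTorch as `cuda:0`. The launch command is the one this webui uses elsewhere in the thread.

```shell
# Expose only physical GPU 3 to this shell and its children; PyTorch will
# then see that card as cuda:0 and use it by default.
export CUDA_VISIBLE_DEVICES=3

# Launch the webui as usual, e.g.:
#   python scripts/webui_streamlit.py

# Verify the variable is visible to child processes:
python3 -c 'import os; print(os.environ.get("CUDA_VISIBLE_DEVICES"))'
```

This sidesteps the settings page entirely, which is useful for confirming whether the remaining errors come from device selection inside the code.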

Debug info:

  • Streamlit version: v1.13.0
  • Python version: 3.8 (using Conda)
  • OS version: Ubuntu 20.04
  • Browser version: Firefox 103.0.2

Version

0.0.1 (Default)

What browsers are you seeing the problem on?

Firefox

Where are you running the webui?

Linux

Custom settings

No response

Relevant log output

Loading model from models/ldm/stable-diffusion-v1/model.ckpt
Global Step: 470000
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
2022-10-09 09:01:29.970 CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1., retrying in 0 seconds...
2022-10-09 09:01:30.386 st.session_state has no attribute "defaults". Did you forget to initialize it? More info: https://docs.streamlit.io/library/advanced-features/session-state#initialization, retrying in 0 seconds...
2022-10-09 09:01:30.387 st.session_state has no attribute "defaults". Did you forget to initialize it? More info: https://docs.streamlit.io/library/advanced-features/session-state#initialization, retrying in 0 seconds...

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

timothy-WangS avatar Oct 09 '22 01:10 timothy-WangS

There are some places where the specific CUDA device is not selected; can you test this branch, please?

I do not have multiple GPUs, so it is not easy to test.

hlky avatar Oct 09 '22 02:10 hlky

"find the setting automatically changes back to default 'GPU 0'"

@ZeroCool940711 can you investigate this part further? This could just be an issue with the setting being overwritten.

@timothy-WangS could you attach your webui_streamlit.yaml and userconfig_streamlit.yaml from configs/webui/?

hlky avatar Oct 09 '22 02:10 hlky

"find the setting automatically changes back to default 'GPU 0'"

@ZeroCool940711 can you investigate this part further? This could just be an issue with the setting being overwritten.

@timothy-WangS could you attach your webui_streamlit.yaml and userconfig_streamlit.yaml from configs/webui/?

Sure, here are those files. Since GitHub does not allow attaching .yaml files, I simply changed the filename extension to .txt. I also found userconfig_streamlit.yaml in this folder, and its parameter is set to 'gpu: 3'. However, the GPU parameters in webui.yaml and webui_streamlit.yaml are unchanged.

webui_streamlit.txt webui.txt userconfig_streamlit.txt

timothy-WangS avatar Oct 09 '22 02:10 timothy-WangS

There are some places where the specific CUDA device is not selected; can you test this branch, please?

I do not have multiple GPUs, so it is not easy to test.

Using this branch, a Streamlit error occurs:

Traceback (most recent call last):
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 562, in _run_script
    exec(code, module.__dict__)
  File "/data2/glanny/StableDiffusionWeb/stable-diffusion-webui/scripts/webui_streamlit.py", line 177, in <module>
    layout()
  File "/data2/glanny/StableDiffusionWeb/stable-diffusion-webui/scripts/webui_streamlit.py", line 174, in layout
    layout()
  File "scripts/Settings.py", line 761, in layout
    st.session_state["defaults"].txt2vid.beta_start.format = st.number_input("Default txt2vid Beta Start Format", value=st.session_state['defaults'].txt2vid.beta_start.format,
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/streamlit/runtime/metrics_util.py", line 231, in wrap
    result = callable(*args, **kwargs)
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/streamlit/elements/number_input.py", line 156, in number_input
    return self._number_input(
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/streamlit/elements/number_input.py", line 210, in _number_input
    raise StreamlitAPIException(
streamlit.errors.StreamlitAPIException: All numerical arguments must be of the same type. value has str type. min_value has NoneType type. max_value has NoneType type. step has NoneType type.
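Editor's note on the exception above: `st.number_input` requires a numeric `value`, but the `beta_start.format` config field holds a format *string* such as `"%.5f"`. The sketch below uses a hypothetical stand-in for Streamlit's type check (Streamlit itself is not imported; `number_input` here only mimics the failing validation) to reproduce the failure mode; a string-valued widget such as `st.text_input` would accept the value.

```python
# Hypothetical stand-in for Streamlit's validation in st.number_input:
# all numerical arguments must share a numeric type, so a str value raises.
def number_input(label, value):
    if not isinstance(value, (int, float)) or isinstance(value, bool):
        raise TypeError(f"{label!r}: value has {type(value).__name__} type")
    return value

beta_start_format = "%.5f"  # the kind of value the config field actually stores

try:
    number_input("Default txt2vid Beta Start Format", beta_start_format)
except TypeError as exc:
    print("reproduced:", exc)

# A string-valued setting belongs in a text widget instead, e.g.
# st.text_input("Default txt2vid Beta Start Format", value=beta_start_format)
```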

timothy-WangS avatar Oct 09 '22 03:10 timothy-WangS

This occurred after a recent change to the config file on dev branch. Just delete the userconfig_streamlit.yaml and it will be recreated.

hlky avatar Oct 09 '22 04:10 hlky

That does not work well on my machine. Instead, I commented out these two lines:

st.session_state["defaults"].txt2vid.beta_start.format = st.number_input("Default txt2vid Beta Start Format", value=st.session_state['defaults'].txt2vid.beta_start.format, help="Set the default Beta Start Format. Default is: %.5f")

and

st.session_state["defaults"].txt2vid.beta_end.format = st.number_input("Default txt2vid Beta End Format", value=st.session_state['defaults'].txt2vid.beta_start.format, help="Set the default Beta Start Format. Default is: %.5f")

'SETTINGS' can run now, though it seems to be missing some parts. Would that be important?

timothy-WangS avatar Oct 09 '22 06:10 timothy-WangS

There are some places where the specific CUDA device is not selected; can you test this branch, please?

I do not have multiple GPUs, so it is not easy to test.

Thanks for this branch! Some bugs are fixed, but some still remain.

Here is what I found in Text-to-Image when using GPU 0:

  1. When clicking the "Generate" button for the first time, nothing is output. It seems the network is training/fine-tuning/loading (not sure whether this is normal)
  2. When clicking the "Generate" button again, the picture is generated

When choosing GPU 3 in 'SETTINGS', the first click behaves the same way. In nvidia-smi it is clear that, after clicking the 'Generate' button, GPU 3 has memory usage and non-zero gpu-util. However, some things are unexpected:

  1. 'SETTINGS'-'General'-'GPU' still shows 'GPU 0'. I think it might be a GUI issue.
  2. GPU 0 still has some memory usage (1481 MB, using the Stable Diffusion v1.4 model)
  3. When clicking the "Generate" button a second time (which should produce an image), an error occurs:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument index in method wrapper__index_select)

Here is the terminal output:

Traceback (most recent call last):
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 562, in _run_script
    exec(code, module.__dict__)
  File "/data2/glanny/StableDiffusionWeb/stable-diffusion-webui/scripts/webui_streamlit.py", line 178, in <module>
    layout()
  File "/data2/glanny/StableDiffusionWeb/stable-diffusion-webui/scripts/webui_streamlit.py", line 142, in layout
    layout()
  File "scripts/txt2img.py", line 418, in layout
    output_images, seeds, info, stats = txt2img(prompt, st.session_state.sampling_steps, sampler_name, st.session_state["batch_count"], st.session_state["batch_size"],
  File "scripts/txt2img.py", line 136, in txt2img
    output_images, seed, info, stats = process_images(
  File "scripts/sd_utils.py", line 2154, in process_images
    uc = (server_state["model"] if not st.session_state['defaults'].general.optimized else server_state["modelCS"]).get_learned_conditioning(len(prompts) * [negprompt])
  File "/data2/glanny/StableDiffusionWeb/stable-diffusion-webui/ldm/models/diffusion/ddpm.py", line 554, in get_learned_conditioning
    c = self.cond_stage_model.encode(c)
  File "/data2/glanny/StableDiffusionWeb/stable-diffusion-webui/ldm/modules/encoders/modules.py", line 166, in encode
    return self(text)
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data2/glanny/StableDiffusionWeb/stable-diffusion-webui/ldm/modules/encoders/modules.py", line 160, in forward
    outputs = self.transformer(input_ids=tokens)
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py", line 722, in forward
    return self.text_model(
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py", line 632, in forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py", line 165, in forward
    inputs_embeds = self.token_embedding(input_ids)
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/home/ufo/anaconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
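Editor's note: a small diagnostic for this class of error is to collect the device of every model parameter and fail fast if more than one appears. The helper below is a sketch, not project code; with a real torch model you would pass `((n, str(p.device)) for n, p in model.named_parameters())`. Plain tuples stand in for torch parameters so the idea is runnable anywhere.

```python
# Sketch: verify that every (name, device) pair reports the same device,
# which would catch a cuda:3 / cuda:0 split before inference starts.
def single_device(named_devices):
    devices = {}
    for name, dev in named_devices:
        devices.setdefault(dev, name)  # remember one example name per device
    if len(devices) > 1:
        detail = ", ".join(f"{d} (e.g. {n})" for d, n in sorted(devices.items()))
        raise RuntimeError(f"model spans multiple devices: {detail}")
    return next(iter(devices))

# Stand-in data mirroring the failure in this thread: the diffusion model was
# moved to cuda:3 but the CLIP text encoder's embedding stayed on cuda:0.
ok = [("a.weight", "cuda:3"), ("b.weight", "cuda:3")]
mixed = [("model.diffusion.weight", "cuda:3"),
         ("cond_stage_model.token_embedding.weight", "cuda:0")]

print(single_device(ok))  # cuda:3
try:
    single_device(mixed)
except RuntimeError as exc:
    print("caught:", exc)
```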

timothy-WangS avatar Oct 09 '22 23:10 timothy-WangS

It looks like the problem resides in a number of .cuda() and device="cuda" calls throughout the code. These send the model(s) to the first GPU, while .to(device=correct_gpu) sometimes sends them to the right one. Also, sd_utils.py takes the GPU number from configs/webui/webui_streamlit.yaml, where the default GPU number is 0.
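Editor's note: to illustrate this diagnosis without needing a GPU, here is a toy tensor class (hypothetical, not torch) mimicking the relevant behavior: in PyTorch, `.cuda()` with no argument targets the *current* device, which defaults to device 0 unless `torch.cuda.set_device()` was called, while `.to("cuda:<n>")` targets an explicit index that can be read from the config.

```python
# Toy stand-in for a torch tensor's device-movement methods (not real torch).
class FakeTensor:
    def __init__(self, device="cpu"):
        self.device = device

    def cuda(self, index=None):
        # Like torch: no index means "current device", which is 0 unless
        # torch.cuda.set_device() was called earlier in the process.
        return FakeTensor(f"cuda:{0 if index is None else index}")

    def to(self, device):
        return FakeTensor(device)

configured_gpu = 3  # e.g. the 'gpu: 3' entry from userconfig_streamlit.yaml

t = FakeTensor()
print(t.cuda().device)                        # cuda:0  <- hard-coded .cuda() bug
print(t.to(f"cuda:{configured_gpu}").device)  # cuda:3  <- explicit device fix
```

This is why the models split across cuda:0 and cuda:3: some call sites use the configured device and others fall back to device 0.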

andrea-gatto avatar Oct 12 '22 11:10 andrea-gatto

Experiencing the same issue using an NVIDIA 1650 Super as video out and Tesla M40s for processing. The M40s are not detected, and the 1650 is treated as the processing unit.

Tom-Neverwinter avatar Nov 07 '22 22:11 Tom-Neverwinter