stablediffusion
Issue with get_learned_conditioning while running your Stable Diffusion Version 2
- I followed the steps to install the repo
- Downloaded the model 768-v-ema.ckpt, because the download link on your README page does not provide 768model.ckpt; instead it downloads 768-v-ema.ckpt
- Trying to sample from the SD2.1-v model, I ran the following (as per your README):
python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768
So in my case the command above becomes: python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt ldm/models/768-v-ema.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768
The issue is that I get an error at lines 342 and 345 of txt2img.py:

if opt.scale != 1.0:
    uc = model.get_learned_conditioning(batch_size * [""])  # <-- line 342
if isinstance(prompts, tuple):
    prompts = list(prompts)
c = model.get_learned_conditioning(prompts)  # <-- line 345
I wrapped the code in your get_learned_conditioning function (in ddpm.py) in a try...except block (sketched below), and that is how I was able to capture the following error:
Expected attn_mask dtype to be bool or to match query dtype, but got attn_mask.dtype: float and query.dtype: struct c10::BFloat16 instead.
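For reference, the wrapper looks roughly like this (a minimal sketch around the encode call from the traceback; the real get_learned_conditioning in ddpm.py has more branches):

def get_learned_conditioning(self, c):
    try:
        # the call that fails, per the traceback (ddpm.py line 665)
        c = self.cond_stage_model.encode(c)
    except Exception as e:
        # print and re-raise so the underlying dtype error is visible
        print(f"get_learned_conditioning failed: {e}")
        raise
    return c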
With the above python command it defaults to the DDIM sampler. I am not sure whether it is because of the DDIM sampler that it cannot compute the learned conditioning.
I even set the sampler type explicitly and tried to skip the call to model.get_learned_conditioning(prompts), but then sample generation fails at line 347 because both c and uc are None:

samples, _ = sampler.sample(S=opt.steps,
                            conditioning=c,
                            batch_size=opt.n_samples,
                            shape=shape,
                            verbose=False,
                            unconditional_guidance_scale=opt.scale,
                            unconditional_conditioning=uc,
                            eta=opt.ddim_eta,
                            x_T=start_code)  # <-- line 347
NOTE: I am running on CPU only.
Please take a look at this and respond.
I ran into a similar issue.
Here is the error message when I ran python scripts/txt2img.py --n_samples=1 --prompt "a professional photograph of an astronaut riding a horse" --ckpt ../stable-diffusion-2-1/v2-1_768-ema-pruned.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Creating invisible watermark encoder (see https://github.com/ShieldMnt/invisible-watermark)...
data: 0%| | 0/1 [00:00<?, ?it/s]
Sampling: 0%| | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/garasubo/workspace/ai/stablediffusion/scripts/txt2img.py", line 388, in <module>
main(opt)
File "/home/garasubo/workspace/ai/stablediffusion/scripts/txt2img.py", line 342, in main
uc = model.get_learned_conditioning(batch_size * [""])
File "/home/garasubo/workspace/ai/stablediffusion/ldm/models/diffusion/ddpm.py", line 665, in get_learned_conditioning
c = self.cond_stage_model.encode(c)
File "/home/garasubo/workspace/ai/stablediffusion/ldm/modules/encoders/modules.py", line 236, in encode
return self(text)
File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/garasubo/workspace/ai/stablediffusion/ldm/modules/encoders/modules.py", line 213, in forward
z = self.encode_with_transformer(tokens.to(self.device))
File "/home/garasubo/workspace/ai/stablediffusion/ldm/modules/encoders/modules.py", line 220, in encode_with_transformer
x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
File "/home/garasubo/workspace/ai/stablediffusion/ldm/modules/encoders/modules.py", line 232, in text_transformer_forward
x = r(x, attn_mask=attn_mask)
File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/open_clip/transformer.py", line 154, in forward
x = x + self.ls_1(self.attention(self.ln_1(x), attn_mask=attn_mask))
File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/open_clip/transformer.py", line 151, in attention
return self.attn(x, x, x, need_weights=False, attn_mask=attn_mask)[0]
File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/modules/activation.py", line 1189, in forward
attn_output, attn_output_weights = F.multi_head_attention_forward(
File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/functional.py", line 5334, in multi_head_attention_forward
attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
RuntimeError: Expected attn_mask dtype to be bool or to match query dtype, but got attn_mask.dtype: float and query.dtype: c10::BFloat16 instead.
My environment:
OS: Ubuntu 22.04
GPU: NVIDIA GeForce RTX 2060
CUDA: 11.8
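To see the dtype mismatch in isolation, here is a hypothetical minimal reproducer (plain PyTorch, not from the repo): a float32 attention mask combined with a bfloat16 query triggers the same RuntimeError, and casting the mask to the query dtype avoids it.

import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 16, 64, dtype=torch.bfloat16)
mask = torch.zeros(16, 16)  # float32 mask, like the attn_mask in the traceback above

try:
    # raises: Expected attn_mask dtype to be bool or to match query dtype ...
    F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
except RuntimeError as e:
    print(e)

# casting the mask to the query dtype (or to bool) makes the call succeed
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask.to(q.dtype))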
@pankaja0285 if you know it's not yours, then why are you using it? Develop your own ai on your own if you think you can do it better
Add --device cuda:
python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt ldm/models/768-v-ema.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768 --device cuda
@andrewivan123 Nice, that works for me. Thank you for your kind help!
@pankaja0285 The error is caused by trying to use the BFloat16 datatype during CPU-based inference.
Using full precision, i.e. the Float32 datatype, should fix your problem. Just add this to your command line:
--precision=full
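For context, txt2img.py picks its precision context manager based on that flag, roughly like this (paraphrased sketch, the "precision" variable stands in for opt.precision); with --precision=full the autocast, and therefore the low-precision path, is skipped:

from contextlib import nullcontext
import torch

precision = "full"  # value of --precision; "autocast" enables mixed precision
precision_scope = torch.autocast if precision == "autocast" else nullcontext
with precision_scope("cuda"):
    # with nullcontext, nothing is autocast and tensors stay float32
    pass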
@ChipsSpectre Thanks for the hint! It now throws:
$ python3 scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt models/v2-1_768-ema-pruned.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768 --precision=full
Global seed set to 42
Loading model from models/v2-1_768-ema-pruned.ckpt
Global Step: 110000
LatentDiffusion: Running in v-prediction mode
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Creating invisible watermark encoder (see https://github.com/ShieldMnt/invisible-watermark)...
Sampling:   0%|          | 0/3 [00:00<?, ?it/s]
data:   0%|          | 0/1 [00:00<?, ?it/s]
Data shape for DDIM sampling is (3, 4, 96, 96), eta 0.0
Running DDIM Sampling with 50 timesteps
DDIM Sampler: 0%| | 0/50 [00:00<?, ?it/s]
data: 0%| | 0/1 [00:01<?, ?it/s]
Sampling: 0%| | 0/3 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/mschoenebeck/ai/stablediffusion/scripts/txt2img.py", line 388, in <module>
main(opt)
File "/home/mschoenebeck/ai/stablediffusion/scripts/txt2img.py", line 347, in main
samples, _ = sampler.sample(S=opt.steps,
File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddim.py", line 104, in sample
samples, intermediates = self.ddim_sampling(conditioning, size,
File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddim.py", line 164, in ddim_sampling
outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps,
File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddim.py", line 212, in p_sample_ddim
model_uncond, model_t = self.model.apply_model(x_in, t_in, c_in).chunk(2)
File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddpm.py", line 858, in apply_model
x_recon = self.model(x_noisy, t, **cond)
File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddpm.py", line 1335, in forward
out = self.diffusion_model(x, t, context=cc)
File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mschoenebeck/ai/stablediffusion/ldm/modules/diffusionmodules/openaimodel.py", line 797, in forward
h = module(h, emb, context)
File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mschoenebeck/ai/stablediffusion/ldm/modules/diffusionmodules/openaimodel.py", line 86, in forward
x = layer(x)
File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (c10::Half) and bias type (float) should be the same
The machine I am on is a VPS with a lot of CPU cores and 64 GB of memory, but no GPU. Any ideas how to get it to run without CUDA? Any help is much appreciated.
I was able to get it running on CPU by passing the --precision full flag as well as changing the use_fp16 parameter in v2-inference.yaml from use_fp16: True to use_fp16: False. Specifically, the model.params.unet_config.params.use_fp16 key in the yaml file.
Note: I was using v2-inference.yaml, not v2-inference-v.yaml.
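If you prefer not to edit the yaml by hand, the same change can be scripted with OmegaConf (a hypothetical helper; the output filename is made up):

from omegaconf import OmegaConf

# load the stock config and force the UNet to full precision for CPU inference
cfg = OmegaConf.load("configs/stable-diffusion/v2-inference.yaml")
cfg.model.params.unet_config.params.use_fp16 = False
OmegaConf.save(cfg, "configs/stable-diffusion/v2-inference-cpu.yaml")  # hypothetical filename

Then point txt2img.py at the new file with --config configs/stable-diffusion/v2-inference-cpu.yaml together with --precision full.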