
Issue with get_learned_conditioning while running your Stable Diffusion Version 2

Open pankaja0285 opened this issue 1 year ago • 7 comments

  • I followed the steps to install the repo.
  • Downloaded the model 768-v-ema.ckpt: the download link on your README page downloads 768-v-ema.ckpt, not 768model.ckpt.
  • To sample from the SD2.1-v model, run the following:

python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt <path/to/768model.ckpt/> --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768

So in my case the command above becomes:

python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt ldm/models/768-v-ema.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768

The issue is that I get an error on lines 342 and 345:

    if opt.scale != 1.0:
        uc = model.get_learned_conditioning(batch_size * [""])  # <-- line 342
    if isinstance(prompts, tuple):
        prompts = list(prompts)
    c = model.get_learned_conditioning(prompts)  # <-- line 345

I wrapped the code in a try/except block inside your get_learned_conditioning function (in ddpm.py), and that is how I captured the error shown below.
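A minimal sketch of what I mean (the real method in ddpm.py has more branches; this is abbreviated to show only where the try/except goes):

    # ldm/models/diffusion/ddpm.py - abbreviated sketch, not the full method
    def get_learned_conditioning(self, c):
        try:
            # the call that actually fails on CPU
            c = self.cond_stage_model.encode(c)
        except RuntimeError as e:
            print(f"get_learned_conditioning failed: {e}")
            raise
        return c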

Expected attn_mask dtype to be bool or to match query dtype, but got attn_mask.dtype: float and query.dtype: struct c10::BFloat16 instead.

With the above Python command it defaults to the DDIM sampler. I am not sure why, with the DDIM sampler, it is not able to compute the learned conditioning.
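For context, the sampler selection in scripts/txt2img.py looks roughly like this (paraphrased from the repo; only the branch structure matters here), which is why DDIM is used when no sampler flag is passed:

    # scripts/txt2img.py - approximate sampler selection logic
    if opt.plms:
        sampler = PLMSSampler(model)
    elif opt.dpm:
        sampler = DPMSolverSampler(model)
    else:
        sampler = DDIMSampler(model)  # default: DDIM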

I even set up a sampler type and tried to skip calling model.get_learned_conditioning(prompts), but then sample generation on line 347 fails because both c and uc are None:

    samples, _ = sampler.sample(S=opt.steps,
                                conditioning=c,
                                batch_size=opt.n_samples,
                                shape=shape,
                                verbose=False,
                                unconditional_guidance_scale=opt.scale,
                                unconditional_conditioning=uc,
                                eta=opt.ddim_eta,
                                x_T=start_code)  # <-- line 347

NOTE: I am running on CPU only.

Please take a look at this and respond.

pankaja0285 avatar Mar 16 '23 18:03 pankaja0285

I ran into a similar issue. Here is the error message when I ran:

python scripts/txt2img.py --n_samples=1 --prompt "a professional photograph of an astronaut riding a horse" --ckpt ../stable-diffusion-2-1/v2-1_768-ema-pruned.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768

Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Creating invisible watermark encoder (see https://github.com/ShieldMnt/invisible-watermark)...
data:   0%|          | 0/1 [00:00<?, ?it/s]
Sampling:   0%|          | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/garasubo/workspace/ai/stablediffusion/scripts/txt2img.py", line 388, in <module>
    main(opt)
  File "/home/garasubo/workspace/ai/stablediffusion/scripts/txt2img.py", line 342, in main
    uc = model.get_learned_conditioning(batch_size * [""])
  File "/home/garasubo/workspace/ai/stablediffusion/ldm/models/diffusion/ddpm.py", line 665, in get_learned_conditioning
    c = self.cond_stage_model.encode(c)
  File "/home/garasubo/workspace/ai/stablediffusion/ldm/modules/encoders/modules.py", line 236, in encode
    return self(text)
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/garasubo/workspace/ai/stablediffusion/ldm/modules/encoders/modules.py", line 213, in forward
    z = self.encode_with_transformer(tokens.to(self.device))
  File "/home/garasubo/workspace/ai/stablediffusion/ldm/modules/encoders/modules.py", line 220, in encode_with_transformer
    x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
  File "/home/garasubo/workspace/ai/stablediffusion/ldm/modules/encoders/modules.py", line 232, in text_transformer_forward
    x = r(x, attn_mask=attn_mask)
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/open_clip/transformer.py", line 154, in forward
    x = x + self.ls_1(self.attention(self.ln_1(x), attn_mask=attn_mask))
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/open_clip/transformer.py", line 151, in attention
    return self.attn(x, x, x, need_weights=False, attn_mask=attn_mask)[0]
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/modules/activation.py", line 1189, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/home/garasubo/.pyenv/versions/3.10.1/lib/python3.10/site-packages/torch/nn/functional.py", line 5334, in multi_head_attention_forward
    attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
RuntimeError: Expected attn_mask dtype to be bool or to match query dtype, but got attn_mask.dtype: float and  query.dtype: c10::BFloat16 instead.

My environment: OS: Ubuntu 22.04, GPU: NVIDIA GeForce RTX 2060, CUDA: 11.8

garasubo avatar Mar 26 '23 14:03 garasubo

@pankaja0285 if you know it's not yours, then why are you using it? Develop your own AI if you think you can do it better.

moefear85 avatar Mar 30 '23 09:03 moefear85

Add --device cuda:

python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt ldm/models/768-v-ema.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768 --device cuda

andrewivan123 avatar Apr 01 '23 13:04 andrewivan123

@andrewivan123 Nice, that works for me. Thank you for your kind help!

garasubo avatar Apr 02 '23 09:04 garasubo

@pankaja0285 The error is caused by using the 16-bit BFloat16 datatype during CPU-based inference: as the message says, the attention mask is float while the query tensor is BFloat16, and PyTorch refuses the mismatch.
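A minimal standalone illustration of that mismatch (illustrative shapes only, not taken from the repo): handing torch.nn.functional.scaled_dot_product_attention a float32 mask together with BFloat16 queries reproduces this exact RuntimeError.

    import torch
    import torch.nn.functional as F

    # BFloat16 tensors, as produced under reduced-precision inference (shapes made up)
    q = k = v = torch.randn(1, 4, 8, dtype=torch.bfloat16)
    # float32 additive mask, like the attn_mask open_clip passes along
    mask = torch.zeros(4, 4, dtype=torch.float32)

    # RuntimeError: Expected attn_mask dtype to be bool or to match query dtype ...
    F.scaled_dot_product_attention(q, k, v, attn_mask=mask)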

Using full precision, i.e. the Float32 datatype, should fix your problem. Just add this to your command line:

--precision=full

ChipsSpectre avatar Jun 04 '23 19:06 ChipsSpectre

@ChipsSpectre Thanks for the hint! It now throws:

$ python3 scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt models/v2-1_768-ema-pruned.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768 --precision=full
Global seed set to 42
Loading model from models/v2-1_768-ema-pruned.ckpt
Global Step: 110000
LatentDiffusion: Running in v-prediction mode
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Creating invisible watermark encoder (see https://github.com/ShieldMnt/invisible-watermark)...
data:   0%|          | 0/1 [00:00<?, ?it/s]
Sampling:   0%|          | 0/3 [00:00<?, ?it/s]
Data shape for DDIM sampling is (3, 4, 96, 96), eta 0.0
Running DDIM Sampling with 50 timesteps
DDIM Sampler:   0%|          | 0/50 [00:00<?, ?it/s]
data:   0%|          | 0/1 [00:01<?, ?it/s]
Sampling:   0%|          | 0/3 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/mschoenebeck/ai/stablediffusion/scripts/txt2img.py", line 388, in <module>
    main(opt)
  File "/home/mschoenebeck/ai/stablediffusion/scripts/txt2img.py", line 347, in main
    samples, _ = sampler.sample(S=opt.steps,
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddim.py", line 104, in sample
    samples, intermediates = self.ddim_sampling(conditioning, size,
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddim.py", line 164, in ddim_sampling
    outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps,
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddim.py", line 212, in p_sample_ddim
    model_uncond, model_t = self.model.apply_model(x_in, t_in, c_in).chunk(2)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddpm.py", line 858, in apply_model
    x_recon = self.model(x_noisy, t, **cond)
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/models/diffusion/ddpm.py", line 1335, in forward
    out = self.diffusion_model(x, t, context=cc)
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/modules/diffusionmodules/openaimodel.py", line 797, in forward
    h = module(h, emb, context)
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mschoenebeck/ai/stablediffusion/ldm/modules/diffusionmodules/openaimodel.py", line 86, in forward
    x = layer(x)
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/mschoenebeck/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (c10::Half) and bias type (float) should be the same

The machine I am on is a VPS with many CPU cores and 64 GB of memory but no GPU. Any ideas how to get it to run without CUDA? Any help is much appreciated.

mschoenebeck avatar Jul 30 '23 22:07 mschoenebeck

I was able to get it running on CPU by passing the --precision full flag as well as changing the use_fp16 parameter in v2-inference.yaml from use_fp16: True to use_fp16: False. Specifically the model.params.unet_config.params.use_fp16 key in the yaml file, as sketched below.

note: I was using v2-inference.yaml not v2-inference-v.yaml.
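For reference, the relevant fragment of the config after that change should look roughly like this (key path as described above, all surrounding keys omitted):

    # configs/stable-diffusion/v2-inference.yaml - abbreviated
    model:
      params:
        unet_config:
          params:
            use_fp16: False  # was True; set to False for CPU-only inference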

dmille avatar Aug 04 '23 00:08 dmille