
Could not run xformers::efficient_attention_forward_cutlass

wankio opened this issue 1 year ago · 9 comments

venv "C:\Users\GEN32UC\stable-diffusion-webui\venv\Scripts\Python.exe"
Python 3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)]
Commit hash: cbf6dad02d04d98e5a2d5e870777ab99b5796b2d
Installing requirements for Web UI
Launching Web UI with arguments: --listen --always-batch-cond-uncond --precision full --no-half --opt-split-attention --force-enable-xformers
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loading weights [7460a6fa] from C:\Users\GEN32UC\stable-diffusion-webui\models\Stable-diffusion\model.ckpt
Global Step: 470000
Applying xformers cross attention optimization.
Model loaded.
Loading hypernetwork None
Loaded a total of 6 textual inversion embeddings.
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
  0%|                                                                                           | 0/20 [00:01<?, ?it/s]
Error completing request
Arguments: ('cat', '', 'None', 'None', 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, False, 0.7, 0, False, False, None, '', 1, '', 4, '', True, False) {}
Traceback (most recent call last):
  File "C:\Users\GEN32UC\stable-diffusion-webui\modules\ui.py", line 176, in f
    res = list(func(*args, **kwargs))
  File "C:\Users\GEN32UC\stable-diffusion-webui\webui.py", line 68, in f
    res = func(*args, **kwargs)
  File "C:\Users\GEN32UC\stable-diffusion-webui\modules\txt2img.py", line 43, in txt2img
    processed = process_images(p)
  File "C:\Users\GEN32UC\stable-diffusion-webui\modules\processing.py", line 391, in process_images
    samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength)
  File "C:\Users\GEN32UC\stable-diffusion-webui\modules\processing.py", line 518, in sample
    samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning)
  File "C:\Users\GEN32UC\stable-diffusion-webui\modules\sd_samplers.py", line 399, in sample
    samples = self.func(self.model_wrap_cfg, x, extra_args={'cond': conditioning, 'uncond': unconditional_conditioning, 'cond_scale': p.cfg_scale}, disable=False, callback=self.callback_state, **extra_params_kwargs)
  File "C:\Users\GEN32UC\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\GEN32UC\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 80, in sample_euler_ancestral
    denoised = model(x, sigmas[i] * s_in, **extra_args)
  File "C:\Users\GEN32UC\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\GEN32UC\stable-diffusion-webui\modules\sd_samplers.py", line 239, in forward
    x_out = self.inner_model(x_in, sigma_in, cond=cond_in)
  File "C:\Users\GEN32UC\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\GEN32UC\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 112, in forward
    eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
  File "C:\Users\GEN32UC\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\external.py", line 138, in get_eps
    return self.inner_model.apply_model(*args, **kwargs)
  File "C:\Users\GEN32UC\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\diffusion\ddpm.py", line 987, in apply_model
    x_recon = self.model(x_noisy, t, **cond)
  File "C:\Users\GEN32UC\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\GEN32UC\stable-diffusion-webui\repositories\stable-diffusion\ldm\models\diffusion\ddpm.py", line 1410, in forward
    out = self.diffusion_model(x, t, context=cc)
  File "C:\Users\GEN32UC\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\GEN32UC\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\diffusionmodules\openaimodel.py", line 732, in forward
    h = module(h, emb, context)
  File "C:\Users\GEN32UC\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\GEN32UC\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\diffusionmodules\openaimodel.py", line 85, in forward
    x = layer(x, context)
  File "C:\Users\GEN32UC\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\GEN32UC\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\attention.py", line 258, in forward
    x = block(x, context=context)
  File "C:\Users\GEN32UC\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\GEN32UC\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\attention.py", line 209, in forward
    return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
  File "C:\Users\GEN32UC\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\diffusionmodules\util.py", line 114, in checkpoint
    return CheckpointFunction.apply(func, len(inputs), *args)
  File "C:\Users\GEN32UC\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\diffusionmodules\util.py", line 127, in forward
    output_tensors = ctx.run_function(*ctx.input_tensors)
  File "C:\Users\GEN32UC\stable-diffusion-webui\repositories\stable-diffusion\ldm\modules\attention.py", line 212, in _forward
    x = self.attn1(self.norm1(x)) + x
  File "C:\Users\GEN32UC\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\GEN32UC\stable-diffusion-webui\modules\sd_hijack_optimizations.py", line 145, in xformers_attention_forward
    out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None)
  File "c:\users\gen32uc\stable-diffusion-webui\repositories\xformers\xformers\ops.py", line 862, in memory_efficient_attention
    return op.forward_no_grad(
  File "c:\users\gen32uc\stable-diffusion-webui\repositories\xformers\xformers\ops.py", line 305, in forward_no_grad
    return cls.FORWARD_OPERATOR(
  File "C:\Users\GEN32UC\stable-diffusion-webui\venv\lib\site-packages\torch\_ops.py", line 143, in __call__
    return self._op(*args, **kwargs or {})
NotImplementedError: Could not run 'xformers::efficient_attention_forward_cutlass' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'xformers::efficient_attention_forward_cutlass' is only available for these backends: [UNKNOWN_TENSOR_TYPE_ID, QuantizedXPU, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, SparseCPU, SparseCUDA, SparseHIP, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, SparseVE, UNKNOWN_TENSOR_TYPE_ID, NestedTensorCUDA, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID].

BackendSelect: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\PythonFallbackKernel.cpp:133 [backend fallback]
Named: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\ConjugateFallback.cpp:18 [backend fallback]
Negative: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\ZeroTensorFallback.cpp:86 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at C:\Users\circleci\project\functorch\csrc\DynamicLayer.cpp:487 [backend fallback]
ADInplaceOrView: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:35 [backend fallback]
AutogradCPU: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:39 [backend fallback]
AutogradCUDA: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:47 [backend fallback]
AutogradXLA: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:51 [backend fallback]
AutogradMPS: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:59 [backend fallback]
AutogradXPU: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:43 [backend fallback]
AutogradHPU: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:68 [backend fallback]
AutogradLazy: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:55 [backend fallback]
Tracer: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\autograd\TraceTypeManual.cpp:295 [backend fallback]
AutocastCPU: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\autocast_mode.cpp:481 [backend fallback]
Autocast: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\autocast_mode.cpp:324 [backend fallback]
FuncTorchBatched: registered at C:\Users\circleci\project\functorch\csrc\LegacyBatchingRegistrations.cpp:661 [backend fallback]
FuncTorchVmapMode: fallthrough registered at C:\Users\circleci\project\functorch\csrc\VmapModeRegistrations.cpp:24 [backend fallback]
Batched: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\BatchingRegistrations.cpp:1064 [backend fallback]
VmapMode: fallthrough registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at C:\Users\circleci\project\functorch\csrc\TensorWrapper.cpp:187 [backend fallback]
Functionalize: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\FunctionalizeFallbackKernel.cpp:89 [backend fallback]
PythonTLSSnapshot: registered at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\core\PythonFallbackKernel.cpp:137 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at C:\Users\circleci\project\functorch\csrc\DynamicLayer.cpp:483 [backend fallback]

Today I decided to try xformers. After many failed installs it finally installed successfully, but when I press Generate I just get the error above. CUDA is the latest version; before I installed xformers and ran with this flag, everything worked normally.

wankio avatar Oct 09 '22 12:10 wankio

Do you have Cutlass installed?

conda install cutlass

or

pip install cutlass

Either you can try installing Cutlass, or you can uninstall xformers:

pip uninstall xformers
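
If you want to confirm whether your xformers build actually ships CUDA kernels, here is a minimal check (my own sketch, not part of the webui; it assumes a CUDA build of PyTorch and should be run inside the webui's venv):

# A CPU-only xformers build raises the same NotImplementedError as above:
python -c "import torch, xformers.ops; q = torch.randn(1, 16, 8, device='cuda', dtype=torch.float16); xformers.ops.memory_efficient_attention(q, q, q); print('xformers CUDA kernels OK')"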

Thomas-MMJ avatar Oct 09 '22 17:10 Thomas-MMJ

Well, I just deleted the xformers folder and recompiled with TORCH_CUDA_ARCH_LIST set, and it works now. I think installing into the existing folder (even though it had nothing inside) caused the problem.
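
For reference, the clean rebuild described above looks roughly like this (my sketch of the steps; the 8.6 value is an example, substitute your GPU's compute capability):

# Remove the stale checkout and rebuild the CUDA extension from scratch
rm -rf repositories/xformers
git clone https://github.com/facebookresearch/xformers repositories/xformers
cd repositories/xformers
git submodule update --init --recursive
export FORCE_CUDA=1
export TORCH_CUDA_ARCH_LIST=8.6   # compute capability, e.g. 8.6 for RTX 30xx
pip install --verbose --no-deps -e .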

wankio avatar Oct 10 '22 04:10 wankio

If anyone is having trouble with this in Docker, the following helped me (change TORCH_CUDA_ARCH_LIST to your GPU's value; 8.6 is for the RTX 3060):

RUN git clone https://github.com/facebookresearch/xformers/ repositories/xformers && cd repositories/xformers && git submodule update --init --recursive

RUN apt install -y g++
RUN cd repositories/xformers && \
    export FORCE_CUDA="1" && \
    export TORCH_CUDA_ARCH_LIST=8.6 && \
    CUDA_VISIBLE_DEVICES=0 pip install --verbose --no-deps -e .
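
One optional extra step (my addition; it assumes PyTorch is already installed in the image) makes the image build fail early if the extension did not compile:

# Import the freshly built package so a broken build aborts the image build
RUN python -c "import xformers; print('xformers', xformers.__version__)"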

luckyycode avatar Oct 11 '22 00:10 luckyycode

> If anyone is having trouble with this in Docker, the following helped me (change TORCH_CUDA_ARCH_LIST to your GPU's value; 8.6 is for the RTX 3060): […]

I have a 3090 Ti on Ubuntu 20.04.1 and ran `export FORCE_CUDA="1" && export TORCH_CUDA_ARCH_LIST=11.6 && CUDA_VISIBLE_DEVICES=0 pip install --verbose --no-deps -e .`, but I got this error:

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Obtaining file:///home/kai/my_download/stable-diffusion-webui/repositories/xformers
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [10 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/kai/my_download/stable-diffusion-webui/repositories/xformers/setup.py", line 304, in <module>
          ext_modules=get_extensions(),
        File "/home/kai/my_download/stable-diffusion-webui/repositories/xformers/setup.py", line 251, in get_extensions
          ext_modules += get_flash_attention_extensions(
        File "/home/kai/my_download/stable-diffusion-webui/repositories/xformers/setup.py", line 117, in get_flash_attention_extensions
          num = 10 * int(arch[0]) + int(arch[2])
      ValueError: invalid literal for int() with base 10: '.'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

kkimmm avatar Dec 13 '22 03:12 kkimmm

> If anyone is having trouble with this in Docker, the following helped me (change TORCH_CUDA_ARCH_LIST to your GPU's value; 8.6 is for the RTX 3060): […]

If you're on a local Ubuntu or Ubuntu Desktop instance, please see this issue first instead: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4942. I will add details there of some cleanup I had to do after attempting the fix from this PR. cc @kkimmm

jpollard-cs avatar Dec 21 '22 14:12 jpollard-cs

Also, I'm unsure why you'd want to set CUDA_VISIBLE_DEVICES to 0 unless you don't have any NVIDIA GPUs (though you indicated you have an RTX 3060). If I understand correctly, this would result in a build that doesn't leverage your GPU.

jpollard-cs avatar Dec 21 '22 14:12 jpollard-cs

> Also, I'm unsure why you'd want to set CUDA_VISIBLE_DEVICES to 0 unless you don't have any NVIDIA GPUs (though you indicated you have an RTX 3060). If I understand correctly, this would result in a build that doesn't leverage your GPU.

CUDA_VISIBLE_DEVICES is a list of CUDA device ID slots; devices are numbered 0, 1, 2, etc., so 0 selects the first GPU rather than disabling the GPU.

https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/
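
For example (launch.py stands in for whatever entry point you run; the device IDs are illustrative):

CUDA_VISIBLE_DEVICES=0   python launch.py   # first GPU only
CUDA_VISIBLE_DEVICES=1   python launch.py   # second GPU only
CUDA_VISIBLE_DEVICES=0,1 python launch.py   # both GPUs
CUDA_VISIBLE_DEVICES=""  python launch.py   # hide all GPUs (forces CPU)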

Thomas-MMJ avatar Dec 21 '22 17:12 Thomas-MMJ

>> Also, I'm unsure why you'd want to set CUDA_VISIBLE_DEVICES to 0 […]
>
> CUDA_VISIBLE_DEVICES is a list of CUDA device ID slots; devices are numbered 0, 1, 2, etc. […]

Ah okay got it. Looks like I read some misguided information on this. Thanks for the clarification @Thomas-MMJ

jpollard-cs avatar Dec 22 '22 15:12 jpollard-cs

@kkimmm

> I have a 3090 Ti on Ubuntu 20.04.1 and ran `export FORCE_CUDA="1" && export TORCH_CUDA_ARCH_LIST=11.6 && CUDA_VISIBLE_DEVICES=0 pip install --verbose --no-deps -e .`, but it failed with `ValueError: invalid literal for int() with base 10: '.'` […]

I believe the arch list is not meant to be your CUDA version (11.6) but your GPU's compute capability (8.6 for a 3090 Ti); refer to https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list
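
If you're unsure of the right value, PyTorch can report the compute capability of every visible device (a quick one-liner, assuming a CUDA build of torch):

# Prints e.g. [(8, 6)] for an RTX 3090 Ti -- use it as TORCH_CUDA_ARCH_LIST=8.6
python -c "import torch; print([torch.cuda.get_device_capability(i) for i in range(torch.cuda.device_count())])"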

chris-aeviator avatar Jan 06 '23 14:01 chris-aeviator

So how do you fix this?

kopyl avatar Apr 01 '23 06:04 kopyl

Closing as stale.

catboxanon avatar Aug 03 '23 18:08 catboxanon