[BUG] Triton Error [CUDA]: invalid argument
Describe the bug
I am hitting RuntimeError: Triton Error [CUDA]: invalid argument while running DeepSpeed inference on a Stable Diffusion model.
To Reproduce
Steps to reproduce the behavior:
- Simple inference script:
from diffusers import StableDiffusionPipeline
import torch
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe._progress_bar_config = {"disable": True}
import deepspeed
with torch.inference_mode():
  deepspeed.init_inference(
        model=getattr(pipe, "model", pipe),  # falls back to the pipeline itself (pipe has no .model attribute)
        # mp_size=1,              # number of GPUs
        dtype=torch.float16,      # dtype of the weights (fp16)
        # replace_method="auto",  # lets DS automatically identify the layers to replace
        replace_with_kernel_inject=True,  # replace modules with the kernel injector
    )
  image = pipe("A Happy CEO").images[0]
- Required packages and their versions:
  deepspeed==0.9.1+fef5aa6e, diffusers==0.13.1, transformers==4.27.3, triton==2.0.0.dev20221202, accelerate==0.16.0, xformers==0.0.16, huggingface_hub==0.12.0, torch==1.13.1
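A one-shot pip install matching these pins might look like the following (a sketch only; deepspeed 0.9.1+fef5aa6e is a source build at commit fef5aa6e and is not installable from PyPI this way):
pip install torch==1.13.1 diffusers==0.13.1 transformers==4.27.3 \
    triton==2.0.0.dev20221202 accelerate==0.16.0 xformers==0.0.16 \
    huggingface_hub==0.12.0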
Expected behavior
The script executes without any issue.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/trlx/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/opt/conda/envs/trlx/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.1+fef5aa6e, fef5aa6e, HEAD
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
System info (please complete the following information):
- OS: Amazon Linux 2 (ID_LIKE: centos rhel fedora)
- GPU count and types: one Tesla T4
- (if applicable) Hugging Face Transformers/Accelerate/etc. versions: listed above
- Python version: 3.10.9
- Docker context: not using Docker; using Conda to maintain environments
Additional context
Error log:
[2023-04-26 05:38:51,186] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.1+fef5aa6e, git-hash=fef5aa6e, git-branch=HEAD
[2023-04-26 05:38:51,188] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
**** found and replaced vae w. <class 'deepspeed.model_implementations.diffusers.vae.DSVAE'>
Using /home/ec2-user/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py310_cu117/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Time to load transformer_inference op: 0.08540964126586914 seconds
[2023-04-26 05:38:51,584] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Attention config: {'layer_id': 0, 'hidden_size': 320, 'intermediate_size': 1280, 'heads': 8, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-12, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': False, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 4096, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False}
Time to load transformer_inference op: 0.002669811248779297 seconds
Loading extension module transformer_inference...
Using /home/ec2-user/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Using /home/ec2-user/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ec2-user/.cache/torch_extensions/py310_cu117/spatial_inference/build.ninja...
Building extension module spatial_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module spatial_inference...
Time to load spatial_inference op: 0.08253216743469238 seconds
**** found and replaced unet w. <class 'deepspeed.model_implementations.diffusers.unet.DSUNet'>
Using /home/ec2-user/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module spatial_inference, skipping build step...
Loading extension module spatial_inference...
Time to load spatial_inference op: 0.002627134323120117 seconds
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in _fwd_kernel                                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 
('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962
222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033
f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, 
torch.float16, 'fp32', torch.float32, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32',
'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (128, 128, 128), (True, True, True, 
(False,), True, True, (True, False), (True, False), (True, False), (False, True), (True, False), (True, False), 
(True, False), (False, True), (True, False), (True, False), (True, False), (False, True), (True, False), (True, 
False), (True, False), (False, True), (False, False), (False, False), (True, False)))
During handling of the above exception, another exception occurred:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>                                                                                      │
│                                                                                                  │
│    7 │   │   #replace_method="auto", # Lets DS autmatically identify the layer to replace        │
│    8 │   │   replace_with_kernel_inject=True, # replace the model with the kernel injector       │
│    9 │   )                                                                                       │
│ ❱ 10   image = pipe("A Happy CEO").images[0]                                                     │
│   11                                                                                             │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/torch/autograd/grad_mode.py:27 in              │
│ decorate_context                                                                                 │
│                                                                                                  │
│    24 │   │   @functools.wraps(func)                                                             │
│    25 │   │   def decorate_context(*args, **kwargs):                                             │
│    26 │   │   │   with self.clone():                                                             │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                                               │
│    28 │   │   return cast(F, decorate_context)                                                   │
│    29 │                                                                                          │
│    30 │   def _wrap_generator(self, func):                                                       │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_ │
│ stable_diffusion.py:643 in __call__                                                              │
│                                                                                                  │
│   640 │   │   │   │   latent_model_input = self.scheduler.scale_model_input(latent_model_input   │
│   641 │   │   │   │                                                                              │
│   642 │   │   │   │   # predict the noise residual                                               │
│ ❱ 643 │   │   │   │   noise_pred = self.unet(                                                    │
│   644 │   │   │   │   │   latent_model_input,                                                    │
│   645 │   │   │   │   │   t,                                                                     │
│   646 │   │   │   │   │   encoder_hidden_states=prompt_embeds,                                   │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in _call_impl  │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/deepspeed/model_implementations/diffusers/unet │
│ .py:44 in forward                                                                                │
│                                                                                                  │
│   41 │   │   │   │   outputs = self._graph_replay(*inputs, **kwargs)                             │
│   42 │   │   │   return outputs                                                                  │
│   43 │   │   else:                                                                               │
│ ❱ 44 │   │   │   return self._forward(*inputs, **kwargs)                                         │
│   45 │                                                                                           │
│   46 │   def _create_cuda_graph(self, *inputs, **kwargs):                                        │
│   47 │   │   # warmup to create the workspace and cublas handle                                  │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/deepspeed/model_implementations/diffusers/unet │
│ .py:73 in _forward                                                                               │
│                                                                                                  │
│   70 │   │   │   │   │   │   │    return_dict,                                                   │
│   71 │   │   │   │   │   │   │    cross_attention_kwargs=cross_attention_kwargs)                 │
│   72 │   │   else:                                                                               │
│ ❱ 73 │   │   │   return self.unet(sample, timestamp, encoder_hidden_states, return_dict)         │
│   74                                                                                             │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in _call_impl  │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py:580 in   │
│ forward                                                                                          │
│                                                                                                  │
│   577 │   │   down_block_res_samples = (sample,)                                                 │
│   578 │   │   for downsample_block in self.down_blocks:                                          │
│   579 │   │   │   if hasattr(downsample_block, "has_cross_attention") and downsample_block.has   │
│ ❱ 580 │   │   │   │   sample, res_samples = downsample_block(                                    │
│   581 │   │   │   │   │   hidden_states=sample,                                                  │
│   582 │   │   │   │   │   temb=emb,                                                              │
│   583 │   │   │   │   │   encoder_hidden_states=encoder_hidden_states,                           │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in _call_impl  │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py:837 in      │
│ forward                                                                                          │
│                                                                                                  │
│    834 │   │   │   │   )[0]                                                                      │
│    835 │   │   │   else:                                                                         │
│    836 │   │   │   │   hidden_states = resnet(hidden_states, temb)                               │
│ ❱  837 │   │   │   │   hidden_states = attn(                                                     │
│    838 │   │   │   │   │   hidden_states,                                                        │
│    839 │   │   │   │   │   encoder_hidden_states=encoder_hidden_states,                          │
│    840 │   │   │   │   │   cross_attention_kwargs=cross_attention_kwargs,                        │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in _call_impl  │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/diffusers/models/transformer_2d.py:265 in      │
│ forward                                                                                          │
│                                                                                                  │
│   262 │   │                                                                                      │
│   263 │   │   # 2. Blocks                                                                        │
│   264 │   │   for block in self.transformer_blocks:                                              │
│ ❱ 265 │   │   │   hidden_states = block(                                                         │
│   266 │   │   │   │   hidden_states,                                                             │
│   267 │   │   │   │   encoder_hidden_states=encoder_hidden_states,                               │
│   268 │   │   │   │   timestep=timestep,                                                         │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in _call_impl  │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/diffusers_ │
│ transformer_block.py:91 in forward                                                               │
│                                                                                                  │
│    88 │   │   │   context = kwargs["encoder_hidden_states"]                                      │
│    89 │   │                                                                                      │
│    90 │   │   out_norm_1 = self.transformer_cuda_module.layer_norm(hidden_states, self.norm1_g   │
│ ❱  91 │   │   out_attn_1 = self.attn_1(out_norm_1)                                               │
│    92 │   │                                                                                      │
│    93 │   │   out_norm_2, out_attn_1 = self.transformer_cuda_module.layer_norm_residual_store_   │
│    94 │   │   │   out_attn_1, self.attn_1_bias, hidden_states, self.norm2_g, self.norm2_b, sel   │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in _call_impl  │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/diffusers_ │
│ attention.py:188 in forward                                                                      │
│                                                                                                  │
│   185 │   │   │   │   │   │   │   │   │   input.size()[1],                                       │
│   186 │   │   │   │   │   │   │   │   │   input.size()[0], DeepSpeedDiffusersAttention.layer_i   │
│   187 │   │   │   │   │   │   │   │   │   0, self.config.max_out_tokens, self.config.min_out_t   │
│ ❱ 188 │   │   output = DeepSpeedDiffusersAttentionFunction.apply(input, context, input_mask, s   │
│   189 │   │   │   │   │   │   │   │   │   │   │   │   │   │      self.attn_qw, self.attn_kw, s   │
│   190 │   │   │   │   │   │   │   │   │   │   │   │   │   │      self.num_attention_heads_per_   │
│   191 │   │   │   │   │   │   │   │   │   │   │   │   │   │      self.hidden_size_per_partitio   │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/diffusers_ │
│ attention.py:88 in forward                                                                       │
│                                                                                                  │
│    85 │   │   │   output = linear_func(context_layer, attn_ow, attn_ob, do_out_bias, False, co   │
│    86 │   │   │   return output                                                                  │
│    87 │   │                                                                                      │
│ ❱  88 │   │   output = selfAttention_fp(input, context, input_mask)                              │
│    89 │   │                                                                                      │
│    90 │   │   return output                                                                      │
│    91                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/diffusers_ │
│ attention.py:64 in selfAttention_fp                                                              │
│                                                                                                  │
│    61 │   │   │   │   qkv_out = linear_func(input, attn_qkvw, attn_qkvb if attn_qkvb is not No   │
│    62 │   │   │   │   │   │   │   │   │     is not None, do_flash_attn, config.heads, False)     │
│    63 │   │   │   │                                                                              │
│ ❱  64 │   │   │   │   context_layer = triton_flash_attn_kernel(qkv_out[0], qkv_out[1], qkv_out   │
│    65 │   │   │   │   │   │   │   │   │   │   │   │   │   │    input.shape[-2] % 128 == 0)       │
│    66 │   │   │   │   context_layer = _transpose_for_context(context_layer[:, :, :, :head_size   │
│    67                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in _call_impl  │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/deepspeed/ops/transformer/inference/triton_ops │
│ .py:121 in forward                                                                               │
│                                                                                                  │
│   118 │   │   tmp = torch.empty((q.shape[0] * q.shape[1], q.shape[2]), device=q.device, dtype=   │
│   119 │   │   num_warps = 4 if Lk <= 64 else 8                                                   │
│   120 │   │                                                                                      │
│ ❱ 121 │   │   _fwd_kernel[grid](                                                                 │
│   122 │   │   │   q,                                                                             │
│   123 │   │   │   k,                                                                             │
│   124 │   │   │   v,                                                                             │
│                                                                                                  │
│ /opt/conda/envs/trlx/lib/python3.10/site-packages/triton/runtime/jit.py:106 in launcher          │
│                                                                                                  │
│   103 │   │   memorizes the grid.                                                                │
│   104 │   │   """                                                                                │
│   105 │   │   def launcher(*args, **kwargs):                                                     │
│ ❱ 106 │   │   │   return self.run(*args, grid=grid, **kwargs)                                    │
│   107 │   │   return launcher                                                                    │
│   108                                                                                            │
│   109                                                                                            │
│ in _fwd_kernel                                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Triton Error [CUDA]: invalid argument
Hi @abhijitpal1247, perhaps my comment in another issue could be of help: https://github.com/microsoft/DeepSpeed-MII/issues/170#issuecomment-1526277566
@CrossNox looks like it. #2942 and #2702 also report a similar issue on a T4, across different versions of DeepSpeed.
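In the meantime, a minimal workaround sketch for T4 users, assuming (as the linked issues suggest, though DeepSpeed does not document this threshold) that the Triton flash-attention path only works on Ampere-class GPUs, i.e. compute capability 8.0 or newer:
import torch
import deepspeed
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Assumption: the Triton flash-attention kernels need compute capability >= 8.0
# (A100/A10 and newer). On older GPUs such as the T4 (sm_75), skip kernel
# injection and run the vanilla fp16 pipeline instead.
major, _ = torch.cuda.get_device_capability()
if major >= 8:
    deepspeed.init_inference(
        model=pipe,
        dtype=torch.float16,
        replace_with_kernel_inject=True,
    )

image = pipe("A Happy CEO").images[0]
This trades the kernel-injection speedup for a run that at least completes on pre-Ampere hardware.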
Run this in a Colab notebook to reproduce #2968 and this one.
Here's some code to speedrun the error AttributeError: 'StableDiffusionPipeline' object has no attribute 'children', which I believe is still not fixed (presumably because the pipeline object is not a torch.nn.Module, so init_inference cannot walk its children):
!pip install diffusers==0.15.0 torch==1.13.1 transformers==4.28.1 triton==2.0.0.dev20221105
%cd /content/sample_data
!git clone https://github.com/microsoft/DeepSpeed.git
%cd /content/sample_data/DeepSpeed/requirements
!pip install -r requirements.txt
%cd /content/sample_data/DeepSpeed
!pip install .
!export PYTHONPATH="$PYTHONPATH:/content/sample_data/DeepSpeed"  # note: ! spawns a subshell in Colab, so this export does not persist
import os, torch, diffusers, deepspeed
pipe = diffusers.StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    revision="fp16",
)
model = deepspeed.init_inference(
    pipe.to("cuda"),
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # replace the model with the kernel injector
)
model("hello from here")
Here's some code to speedrun the error RuntimeError: Triton Error [CUDA]: invalid argument:
!pip install torch
!pip install diffusers==0.14.0 triton==2.0.0.dev20221202
!pip install transformers accelerate
%cd /content/sample_data
!git clone https://github.com/microsoft/DeepSpeed.git
%cd /content/sample_data/DeepSpeed/requirements
!pip install -r requirements.txt
%cd /content/sample_data/DeepSpeed
!pip install .
!export PYTHONPATH="$PYTHONPATH:/content/sample_data/DeepSpeed"  # note: ! spawns a subshell in Colab, so this export does not persist
import torch
import deepspeed
from diffusers import StableDiffusionPipeline
print(deepspeed.__version__)
# load vanilla pipeline
ds_pipeline = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", 
    torch_dtype=torch.float16
).to("cuda")
# init deepspeed inference engine
deepspeed.init_inference(
    model=getattr(ds_pipeline, "model", ds_pipeline),  # falls back to the pipeline itself
    mp_size=1,               # number of GPUs
    dtype=torch.float16,     # dtype of the weights (fp16)
    replace_method="auto",   # lets DS automatically identify the layers to replace
    replace_with_kernel_inject=True,  # replace modules with the kernel injector
)
print("DeepSpeed Inference Engine initialized")
image = ds_pipeline("a photo of an astronaut riding a horse on mars").images[0]
image.show()
Can you test again with the latest DeepSpeed and the updated triton version? If you are still seeing this, can you re-open this issue?
I ran into a similar error. I can confirm I updated both triton (to 2.0.0) and deepspeed (to 0.10.0) but the problem persists. Here is the error message.
miniconda3/envs/py39/lib/python3.9/site-packages/triton_pre_mlir/runtime/autotuner.py:200 in run

  197 │   def run(self, *args, **kwargs):
  198 │   │   for v, heur in self.values.items():
  199 │   │   │   kwargs[v] = heur({**dict(zip(self.arg_names, args)), **kwargs})
❱ 200 │   │   return self.fn.run(*args, **kwargs)
  201
  202
  203 def heuristics(values):

in _fwd_kernel:43
RuntimeError: Triton Error [CUDA]: invalid argument
@hayday100 - for now, can you use the triton version listed in requirements-sd.txt? That version specifically works when running our unit tests.
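For the Colab-style setups above, that pin would look something like this (assuming the file still lives at requirements/requirements-sd.txt in the cloned repo, as the directory layout earlier in this thread suggests):
%cd /content/sample_data/DeepSpeed
!pip install -r requirements/requirements-sd.txt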