
RMSNorm failed with error code no kernel image is available for execution on the device

Open bobbych94 opened this issue 9 months ago • 3 comments

[TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025022500
[TensorRT-LLM][INFO] Refreshed the MPI local session
Model init total -- 387.22s
2025-03-07 11:03:44,669 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-03-07 11:03:44,689 - INFO - flashinfer.jit: Finished loading JIT ops: norm
CUDA Error: no kernel image is available for execution on the device (209) /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/norm.cuh: line 113 at function cudaLaunchKernel((void*)kernel, nblks, nthrs, args, smem_size, stream)
[03/07/2025-11:03:44] [TRT-LLM] [E] Failed to initialize executor on rank 6: RMSNorm failed with error code no kernel image is available for execution on the device
[ERROR    | TRT-LLM            ]: [TRT-LLM] [E] Failed to initialize executor on rank 6: RMSNorm failed with error code no kernel image is available for execution on the device
[03/07/2025-11:03:44] [TRT-LLM] [E] Traceback (most recent call last):

  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 581, in worker_main
    worker: ExecutorBindingsWorker = worker_cls(
                                     ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 126, in __init__
    self.engine = _create_engine()
                  ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 122, in _create_engine
    return create_executor(executor_config=executor_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 106, in create_py_executor
    kv_cache_max_tokens = estimate_max_kv_cache_tokens(model_engine,
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py", line 118, in estimate_max_kv_cache_tokens
    model_engine.forward(req, resource_manager)
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 990, in forward
    return self._forward_step(inputs, gather_ids)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1030, in _forward_step
    logits = self.model.forward(**inputs,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 234, in forward
    hidden_states = self.model(
                    ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_deepseekv3.py", line 379, in forward
    hidden_states, residual = decoder_layer(position_ids=position_ids,
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_deepseekv3.py", line 319, in forward
    hidden_states = self.input_layernorm(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/rms_norm.py", line 32, in forward
    return flashinfer_rmsnorm(hidden_states, self.weight,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 637, in __call__
    return self._opoverload(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 723, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 305, in backend_impl
    result = self._backend_fns[device_type](*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 32, in inner
    return disable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 738, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 337, in wrapped_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/custom_op.py", line 237, in flashinfer_rmsnorm
    return rmsnorm(input, weight, eps)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/flashinfer/norm.py", line 73, in rmsnorm
    _rmsnorm(out, input, weight, eps)
  File "/usr/local/lib/python3.12/dist-packages/flashinfer/norm.py", line 82, in _rmsnorm
    get_norm_module().rmsnorm(out, input, weight, eps, get_cuda_stream(device))
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: RMSNorm failed with error code no kernel image is available for execution on the device

Environment: CUDA 12.8, PyTorch 2.6.0+cu128, GPU: H20
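For triage, here is a minimal sketch that calls the same flashinfer.norm.rmsnorm entry point the traceback ends in, outside of TensorRT-LLM (shapes and dtypes are illustrative, not taken from the failing model):

import torch
import flashinfer

# Same entry point as the failing call chain above (flashinfer/norm.py).
hidden = torch.randn(8, 4096, dtype=torch.float16, device="cuda")
weight = torch.ones(4096, dtype=torch.float16, device="cuda")
out = flashinfer.norm.rmsnorm(hidden, weight, eps=1e-6)
print(out.shape)  # raises the same CUDA error 209 if no matching kernel image exists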

bobbych94 · Mar 07 '25

It seems the H20's CUDA architecture is not recognized; can you specify the environment variable:

export TORCH_CUDA_ARCH_LIST=9.0
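As a sanity check, the device's compute capability can be read from Python (a minimal sketch; the H20 is a Hopper GPU and should report (9, 0), which matches TORCH_CUDA_ARCH_LIST=9.0):

import torch

print(torch.cuda.get_device_name(0))        # e.g. "NVIDIA H20"
print(torch.cuda.get_device_capability(0))  # expected (9, 0) on Hopper
print(torch.cuda.get_arch_list())           # architectures this PyTorch build ships kernels for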

yzh119 · Mar 08 '25

> It seems the H20's CUDA architecture is not recognized; can you specify the environment variable:
>
> export TORCH_CUDA_ARCH_LIST=9.0

That did not work. I have tried setting the environment variable, and additionally tried both the JIT and AOT builds.
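In case it helps narrow this down, a hypothetical diagnostic (the cache location varies by flashinfer version and install method; adjust the path as needed) that lists the SM architectures actually embedded in the compiled extension modules:

import glob
import os
import re
import subprocess

# Inspect each compiled flashinfer extension with cuobjdump and report
# which sm_* architectures its embedded ELF images target.
for so in glob.glob(os.path.expanduser("~/.cache/flashinfer/**/*.so"), recursive=True):
    listing = subprocess.run(["cuobjdump", "--list-elf", so],
                             capture_output=True, text=True).stdout
    print(so, sorted(set(re.findall(r"sm_\d+", listing))))

If sm_90 is missing from the output, the build did not pick up the Hopper target despite the environment variable.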

bobbych94 · Mar 10 '25

I'm getting the same error on GB200, with the TRT-LLM image nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc4 and cuda_12.9.r12.9.

rzhao01 · Oct 02 '25