flashinfer
RMSNorm failed with error code no kernel image is available for execution on the device
[TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025022500
[TensorRT-LLM][INFO] Refreshed the MPI local session
Model init total -- 387.22s
2025-03-07 11:03:44,669 - INFO - flashinfer.jit: Loading JIT ops: norm
2025-03-07 11:03:44,689 - INFO - flashinfer.jit: Finished loading JIT ops: norm
CUDA Error: no kernel image is available for execution on the device (209) /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/norm.cuh: line 113 at function cudaLaunchKernel((void*)kernel, nblks, nthrs, args, smem_size, stream)
[03/07/2025-11:03:44] [TRT-LLM] [E] Failed to initialize executor on rank 6: RMSNorm failed with error code no kernel image is available for execution on the device
[ERROR | TRT-LLM ]: [TRT-LLM] [E] Failed to initialize executor on rank 6: RMSNorm failed with error code no kernel image is available for execution on the device
[03/07/2025-11:03:44] [TRT-LLM] [E] Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 581, in worker_main
worker: ExecutorBindingsWorker = worker_cls(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 126, in __init__
self.engine = _create_engine()
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 122, in _create_engine
return create_executor(executor_config=executor_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 106, in create_py_executor
kv_cache_max_tokens = estimate_max_kv_cache_tokens(model_engine,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/_util.py", line 118, in estimate_max_kv_cache_tokens
model_engine.forward(req, resource_manager)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 990, in forward
return self._forward_step(inputs, gather_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 1030, in _forward_step
logits = self.model.forward(**inputs,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 234, in forward
hidden_states = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_deepseekv3.py", line 379, in forward
hidden_states, residual = decoder_layer(position_ids=position_ids,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_deepseekv3.py", line 319, in forward
hidden_states = self.input_layernorm(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/rms_norm.py", line 32, in forward
return flashinfer_rmsnorm(hidden_states, self.weight,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 637, in __call__
return self._opoverload(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 723, in __call__
return self._op(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 305, in backend_impl
result = self._backend_fns[device_type](*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 32, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 738, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 337, in wrapped_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/custom_op.py", line 237, in flashinfer_rmsnorm
return rmsnorm(input, weight, eps)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/flashinfer/norm.py", line 73, in rmsnorm
_rmsnorm(out, input, weight, eps)
File "/usr/local/lib/python3.12/dist-packages/flashinfer/norm.py", line 82, in _rmsnorm
get_norm_module().rmsnorm(out, input, weight, eps, get_cuda_stream(device))
File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in __call__
return self._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in __call__
return self._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: RMSNorm failed with error code no kernel image is available for execution on the device
CUDA: 12.8, PyTorch: 2.6.0+cu128, GPU: H20
It seems the H20's CUDA architecture is not being recognized. Can you set this environment variable and try again:
export TORCH_CUDA_ARCH_LIST=9.0
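To sanity-check that suggestion, here is a rough sketch that compares a GPU's compute capability against a `TORCH_CUDA_ARCH_LIST`-style string. The helper names `parse_arch_list` and `arch_list_covers` are made up for illustration and are not part of FlashInfer or PyTorch, and the matching rules are a simplified reading of CUDA binary/PTX compatibility. It also assumes the JIT build actually honors the variable, which this thread does not confirm. For reference, H20 (Hopper) reports compute capability 9.0 and GB200 (Blackwell) reports 10.0.

```python
import os


def parse_arch_list(value):
    """Parse a string such as "8.0;9.0+PTX" or "8.0 9.0a" into tuples of
    (major, minor, arch_specific, has_ptx). Hypothetical helper."""
    entries = []
    for token in value.replace(";", " ").split():
        has_ptx = token.upper().endswith("+PTX")
        if has_ptx:
            token = token[:-4]
        # A trailing letter (for example "9.0a") marks an arch-specific target.
        arch_specific = token[-1:].isalpha()
        number = token.rstrip("abcdefghijklmnopqrstuvwxyz")
        major, minor = number.split(".")
        entries.append((int(major), int(minor), arch_specific, has_ptx))
    return entries


def arch_list_covers(capability, value):
    """Return True if a device with the given (major, minor) capability could
    run code built for the arch list. Simplified rules, stated as assumptions:
    exact match, or same major with a lower minor (not arch-specific),
    or PTX from an older non-arch-specific target."""
    dev_major, dev_minor = capability
    for major, minor, arch_specific, has_ptx in parse_arch_list(value):
        if (major, minor) == (dev_major, dev_minor):
            return True
        if not arch_specific and major == dev_major and minor <= dev_minor:
            return True
        if has_ptx and not arch_specific and (major, minor) <= (dev_major, dev_minor):
            return True
    return False


if __name__ == "__main__":
    arch_list = os.environ.get("TORCH_CUDA_ARCH_LIST", "")
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        print("device capability:", cap)
        print("TORCH_CUDA_ARCH_LIST:", arch_list or "(unset)")
        print("covered by arch list:", arch_list_covers(cap, arch_list))
    else:
        print("No CUDA device visible; nothing to check.")
```

If this prints a capability that the arch list does not cover, the cached JIT kernels were probably built for a different target, and the list would need that capability added before rebuilding. If the capability is covered and the error persists, the cause is likely elsewhere.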
That does not work. I tried setting the environment variable, and I also tried both the JIT and AOT builds.
Getting the same error on GB200 with the TRT-LLM image nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc4 (cuda_12.9.r12.9).