
MoE kernel assertion failed when running Qwen2-MoE-57B-A14B with TP enabled

Open WingEdge777 opened this issue 1 year ago • 2 comments

I am using trtllm 0.8.0 (I added MoE support following Llama's implementation). We serve models with trtllm_backend (Docker image triton-trtllm-24.02).

Qwen2-MoE-57B-A14B runs well on a single NVIDIA A800, but when we run it with tp=2 (two A800s or two L40s), here is what we get:

I0709 12:38:32.372394 271 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8101
I0709 12:38:32.372581 271 http_server.cc:4685] Started HTTPService at 0.0.0.0:8100
I0709 12:38:32.424085 271 http_server.cc:320] Started Metrics Service at 0.0.0.0:8102
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: GPU lacks the shared memory resources to run GroupedGEMM kernel (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_kernels_template.h:122)
1       0x7fa66c2614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fa66c5f4273 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0xb0b273) [0x7fa66c5f4273]
3       0x7fa66c61561f void tensorrt_llm::MoeGemmRunner<__half, unsigned char>::runGemm<tensorrt_llm::cutlass_extensions::EpilogueOpDefault>(__half const*, unsigned char const*, __half const*, __half const*, __half*, long*, long, long, long, int, CUstream_st*) + 591
4       0x7fa66dce0f67 tensorrt_llm::kernels::CutlassMoeFCRunner<__half, unsigned char, void>::runMoe(void const*, float const*, void const*, void const*, void const*, tensorrt_llm::ActivationType, void const*, void const*, void const*, int, int, int, int, int, char*, void*, void*, bool const*, int, void*, int*, int*, tensorrt_llm::kernels::MOEParallelismConfig, tensorrt_llm::kernels::MOEExpertScaleNormalizationMode, CUstream_st*) + 1751
5       0x7fa750f1ad9a tensorrt_llm::plugins::MixtureOfExpertsPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 954
6       0x7fa627706ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fa627706ba9]
7       0x7fa6276dc6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fa6276dc6af]
8       0x7fa6276de320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fa6276de320]
9       0x7fa66e145a7b tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 59
10      0x7fa66e147714 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1188
11      0x7fa66e14d724 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3716
12      0x7fa66e11da68 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 56
13      0x7fa66e1227c7 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 247
14      0x7fa76e4b0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fa76e4b0253]
15      0x7fa76e158ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fa76e158ac3]
16      0x7fa76e1ea850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fa76e1ea850]

This error occurs when sending a large number of requests to the Triton server.
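
For reference, below is a minimal diagnostic sketch (my own script, assuming only that PyTorch is available in the container) that prints each visible GPU's compute capability and SM count. The assertion message indicates that the selected GroupedGEMM configuration needs more shared memory than the GPU provides, and the per-SM shared memory budget differs between architectures such as the A800 (sm_80, up to 164 KB per SM) and the L40 (sm_89, up to 100 KB per SM):

# Minimal diagnostic (not part of TensorRT-LLM): print each visible GPU's
# compute capability and SM count. Rough shared-memory-per-SM limits from the
# CUDA programming guide, for comparison:
#   sm_80 (A100/A800): up to 164 KB    sm_86 (A10/A40): up to 100 KB
#   sm_89 (L40/Ada):   up to 100 KB    sm_90 (H100):    up to 228 KB
import torch

for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {p.name}, sm_{p.major}{p.minor}, "
          f"{p.multi_processor_count} SMs, {p.total_memory / 2**30:.0f} GiB")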

Here is my build config for two A800s:

python build.py --hf_model_dir /data/cbs/models/Qwen2-57B-A14B-Instruct \
    --dtype float16 \
    --use_inflight_batching \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --max_batch_size 128 \
    --max_input_len 6144 \
    --max_output_len 1024 \
    --max_num_tokens 204800 \
    --use_weight_only \
    --weight_only_precision int8 \
    --tp_size 2 \
    --world_size 2 \
    --tp_mode 2 \
    --output_dir /data/cbs/engines/Qwen2-57B-A14B-Instruct

Looking for help; it may be a bug in the MoE kernel.

WingEdge777 avatar Jul 09 '24 13:07 WingEdge777

Hi @handoku, could you please try trtllm 0.11.0 with triton-trtllm-24.07 to see if this issue still exists?

QiJune avatar Aug 04 '24 13:08 QiJune

@QiJune It seems that 0.11.0 has some minor bugs for Qwen2-MoE int8 weight-only quantization. First, I got AttributeError: 'PretrainedConfig' object has no attribute 'moe':

[08/04/2024-21:33:19] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 470.161.03. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
0.11.0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:06<00:00,  1.30it/s]
Traceback (most recent call last):
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 340, in <module>
    main()
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 332, in main
    convert_and_save_hf(args)
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 288, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 295, in execute
    f(args, rank)
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 275, in convert_and_save_rank
    qwen = from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1265, in from_hugging_face
    qwen = cls.from_config(pretrained_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 400, in from_config
    return cls(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 361, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 220, in __init__
    transformer = QWenModel(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 167, in __init__
    self.layers = DecoderLayerList(QWenDecoderLayer, config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 289, in __init__
    super().__init__([cls(config, idx) for idx in self.layer_list])
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 289, in <listcomp>
    super().__init__([cls(config, idx) for idx in self.layer_list])
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 68, in __init__
    "moe_config": config.moe,
AttributeError: 'PretrainedConfig' object has no attribute 'moe'
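
For context, the MoE fields that the conversion needs to map into the config's moe attribute come from the Hugging Face config.json. Below is a small sketch to dump them for comparison with what the 0.11.0 conversion actually parses (the model path is a placeholder; the field names follow the HF Qwen2MoE config):

# Sketch: print the MoE-related fields from the HF checkpoint so they can be
# compared against what the 0.11.0 PretrainedConfig ends up containing.
import json
import os

model_dir = "/root/workspace/data_dir/Qwen1.5-MoE-A2.7B-Chat"  # placeholder path
with open(os.path.join(model_dir, "config.json")) as f:
    hf_cfg = json.load(f)

for key in ("architectures", "num_experts", "num_experts_per_tok",
            "moe_intermediate_size", "shared_expert_intermediate_size"):
    print(key, "=", hf_cfg.get(key))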

After borrowing QwenConfig from the main branch, I got this:

root@host:~/workspace/TensorRT-LLM-0.11.0/examples/qwen# python convert_checkpoint.py \
    --model_dir /root/workspace/data_dir/Qwen1.5-MoE-A2.7B-Chat \
    --output_dir /root/workspace/data_dir/agent/trtllm_ckpt/Qwen1.5-MoE-A2.7B-Chat \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --load_model_on_cpu
[08/04/2024-22:00:52] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 470.161.03. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
0.11.0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  5.19it/s]
Weights loaded. Total time: 00:02:52
Traceback (most recent call last):
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 340, in <module>
    main()
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 332, in main
    convert_and_save_hf(args)
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 288, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 295, in execute
    f(args, rank)
  File "/root/workspace/TensorRT-LLM-0.11.0/examples/qwen/convert_checkpoint.py", line 275, in convert_and_save_rank
    qwen = from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1278, in from_hugging_face
    qwen.load(weights)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 439, in load
    raise RuntimeError(
RuntimeError: Required but not provided tensors:{'transformer.layers.16.shared_expert_gate.per_channel_scale', 'transformer.layers.10.mlp.router.per_channel_scale', 'transformer.layers.0.shared_expert_gate.per_channel_scale', 'transformer.layers.6.shared_expert_gate.per_channel_scale', 'transformer.layers.2.shared_expert_gate.per_channel_scale', 'transformer.layers.3.shared_expert_gate.per_channel_scale', 'transformer.layers.21.shared_expert_gate.per_channel_scale', 'transformer.layers.8.mlp.router.per_channel_scale', 'transformer.layers.14.mlp.router.per_channel_scale', 'transformer.layers.20.mlp.router.per_channel_scale', 'transformer.layers.18.shared_expert_gate.per_channel_scale', 'transformer.layers.7.shared_expert_gate.per_channel_scale', 'transformer.layers.19.mlp.router.per_channel_scale', 'transformer.layers.4.shared_expert_gate.per_channel_scale', 'transformer.layers.9.mlp.router.per_channel_scale', 'transformer.layers.21.mlp.router.per_channel_scale', 'transformer.vocab_embedding.per_token_scale', 'transformer.layers.2.mlp.router.per_channel_scale', 'transformer.layers.17.shared_expert_gate.per_channel_scale', 'transformer.layers.20.shared_expert_gate.per_channel_scale', 'transformer.layers.16.mlp.router.per_channel_scale', 'transformer.layers.5.shared_expert_gate.per_channel_scale', 'transformer.layers.13.shared_expert_gate.per_channel_scale', 'transformer.layers.8.shared_expert_gate.per_channel_scale', 'lm_head.per_channel_scale', 'transformer.layers.5.mlp.router.per_channel_scale', 'transformer.layers.12.mlp.router.per_channel_scale', 'transformer.layers.6.mlp.router.per_channel_scale', 'transformer.layers.23.mlp.router.per_channel_scale', 'transformer.layers.15.shared_expert_gate.per_channel_scale', 'transformer.layers.3.mlp.router.per_channel_scale', 'transformer.layers.1.mlp.router.per_channel_scale', 'transformer.layers.9.shared_expert_gate.per_channel_scale', 'transformer.layers.1.shared_expert_gate.per_channel_scale', 'transformer.layers.12.shared_expert_gate.per_channel_scale', 'transformer.layers.7.mlp.router.per_channel_scale', 'transformer.layers.0.mlp.router.per_channel_scale', 'transformer.layers.4.mlp.router.per_channel_scale', 'transformer.layers.19.shared_expert_gate.per_channel_scale', 'transformer.layers.23.shared_expert_gate.per_channel_scale', 'transformer.layers.17.mlp.router.per_channel_scale', 'transformer.layers.10.shared_expert_gate.per_channel_scale', 'transformer.layers.22.shared_expert_gate.per_channel_scale', 'transformer.layers.18.mlp.router.per_channel_scale', 'transformer.layers.15.mlp.router.per_channel_scale', 'transformer.layers.22.mlp.router.per_channel_scale', 'transformer.layers.13.mlp.router.per_channel_scale', 'transformer.layers.11.shared_expert_gate.per_channel_scale', 'transformer.layers.14.shared_expert_gate.per_channel_scale', 'transformer.layers.11.mlp.router.per_channel_scale'}
Exception ignored in: <function PretrainedModel.__del__ at 0x7f8f229e5000>

However, router and shared_expert_gate should already be excluded by this.

I am still working on this; I would appreciate it if anyone has quick-fix advice. Thanks in advance.

Update: after setting exclude_module = ['*lm_head', '*router', '*vocab_embedding', '*shared_expert_gate'], convert_checkpoint and trtllm-build succeeded.
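
For reference, the same exclusion expressed through the checkpoint quantization config, as a sketch only; the QuantConfig/QuantAlgo import paths are assumptions based on the 0.11 example convert scripts. The effect is that the listed modules stay in fp16 instead of being int8 weight-only quantized:

# Sketch only: the exclusion list that made convert_checkpoint/trtllm-build
# succeed, expressed as a QuantConfig. Import paths are assumptions based on
# the 0.11 examples; I set the equivalent list inside the conversion code.
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization import QuantAlgo

quant_config = QuantConfig(
    quant_algo=QuantAlgo.W8A16,   # int8 weight-only
    exclude_modules=[
        "*lm_head",
        "*router",
        "*vocab_embedding",
        "*shared_expert_gate",    # the missing entry for qwen2-moe
    ],
)
print(quant_config)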

WingEdge777 avatar Aug 05 '24 07:08 WingEdge777

@handoku Thanks for reporting this MoE kernel assertion issue with Qwen2 MoE 57B-A14B! And sorry about the very delayed response.

Since you originally reported this issue, there have been significant improvements to MoE support in TensorRT-LLM. Are you still encountering this assertion failure? If so, could you please let us know?

I'll mark this as "waiting for feedback" so it can be automatically marked as stale if no feedback is received within 14 days. Simply leaving any comment will prevent the stale process from happening.

karljang avatar Aug 20 '25 22:08 karljang

I have since moved on from this, so it no longer matters to me. But given how much time has passed and the effort you and the team have put in, I believe the problem has been resolved, so I will close this issue.

WingEdge777 avatar Aug 21 '25 02:08 WingEdge777