Cannot build engine with FP8 FMHA on RTX 5090
On Blackwell I get the following error:
[TensorRT-LLM][ERROR] Assertion failed: FP8 FMHA cannot be enabled except on Ada or Hopper or Blackwell Arch.
Engine build command:
trtllm-build \
--checkpoint_dir ./mllama_checkpoint_fp8 \
--use_fused_mlp enable \
--gpt_attention_plugin auto \
--output_dir ./trt_engines_fp8 \
--max_batch_size 1 \
--max_seq_len 4096 \
--reduce_fusion enable \
--gemm_plugin auto \
--workers 16 \
--max_num_tokens 2048 \
--multiple_profiles disable
Filing this as a bug.
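For reference, the architecture the builder will see can be confirmed up front. A small sketch (the compute_cap query field assumes a reasonably recent nvidia-smi/driver):

nvidia-smi --query-gpu=name,compute_cap --format=csv

On an RTX 5090 this should report 12.0, matching the "Compute capability: (12, 0)" line in the build log further down the thread.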
@indra83, could you tell us which GPU silicon you are working on (e.g., B100/B200/RTX 5090/GB200)? That would help us reproduce.
I'm using an RTX 5090.
@indra83 could you try python3 ./scripts/build_wheel.py --cuda_architectures "100-real" --trt_root /usr/local/tensorrt when you install TRT-LLM?
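(A note on the architecture value, offered as an assumption rather than a verified fix: "100-real" corresponds to the data-center Blackwell parts such as B100/B200, while the RTX 5090 reports compute capability 12.0 in the build log below. If the goal is to compile kernels for that specific chip, the matching value would presumably be

python3 ./scripts/build_wheel.py --cuda_architectures "120-real" --trt_root /usr/local/tensorrt

which also assumes a CUDA toolkit new enough to know about sm_120.)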
I get the same error. Here is the output of the engine build command.
root@174d3b87a87a:/code/tensorrt_llm/examples/llama# trtllm-build --checkpoint_dir ./mllama_checkpoint_fp8 --use_fused_mlp enable --gpt_attention_plugin auto --output_dir ./trt_engines_fp8b --max_batch_size 1 --max_seq_len 4096 --reduce_fusion enable --gemm_plugin auto --workers 16 --max_num_tokens 2048 --multiple_profiles disable
2025-03-07 06:37:06,731 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025022500
[03/07/2025-06:37:06] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set gemm_plugin to auto.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set nccl_plugin to auto.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set lora_plugin to None.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set dora_plugin to False.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set moe_plugin to auto.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set gemm_allreduce_plugin to None.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set context_fmha to True.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set remove_input_padding to True.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set norm_quant_fusion to False.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set reduce_fusion to True.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set user_buffer to False.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set tokens_per_block to 32.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set use_fp8_context_fmha to True.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set fuse_fp4_quant to False.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set multiple_profiles to False.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set paged_state to True.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set streamingllm to False.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set use_fused_mlp to True.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.producer = {'name': 'modelopt', 'version': '0.23.2'}
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.share_embedding_table = False
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.bias = False
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rotary_pct = 1.0
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rank = 0
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.decoder = llama
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rmsnorm = True
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.lm_head_bias = False
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.fc_after_embed = False
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.use_input_layernorm_in_first_layer = True
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.use_last_layernorm = True
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.layer_idx_offset = 0
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.has_partial_lora_mask = False
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.tie_word_embeddings = False
[03/07/2025-06:37:06] [TRT-LLM] [W] Implicitly setting LLaMAConfig.model_type = llama
[03/07/2025-06:37:06] [TRT-LLM] [I] Compute capability: (12, 0)
[03/07/2025-06:37:06] [TRT-LLM] [I] SM count: 170
[03/07/2025-06:37:06] [TRT-LLM] [I] SM clock: 3135 MHz
[03/07/2025-06:37:06] [TRT-LLM] [I] int4 TFLOPS: 0
[03/07/2025-06:37:06] [TRT-LLM] [I] int8 TFLOPS: 0
[03/07/2025-06:37:06] [TRT-LLM] [I] fp8 TFLOPS: 0
[03/07/2025-06:37:06] [TRT-LLM] [I] float16 TFLOPS: 0
[03/07/2025-06:37:06] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[03/07/2025-06:37:06] [TRT-LLM] [I] float32 TFLOPS: 0
[03/07/2025-06:37:06] [TRT-LLM] [I] Total Memory: 31 GiB
[03/07/2025-06:37:06] [TRT-LLM] [I] Memory clock: 14001 MHz
[03/07/2025-06:37:06] [TRT-LLM] [I] Memory bus width: 512
[03/07/2025-06:37:06] [TRT-LLM] [I] Memory bandwidth: 1792 GB/s
[03/07/2025-06:37:06] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[03/07/2025-06:37:06] [TRT-LLM] [I] PCIe link width: 16
[03/07/2025-06:37:06] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[03/07/2025-06:37:06] [TRT-LLM] [I] Set dtype to bfloat16.
[03/07/2025-06:37:06] [TRT-LLM] [I] Set paged_kv_cache to True.
[03/07/2025-06:37:06] [TRT-LLM] [W] Overriding paged_state to False
[03/07/2025-06:37:06] [TRT-LLM] [I] Set paged_state to False.
[03/07/2025-06:37:06] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[03/07/2025-06:37:06] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[03/07/2025-06:37:06] [TRT-LLM] [W] Overriding reduce_fusion to False
[03/07/2025-06:37:06] [TRT-LLM] [I] Set reduce_fusion to False.
[03/07/2025-06:37:14] [TRT] [I] [MemUsageChange] Init CUDA: CPU -24, GPU +0, now: CPU 3965, GPU 590 (MiB)
[03/07/2025-06:37:15] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +3099, GPU +488, now: CPU 7266, GPU 1078 (MiB)
[03/07/2025-06:37:15] [TRT-LLM] [I] Set nccl_plugin to None.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: FP8 FMHA cannot be enabled except on Ada or Hopper or Blackwell Arch. (/code/tensorrt_llm/cpp/tensorrt_llm/common/attentionOp.cpp:2357)
1 0x7009a217b67c tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95
2 0x7009a233c6c4 tensorrt_llm::common::op::AttentionOp::initialize() + 5492
3 0x7007b54d96be tensorrt_llm::plugins::GPTAttentionPlugin* tensorrt_llm::plugins::GPTAttentionPluginCommon::cloneImpl<tensorrt_llm::plugins::GPTAttentionPlugin>() const + 1134
4 0x700ce283f620 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x17eb620) [0x700ce283f620]
5 0x700ce2764596 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1710596) [0x700ce2764596]
6 0x700d09f332e2 /usr/local/lib/python3.12/dist-packages/tensorrt/tensorrt.so(+0xf62e2) [0x700d09f332e2]
7 0x700d09e86ae2 /usr/local/lib/python3.12/dist-packages/tensorrt/tensorrt.so(+0x49ae2) [0x700d09e86ae2]
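A possible reading of the logs in this thread, offered as an assumption rather than a confirmed root cause: the 5090 reports compute capability (12, 0), i.e. SM 120, while the B100 below (which builds fine) reports (10, 0), i.e. SM 100. The FP8 FMHA architecture check in attentionOp.cpp in this 0.18.0.dev build presumably only recognizes the data-center Blackwell SMs, so GeForce/workstation Blackwell falls through to the assertion even though the message says Blackwell is supported. Until a fixed release is available, disabling FP8 context FMHA at build time should sidestep the failing code path, at some cost in attention performance. The sketch below reuses the option names printed in the log above; treat it as a hypothetical workaround, not a validated one:

trtllm-build \
  --checkpoint_dir ./mllama_checkpoint_fp8 \
  --use_fused_mlp enable \
  --gpt_attention_plugin auto \
  --gemm_plugin auto \
  --use_fp8_context_fmha disable \
  --output_dir ./trt_engines_fp8 \
  --max_batch_size 1 \
  --max_seq_len 4096 \
  --max_num_tokens 2048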
@indra83 Let me find an internal 5090 to reproduce on, thanks!
@indra83 I can reproduce this on a 5090 and will follow up. Meanwhile, on B100 the FP8 engine builds and runs fine:
[03/15/2025-08:09:16] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set gemm_plugin to auto.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set nccl_plugin to auto.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set lora_plugin to None.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set dora_plugin to False.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set moe_plugin to auto.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set gemm_allreduce_plugin to None.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set context_fmha to True.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set remove_input_padding to True.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set norm_quant_fusion to False.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set reduce_fusion to False.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set user_buffer to False.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set tokens_per_block to 32.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set use_fp8_context_fmha to True.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set fuse_fp4_quant to False.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set multiple_profiles to False.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set paged_state to True.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set streamingllm to False.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set use_fused_mlp to True.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.producer = {'name': 'modelopt', 'version': '0.25.0'}
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.share_embedding_table = False
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.bias = False
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rotary_pct = 1.0
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rank = 0
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.decoder = llama
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rmsnorm = True
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.lm_head_bias = False
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.fc_after_embed = False
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.use_input_layernorm_in_first_layer = True
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.use_last_layernorm = True
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.layer_idx_offset = 0
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.has_partial_lora_mask = False
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.tie_word_embeddings = False
[03/15/2025-08:09:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.model_type = llama
[03/15/2025-08:09:16] [TRT-LLM] [I] Compute capability: (10, 0)
[03/15/2025-08:09:16] [TRT-LLM] [I] SM count: 148
[03/15/2025-08:09:16] [TRT-LLM] [I] SM clock: 1965 MHz
[03/15/2025-08:09:16] [TRT-LLM] [I] int4 TFLOPS: 0
[03/15/2025-08:09:16] [TRT-LLM] [I] int8 TFLOPS: 0
[03/15/2025-08:09:16] [TRT-LLM] [I] fp8 TFLOPS: 0
[03/15/2025-08:09:16] [TRT-LLM] [I] float16 TFLOPS: 0
[03/15/2025-08:09:16] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[03/15/2025-08:09:16] [TRT-LLM] [I] float32 TFLOPS: 0
[03/15/2025-08:09:16] [TRT-LLM] [I] Total Memory: 179 GiB
[03/15/2025-08:09:16] [TRT-LLM] [I] Memory clock: 4000 MHz
[03/15/2025-08:09:16] [TRT-LLM] [I] Memory bus width: 7680
[03/15/2025-08:09:16] [TRT-LLM] [I] Memory bandwidth: 7680 GB/s
[03/15/2025-08:09:16] [TRT-LLM] [I] PCIe speed: 32000 Mbps
[03/15/2025-08:09:16] [TRT-LLM] [I] PCIe link width: 16
[03/15/2025-08:09:16] [TRT-LLM] [I] PCIe bandwidth: 64 GB/s
[03/15/2025-08:09:16] [TRT-LLM] [I] Set dtype to bfloat16.
[03/15/2025-08:09:16] [TRT-LLM] [I] Set paged_kv_cache to True.
[03/15/2025-08:09:16] [TRT-LLM] [W] Overriding paged_state to False
[03/15/2025-08:09:16] [TRT-LLM] [I] Set paged_state to False.
[03/15/2025-08:09:16] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 8192
[03/15/2025-08:09:16] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[03/15/2025-08:09:16] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[03/15/2025-08:09:22] [TRT] [I] [MemUsageChange] Init CUDA: CPU -12, GPU +0, now: CPU 3847, GPU 623 (MiB)
[03/15/2025-08:09:27] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +3117, GPU +490, now: CPU 7166, GPU 1113 (MiB)
[03/15/2025-08:09:27] [TRT-LLM] [I] Set nccl_plugin to None.
[03/15/2025-08:09:27] [TRT-LLM] [I] Total time of constructing network from module object 11.539793729782104 seconds
[03/15/2025-08:09:27] [TRT-LLM] [I] Total optimization profiles added: 1
[03/15/2025-08:09:27] [TRT-LLM] [I] Total time to initialize the weights in network Unnamed Network 0: 00:00:00
[03/15/2025-08:09:27] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[03/15/2025-08:09:27] [TRT] [W] Unused Input: position_ids
[03/15/2025-08:09:28] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[03/15/2025-08:09:28] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[03/15/2025-08:09:28] [TRT] [I] Compiler backend is used during engine build.
[03/15/2025-08:10:31] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[03/15/2025-08:10:31] [TRT] [I] Detected 17 inputs and 1 output network tensors.
[03/15/2025-08:10:32] [TRT] [I] Total Host Persistent Memory: 67520 bytes
[03/15/2025-08:10:32] [TRT] [I] Total Device Persistent Memory: 0 bytes
[03/15/2025-08:10:32] [TRT] [I] Max Scratch Memory: 754974720 bytes
[03/15/2025-08:10:32] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 178 steps to complete.
[03/15/2025-08:10:32] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 5.58666ms to assign 21 blocks to 178 nodes requiring 1056980992 bytes.
[03/15/2025-08:10:32] [TRT] [I] Total Activation Memory: 1056978944 bytes
[03/15/2025-08:10:32] [TRT] [I] Total Weights Memory: 9085432196 bytes
[03/15/2025-08:10:32] [TRT] [I] Compiler backend is used during engine execution.
[03/15/2025-08:10:32] [TRT] [I] Engine generation completed in 64.6883 seconds.
[03/15/2025-08:10:32] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 8665 MiB
[03/15/2025-08:10:35] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:07
[03/15/2025-08:10:35] [TRT] [I] Serialized 795 bytes of code generator cache.
[03/15/2025-08:10:35] [TRT] [I] Serialized 202507 bytes of compilation cache.
[03/15/2025-08:10:35] [TRT] [I] Serialized 8 timing cache entries
[03/15/2025-08:10:35] [TRT-LLM] [I] Timing cache serialized to model.cache
[03/15/2025-08:10:35] [TRT-LLM] [I] Build phase peak memory: 27214.23 MB, children: 5807.95 MB
[03/15/2025-08:10:35] [TRT-LLM] [I] Serializing engine to /code/engine/llama3-8b-fp8/rank0.engine...
[03/15/2025-08:10:38] [TRT-LLM] [I] Engine serialized. Total time: 00:00:03
[03/15/2025-08:10:38] [TRT-LLM] [I] Total time of building all engines: 00:01:22
Seeing the same thing on an NVIDIA RTX PRO 6000:
[07/30/2025-15:41:14] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/30/2025-15:41:14] [TRT-LLM] [W] Overriding paged_state to False
[07/30/2025-15:41:14] [TRT-LLM] [I] Set paged_state to False.
[07/30/2025-15:41:14] [TRT-LLM] [I] Set dtype to bfloat16.
[07/30/2025-15:41:14] [TRT-LLM] [I] Set paged_state to False.
[07/30/2025-15:41:14] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[07/30/2025-15:41:19] [TRT] [I] [MemUsageChange] Init CUDA: CPU -2, GPU +0, now: CPU 8183, GPU 1407 (MiB)
[07/30/2025-15:41:21] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +711, GPU +484, now: CPU 8693, GPU 1891 (MiB)
[07/30/2025-15:41:21] [TRT-LLM] [I] Set nccl_plugin to None.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: FP8 FMHA cannot be enabled except on Ada or Hopper or Blackwell Arch. (/src/tensorrt_llm/cpp/tensorrt_llm/common/attentionOp.cpp:2109)
1 0x7f2430da5f48 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95
2 0x7f2430f358d4 tensorrt_llm::common::op::AttentionOp::initialize() + 4388
3 0x7f2356c0d510 tensorrt_llm::plugins::GPTAttentionPlugin* tensorrt_llm::plugins::GPTAttentionPluginCommon::cloneImpl<tensorrt_llm::plugins::GPTAttentionPlugin>() const + 1104
4 0x7f255ea3f620 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x17eb620) [0x7f255ea3f620]
5 0x7f255e964596 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1710596) [0x7f255e964596]
6 0x7f25861332e2 /usr/local/lib/python3.12/dist-packages/tensorrt/tensorrt.so(+0xf62e2) [0x7f25861332e2]
7 0x7f2586086ae2 /usr/local/lib/python3.12/dist-packages/tensorrt/tensorrt.so(+0x49ae2) [0x7f2586086ae2]
Edit: Using v0.20.0 works.
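For anyone else hitting this on an SM 120 GPU, upgrading and rebuilding the engine is what worked here. A minimal sketch, assuming the published wheels and NVIDIA's PyPI index cover your platform (adjust to however you normally install TRT-LLM):

pip3 install --upgrade --extra-index-url https://pypi.nvidia.com tensorrt_llm==0.20.0

then rerun the same trtllm-build command against the existing FP8 checkpoint.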
Since the issue is verified as fixed in v0.20.0, I will close this one.