[Bug] llama3.1-8b SmoothQuant error (using latest version: 5fa9436)
System Info
GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3
Versions:
- TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM.git (5fa9436, latest version)
- tensorrtllm_backend: https://github.com/triton-inference-server/tensorrtllm_backend (a6aa8eb)
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Step 1: convert to a SmoothQuant checkpoint

python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir Meta-Llama-3.1-8B-Instruct \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_token \
    --per_channel \
    --tp_size 1
Step 2: build the engine

trtllm-build --checkpoint_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout-trtengine \
    --remove_input_padding enable \
    --context_fmha enable \
    --gemm_plugin float16 \
    --paged_kv_cache enable \
    --max_num_tokens 65536 \
    --max_batch_size 32 \
    --max_input_len 32768 \
    --gpt_attention_plugin float16
Expected behavior
The engine is built successfully.
Actual behavior
Error log:
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
0.12.0.dev2024072301
Loading checkpoint shards: 100%|██████████| 7/7 [00:04<00:00, 1.49it/s]
/usr/local/lib/python3.10/dist-packages/datasets/load.py:1491: FutureWarning: The repository for ccdv/cnn_dailymail contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ccdv/cnn_dailymail
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Downloading builder script: 100%|██████████| 9.27k/9.27k [00:00<00:00, 44.9MB/s]
Downloading readme: 100%|██████████| 13.9k/13.9k [00:00<00:00, 71.3MB/s]
Downloading data: 100%|██████████| 159M/159M [00:01<00:00, 156MB/s]
Downloading data: 100%|██████████| 376M/376M [00:02<00:00, 163MB/s]
Downloading data: 2.11MB [00:00, 121MB/s]
Downloading data: 46.4MB [00:00, 127MB/s]
Downloading data: 2.43MB [00:00, 125MB/s]
Generating train split: 287113 examples [00:28, 10025.76 examples/s]
Generating validation split: 13368 examples [00:01, 10251.71 examples/s]
Generating test split: 11490 examples [00:01, 9442.46 examples/s]
calibrating model:   0%|          | 0/512 [00:00<?, ?it/s]We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
calibrating model: 100%|██████████████████████████████████████████████████████████████████████████████| 512/512 [00:26<00:00, 19.61it/s]
Weights loaded. Total time: 00:00:04
Total time of converting checkpoints: 00:02:15
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
[07/25/2024-08:48:18] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set gemm_plugin to float16.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set nccl_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set lookup_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set lora_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set moe_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set context_fmha to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set remove_input_padding to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set reduce_fusion to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set enable_xqa to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set tokens_per_block to 64.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set multiple_profiles to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set paged_state to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set streamingllm to False.
[07/25/2024-08:48:18] [TRT-LLM] [W] max_seq_len is scaled to 1048576.0 by rotary scaling 8.0
[07/25/2024-08:48:18] [TRT-LLM] [I] max_seq_len is not specified, using value 1048576.0
[07/25/2024-08:48:18] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[07/25/2024-08:48:18] [TRT-LLM] [W] Specifying a max_num_tokens larger than 16384 is usually not recommended, we do not expect perf gain with that and too large max_num_tokens could possibly exceed the TensorRT tensor volume, causing runtime errors. Got max_num_tokens = 65536
[07/25/2024-08:48:18] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in
Additional notes
No
@kaiyux Can you provide some help with this issue? We need to use the SmoothQuant strategy on an A100 machine. Thank you very much.
Thank you for the report. We can reproduce the issue and will fix it soon.
If you want to fix it locally, you can refer to the self.rotary_embedding_scale_type handling in tensorrt_llm/layers/attention.py (https://github.com/NVIDIA/TensorRT-LLM/blob/a681853d3803ee5893307e812530b5e7004bb6e1/tensorrt_llm/layers/attention.py#L402-L408) and apply the same fix in tensorrt_llm/quantization/layers.py (https://github.com/NVIDIA/TensorRT-LLM/blob/a681853d3803ee5893307e812530b5e7004bb6e1/tensorrt_llm/quantization/layers.py#L1454-L1459).
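For illustration, here is a minimal sketch of that kind of change, intended for the rotary-scaling handling inside SmoothQuantAttention.__init__ in tensorrt_llm/quantization/layers.py. This is not the official patch: the names it relies on (RotaryScalingType, its from_string helper, and the "type"/"rope_type"/"factor" keys of rotary_embedding_scaling) are assumptions taken from the linked attention.py lines, so verify them against your local checkout before applying.

```python
# Sketch only, not the official fix. Mirror the rotary-scaling handling of
# tensorrt_llm/layers/attention.py into SmoothQuantAttention.__init__ in
# tensorrt_llm/quantization/layers.py. RotaryScalingType is assumed to be
# importable from tensorrt_llm.functional (it is already imported in the real file).
from tensorrt_llm.functional import RotaryScalingType

self.rotary_embedding_scale_type = RotaryScalingType.none
self.rotary_embedding_scale = 1.0
if rotary_embedding_scaling is not None:
    # Llama 3.1 checkpoints report the "llama3" rope type, which the old
    # hard-coded linear/dynamic check in SmoothQuantAttention rejects.
    rotary_scaling_type = rotary_embedding_scaling.get(
        "type", rotary_embedding_scaling.get("rope_type"))
    self.rotary_embedding_scale_type = RotaryScalingType.from_string(
        rotary_scaling_type)
    self.rotary_embedding_scale = rotary_embedding_scaling.get("factor", 1.0)
```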
Besides, there are two additional notes:
- You use float16. As far as I know, most Llama 3.1 models are trained in bfloat16, so running inference in float16 carries some accuracy risk.
- You don't set max_seq_len, which can leave you with a very long default max_seq_len and lead to OOM (see the adjusted command sketch below).
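For example, assuming your TensorRT-LLM version accepts --max_seq_len and supports SmoothQuant with bfloat16, the two steps could be adjusted roughly as follows; the concrete values are only illustrative.

```bash
# Sketch only: pick a dtype and sequence limits that match your workload,
# and verify the flags against your trtllm-build version.
python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir Meta-Llama-3.1-8B-Instruct \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --dtype bfloat16 \
    --smoothquant 0.5 --per_token --per_channel --tp_size 1

trtllm-build --checkpoint_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout-trtengine \
    --remove_input_padding enable \
    --context_fmha enable \
    --gemm_plugin auto \
    --paged_kv_cache enable \
    --max_num_tokens 16384 \
    --max_batch_size 32 \
    --max_seq_len 8192 \
    --gpt_attention_plugin auto
```

Setting --max_seq_len explicitly avoids the rotary-scaled default of 1048576 seen in the log, and keeping max_num_tokens at or below 16384 follows the builder's own warning.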
@byshiue Thank you very much, I will verify this fix locally.
Did this fix work? Did not work for me.
> Did this fix work? Did not work for me.

Already tested, it works for me.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
This issue was closed because it has been stalled for 15 days with no activity.