[Bug] llama3.1-8b SmoothQuant error (using latest version: 5fa9436)
System Info
GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3
Versions:
- TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM.git (5fa9436, latest version)
- tensorrtllm_backend: https://github.com/triton-inference-server/tensorrtllm_backend (a6aa8eb)
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Step 1: convert to a SmoothQuant checkpoint

python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir Meta-Llama-3.1-8B-Instruct \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_token \
    --per_channel \
    --tp_size 1
Step 2: build the engine

trtllm-build --checkpoint_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout-trtengine \
    --remove_input_padding enable \
    --context_fmha enable \
    --gemm_plugin float16 \
    --paged_kv_cache enable \
    --max_num_tokens 65536 \
    --max_batch_size 32 \
    --max_input_len 32768 \
    --gpt_attention_plugin float16
Expected behavior
The engine is built successfully.
Actual behavior
Error log:
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
0.12.0.dev2024072301
Loading checkpoint shards: 100%|██████████| 7/7 [00:04<00:00, 1.49it/s]
/usr/local/lib/python3.10/dist-packages/datasets/load.py:1491: FutureWarning: The repository for ccdv/cnn_dailymail contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ccdv/cnn_dailymail
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Downloading builder script: 100%|██████████| 9.27k/9.27k [00:00<00:00, 44.9MB/s]
Downloading readme: 100%|██████████| 13.9k/13.9k [00:00<00:00, 71.3MB/s]
Downloading data: 100%|██████████| 159M/159M [00:01<00:00, 156MB/s]
Downloading data: 100%|██████████| 376M/376M [00:02<00:00, 163MB/s]
Downloading data: 2.11MB [00:00, 121MB/s]
Downloading data: 46.4MB [00:00, 127MB/s]
Downloading data: 2.43MB [00:00, 125MB/s]
Generating train split: 287113 examples [00:28, 10025.76 examples/s]
Generating validation split: 13368 examples [00:01, 10251.71 examples/s]
Generating test split: 11490 examples [00:01, 9442.46 examples/s]
calibrating model:   0%|          | 0/512 [00:00<?, ?it/s]We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
calibrating model: 100%|██████████████████████████████████████████████████████████████████████████████| 512/512 [00:26<00:00, 19.61it/s]
Weights loaded. Total time: 00:00:04
Total time of converting checkpoints: 00:02:15
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
[07/25/2024-08:48:18] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set gemm_plugin to float16.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set nccl_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set lookup_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set lora_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set moe_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set context_fmha to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set remove_input_padding to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set reduce_fusion to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set enable_xqa to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set tokens_per_block to 64.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set multiple_profiles to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set paged_state to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set streamingllm to False.
[07/25/2024-08:48:18] [TRT-LLM] [W] max_seq_len is scaled to 1048576.0 by rotary scaling 8.0
[07/25/2024-08:48:18] [TRT-LLM] [I] max_seq_len is not specified, using value 1048576.0
[07/25/2024-08:48:18] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[07/25/2024-08:48:18] [TRT-LLM] [W] Specifying a max_num_tokens larger than 16384 is usually not recommended, we do not expect perf gain with that and too large max_num_tokens could possibly exceed the TensorRT tensor volume, causing runtime errors. Got max_num_tokens = 65536
[07/25/2024-08:48:18] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in
Additional notes
No
@kaiyux Can you provide some help with this issue? We need to use the SmoothQuant strategy on an A100 machine. Thank you very much.
Thank you for the report. We can reproduce the issue and will fix it soon.
If you want to fix it locally, you can refer to the self.rotary_embedding_scale_type handling in tensorrt_llm/layers/attention.py (https://github.com/NVIDIA/TensorRT-LLM/blob/a681853d3803ee5893307e812530b5e7004bb6e1/tensorrt_llm/layers/attention.py#L402-L408) and apply the same fix in tensorrt_llm/quantization/layers.py (https://github.com/NVIDIA/TensorRT-LLM/blob/a681853d3803ee5893307e812530b5e7004bb6e1/tensorrt_llm/quantization/layers.py#L1454-L1459).
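For illustration, here is a minimal sketch of that kind of change, intended for the rotary-scaling handling inside SmoothQuantAttention.__init__ in tensorrt_llm/quantization/layers.py. This is not the official patch: the names it relies on (RotaryScalingType, its from_string helper, and the "type"/"rope_type"/"factor" keys of rotary_embedding_scaling) are assumptions taken from the linked attention.py lines, so verify them against your local checkout before applying.

```python
# Sketch only, not the official fix. Mirror the rotary-scaling handling of
# tensorrt_llm/layers/attention.py into SmoothQuantAttention.__init__ in
# tensorrt_llm/quantization/layers.py. RotaryScalingType is assumed to be
# importable from tensorrt_llm.functional (it is already imported in the real file).
from tensorrt_llm.functional import RotaryScalingType

self.rotary_embedding_scale_type = RotaryScalingType.none
self.rotary_embedding_scale = 1.0
if rotary_embedding_scaling is not None:
    # Llama 3.1 checkpoints report the "llama3" rope type, which the old
    # hard-coded linear/dynamic check in SmoothQuantAttention rejects.
    rotary_scaling_type = rotary_embedding_scaling.get(
        "type", rotary_embedding_scaling.get("rope_type"))
    self.rotary_embedding_scale_type = RotaryScalingType.from_string(
        rotary_scaling_type)
    self.rotary_embedding_scale = rotary_embedding_scaling.get("factor", 1.0)
```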
Besides, there are two additional notes:
- You use float16. As far as I know, most Llama 3.1 models are trained in bfloat16, so running inference in float16 carries some accuracy risk.
- You don't set max_seq_len, which can leave you with a very long default max_seq_len and lead to OOM (see the adjusted command sketch below).
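For example, assuming your TensorRT-LLM version accepts --max_seq_len and supports SmoothQuant with bfloat16, the two steps could be adjusted roughly as follows; the concrete values are only illustrative.

```bash
# Sketch only: pick a dtype and sequence limits that match your workload,
# and verify the flags against your trtllm-build version.
python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir Meta-Llama-3.1-8B-Instruct \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --dtype bfloat16 \
    --smoothquant 0.5 --per_token --per_channel --tp_size 1

trtllm-build --checkpoint_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout-trtengine \
    --remove_input_padding enable \
    --context_fmha enable \
    --gemm_plugin auto \
    --paged_kv_cache enable \
    --max_num_tokens 16384 \
    --max_batch_size 32 \
    --max_seq_len 8192 \
    --gpt_attention_plugin auto
```

Setting --max_seq_len explicitly avoids the rotary-scaled default of 1048576 seen in the log, and keeping max_num_tokens at or below 16384 follows the builder's own warning.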
@byshiue Thank you very much, I will verify this fix locally.
Did this fix work? Did not work for me.
> Did this fix work? Did not work for me.

Already tested, it works for me.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
This issue was closed because it has been stalled for 15 days with no activity.