
[Bug] llama3.1-8b smoothquant error (using latest version: 5fa9436)

Open fan-niu opened this issue 1 year ago • 5 comments

System Info

GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3
Versions:
  • https://github.com/NVIDIA/TensorRT-LLM.git (5fa9436, latest version)
  • https://github.com/triton-inference-server/tensorrtllm_backend (a6aa8eb)

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Step 1: convert to a SmoothQuant checkpoint:

```
python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir Meta-Llama-3.1-8B-Instruct \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_token \
    --per_channel \
    --tp_size 1
```

Step 2: build the TensorRT engine:

```
trtllm-build --checkpoint_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout-trtengine \
    --remove_input_padding enable \
    --context_fmha enable \
    --gemm_plugin float16 \
    --paged_kv_cache enable \
    --max_num_tokens 65536 \
    --max_batch_size 32 \
    --max_input_len 32768 \
    --gpt_attention_plugin float16
```

Expected behavior

The engine builds successfully.

Actual behavior

Error log:

```
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
0.12.0.dev2024072301
Loading checkpoint shards: 100%|██████████| 7/7 [00:04<00:00, 1.49it/s]
/usr/local/lib/python3.10/dist-packages/datasets/load.py:1491: FutureWarning: The repository for ccdv/cnn_dailymail contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ccdv/cnn_dailymail
You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Downloading builder script: 100%|██████████| 9.27k/9.27k [00:00<00:00, 44.9MB/s]
Downloading readme: 100%|██████████| 13.9k/13.9k [00:00<00:00, 71.3MB/s]
Downloading data: 100%|██████████| 159M/159M [00:01<00:00, 156MB/s]
Downloading data: 100%|██████████| 376M/376M [00:02<00:00, 163MB/s]
Downloading data: 2.11MB [00:00, 121MB/s]
Downloading data: 46.4MB [00:00, 127MB/s]
Downloading data: 2.43MB [00:00, 125MB/s]
Generating train split: 287113 examples [00:28, 10025.76 examples/s]
Generating validation split: 13368 examples [00:01, 10251.71 examples/s]
Generating test split: 11490 examples [00:01, 9442.46 examples/s]
calibrating model:   0%|          | 0/512 [00:00<?, ?it/s]We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
calibrating model: 100%|██████████| 512/512 [00:26<00:00, 19.61it/s]
Weights loaded. Total time: 00:00:04
Total time of converting checkpoints: 00:02:15
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
[07/25/2024-08:48:18] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set gemm_plugin to float16.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set nccl_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set lookup_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set lora_plugin to None.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set moe_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set context_fmha to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set remove_input_padding to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set reduce_fusion to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set enable_xqa to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set tokens_per_block to 64.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set multiple_profiles to False.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set paged_state to True.
[07/25/2024-08:48:18] [TRT-LLM] [I] Set streamingllm to False.
[07/25/2024-08:48:18] [TRT-LLM] [W] max_seq_len is scaled to 1048576.0 by rotary scaling 8.0
[07/25/2024-08:48:18] [TRT-LLM] [I] max_seq_len is not specified, using value 1048576.0
[07/25/2024-08:48:18] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[07/25/2024-08:48:18] [TRT-LLM] [W] Specifying a max_num_tokens larger than 16384 is usually not recommended, we do not expect perf gain with that and too large max_num_tokens could possibly exceed the TensorRT tensor volume, causing runtime errors. Got max_num_tokens = 65536
[07/25/2024-08:48:18] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 535, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 371, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 338, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 307, in build_model
    model = model_cls.from_checkpoint(ckpt_dir, config=rank_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 428, in from_checkpoint
    model = cls(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 364, in __call__
    obj.__post_init__()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 380, in __post_init__
    quantize(self, self.config.quantization)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize.py", line 276, in quantize
    model = smooth_quantize(model, quant_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize.py", line 187, in smooth_quantize
    return smooth_quantize_plugin(model, quant_config.quant_mode)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize.py", line 174, in smooth_quantize_plugin
    quant_layer = quant_cls(**init_params)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/layers.py", line 1456, in __init__
    rotary_embedding_scaling["type"])
KeyError: 'type'
```

additional notes

None.

fan-niu avatar Jul 25 '24 11:07 fan-niu

@kaiyux Can you provide some help with this issue? We need to use the SmoothQuant strategy on an A100 machine. Thank you very much.

fan-niu avatar Jul 29 '24 02:07 fan-niu

Thank you for the report. We can reproduce the issue and will fix it soon.

If you hope to fix it locally, you can refer to the self.rotary_embedding_scale_type handling in tensorrt_llm/layers/attention.py (https://github.com/NVIDIA/TensorRT-LLM/blob/a681853d3803ee5893307e812530b5e7004bb6e1/tensorrt_llm/layers/attention.py#L402-L408) and apply the same fix in tensorrt_llm/quantization/layers.py (https://github.com/NVIDIA/TensorRT-LLM/blob/a681853d3803ee5893307e812530b5e7004bb6e1/tensorrt_llm/quantization/layers.py#L1454-L1459).
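For readers attempting this locally, here is a minimal sketch of that kind of patch for the quantized attention layer's __init__ in tensorrt_llm/quantization/layers.py. It mirrors the defensive lookup used by tensorrt_llm/layers/attention.py rather than indexing rotary_embedding_scaling["type"] directly; it is not the official fix, and the surrounding variable names may differ in your checkout:

```python
# Sketch only, not the official patch. Llama 3.1 configs carry the rope
# scaling kind under "rope_type" (value "llama3"), while older configs use
# "type", so rotary_embedding_scaling["type"] raises KeyError: 'type'.
from tensorrt_llm.functional import RotaryScalingType  # already imported at the top of layers.py

self.rotary_embedding_scale_type = RotaryScalingType.none
self.rotary_embedding_scale = rotary_embedding_scale
if rotary_embedding_scaling is not None:
    # Accept either key instead of hard-indexing "type".
    scaling_type = rotary_embedding_scaling.get(
        "type", rotary_embedding_scaling.get("rope_type"))
    self.rotary_embedding_scale_type = RotaryScalingType.from_string(scaling_type)
    self.rotary_embedding_scale = rotary_embedding_scaling.get(
        "factor", self.rotary_embedding_scale)
```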

Besides, there are two additional notes:

  1. You use float16. As far as I know, most Llama 3.1 models are trained in bfloat16, so running inference in float16 carries some accuracy risk.
  2. You don't set max_seq_len, which can produce a very long default max_seq_len and lead to OOM (see the hedged build example after this list).
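For reference, here is a hedged variant of the build command from the reproduction steps that addresses note 2; the values are illustrative, not a recommendation:

```
# Sketch only: cap the sequence length explicitly so the rotary-scaling
# default (1048576) cannot inflate engine memory, and keep max_num_tokens
# at or below 16384 per the build-time warning in the log above.
# For note 1, if the SmoothQuant conversion path in your version supports
# it, convert with --dtype bfloat16 and use bfloat16 plugins instead.
trtllm-build --checkpoint_dir Meta-Llama-3.1-8B-Instruct-smqout \
    --output_dir Meta-Llama-3.1-8B-Instruct-smqout-trtengine \
    --remove_input_padding enable \
    --context_fmha enable \
    --gemm_plugin float16 \
    --paged_kv_cache enable \
    --max_seq_len 32768 \
    --max_num_tokens 16384 \
    --max_batch_size 32 \
    --gpt_attention_plugin float16
```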

byshiue avatar Aug 05 '24 07:08 byshiue

@byshiue Thank you very much; I will verify this fix locally.

fan-niu avatar Aug 05 '24 08:08 fan-niu

Did this fix work for you? It did not work for me.

manu-web avatar Aug 06 '24 22:08 manu-web

Did this fix work for you? It did not work for me.

Already tested; it works.

fan-niu avatar Aug 07 '24 08:08 fan-niu

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar Sep 07 '24 01:09 github-actions[bot]

This issue was closed because it has been stalled for 15 days with no activity.

github-actions[bot] avatar Sep 22 '24 02:09 github-actions[bot]