TensorRT-LLM
AssertionError: tensor WhisperEncoder/encoder_layers/0/attention_layernorm/layer_norm_L5155/NORMALIZATION_0_output_0 has an invalid shape
System Info
Google Colab (NVIDIA T4)
Who can help?
@
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/README.md#distil-whisper
Command:
!trtllm-build --checkpoint_dir distil_whisper_medium_en_weights_int8/encoder \
  --output_dir distil_whisper_medium_en_int8/encoder \
  --paged_kv_cache disable \
  --moe_plugin disable \
  --enable_xqa disable \
  --max_batch_size 8 \
  --gemm_plugin disable \
  --bert_attention_plugin float16 \
  --remove_input_padding disable \
  --max_input_len 1500
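As a quick sanity check before building, one can confirm the dtype recorded in the converted checkpoint. This is a sketch only; it assumes the usual TensorRT-LLM checkpoint layout with a config.json inside the --checkpoint_dir used above:

```python
# Sketch: inspect the converted checkpoint's dtype before running trtllm-build.
# Assumes the standard TensorRT-LLM checkpoint layout (config.json in --checkpoint_dir).
import json

with open("distil_whisper_medium_en_weights_int8/encoder/config.json") as f:
    cfg = json.load(f)

# The engine is built in float16 here, so the checkpoint dtype should match.
print(cfg.get("dtype"))
```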
Expected behavior
The encoder engine builds successfully.
Actual behavior
ITensor::getDimensions: Error Code 4: API Usage Error (WhisperEncoder/encoder_layers/0/attention_layernorm/layer_norm_L5155/NORMALIZATION_0: INormalizationLayer input and scale must have identical types. input type is Half and scale type is Float.)
Additional notes
I hit this error when following https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/README.md#distil-whisper with the command shown in the Reproduction section above.
The full error report reads:

2024-08-20 07:45:07.071785: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-20 07:45:07.092394: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-20 07:45:07.098718: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-20 07:45:07.113933: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-20 07:45:08.198287: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024081300
[08/20/2024-07:45:09] [TRT-LLM] [W] Option --paged_kv_cache is deprecated, use --kv_cache_type=paged/disabled instead.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set gemm_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set nccl_plugin to auto.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set lookup_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set lora_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set moe_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set context_fmha to True.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set remove_input_padding to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set reduce_fusion to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set enable_xqa to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set tokens_per_block to 64.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set multiple_profiles to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set paged_state to True.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set streamingllm to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set paged_kv_cache to False.
[08/20/2024-07:45:09] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_mels = 80
[08/20/2024-07:45:09] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_audio_ctx = 1500
[08/20/2024-07:45:09] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_languages = 99
[08/20/2024-07:45:09] [TRT-LLM] [I] Compute capability: (7, 5)
[08/20/2024-07:45:09] [TRT-LLM] [I] SM count: 40
[08/20/2024-07:45:09] [TRT-LLM] [I] SM clock: 1590 MHz
[08/20/2024-07:45:09] [TRT-LLM] [I] int4 TFLOPS: 260
[08/20/2024-07:45:09] [TRT-LLM] [I] int8 TFLOPS: 130
[08/20/2024-07:45:09] [TRT-LLM] [I] fp8 TFLOPS: 0
[08/20/2024-07:45:09] [TRT-LLM] [I] float16 TFLOPS: 65
[08/20/2024-07:45:09] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[08/20/2024-07:45:09] [TRT-LLM] [I] float32 TFLOPS: 8
[08/20/2024-07:45:09] [TRT-LLM] [I] Total Memory: 15 GiB
[08/20/2024-07:45:09] [TRT-LLM] [I] Memory clock: 5001 MHz
[08/20/2024-07:45:09] [TRT-LLM] [I] Memory bus width: 256
[08/20/2024-07:45:09] [TRT-LLM] [I] Memory bandwidth: 320 GB/s
[08/20/2024-07:45:09] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[08/20/2024-07:45:09] [TRT-LLM] [I] PCIe link width: 16
[08/20/2024-07:45:09] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[08/20/2024-07:45:09] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/20/2024-07:45:09] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/20/2024-07:45:09] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/20/2024-07:45:09] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/20/2024-07:45:09] [TRT-LLM] [I] Set dtype to float16.
[08/20/2024-07:45:09] [TRT-LLM] [W] Overriding paged_state to False
[08/20/2024-07:45:09] [TRT-LLM] [I] Set paged_state to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 2048
[08/20/2024-07:45:09] [TRT-LLM] [W] remove_input_padding is not enabled, the specified max_num_tokens/opt_num_tokens will be ignored.
[08/20/2024-07:45:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 213, GPU 103 (MiB)
[08/20/2024-07:45:11] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +904, GPU +180, now: CPU 1272, GPU 283 (MiB)
[08/20/2024-07:45:11] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[08/20/2024-07:45:11] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to float16.
[08/20/2024-07:45:11] [TRT-LLM] [I] Set nccl_plugin to None.
[08/20/2024-07:45:11] [TRT] [E] ITensor::getDimensions: Error Code 4: API Usage Error (WhisperEncoder/encoder_layers/0/attention_layernorm/layer_norm_L5155/NORMALIZATION_0: INormalizationLayer input and scale must have identical types. input type is Half and scale type is Float.)

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 528, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 394, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 361, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 354, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 1101, in build
    model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1915, in forward
    hidden_states = encoder_layer(hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 243, in forward
    hidden_states = self.attention_layernorm(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/normalization.py", line 49, in forward
    return layer_norm(x, self.normalized_shape, weight, bias, self.eps)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 5155, in layer_norm
    return _create_tensor(layer.get_output(0), layer)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 607, in _create_tensor
    assert trt_tensor.shape.__len__(
AssertionError: tensor WhisperEncoder/encoder_layers/0/attention_layernorm/layer_norm_L5155/NORMALIZATION_0_output_0 has an invalid shape
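For anyone debugging a similar dtype mismatch, here is a sketch that lists the layer-norm weight dtypes in the converted checkpoint; the rank0.safetensors file name is an assumption about the checkpoint layout:

```python
# Sketch: list layer-norm weight dtypes in the converted checkpoint.
# The file name rank0.safetensors is an assumption about the layout.
from safetensors import safe_open

path = "distil_whisper_medium_en_weights_int8/encoder/rank0.safetensors"
with safe_open(path, framework="pt") as f:
    for name in f.keys():
        if "layernorm" in name.lower():
            # A float32 weight here, combined with float16 activations, matches the
            # "input type is Half and scale type is Float" builder error above.
            print(name, f.get_tensor(name).dtype)
```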
For Distil-Whisper, would you mind adding `model = model.half()` here https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/distil_whisper/convert_from_distil_whisper.py#L60 for now?
The code fix will be synced to GitHub later. Thanks.
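For readers landing here, the suggestion amounts to casting the loaded Hugging Face model to FP16 before its weights are converted, so the layer-norm scale parameters end up in the same dtype as the Half activations. A minimal sketch, assuming the script loads the model via transformers (the from_pretrained line is illustrative; only the .half() cast is the suggested fix):

```python
# Sketch of the suggested fix in convert_from_distil_whisper.py (around L60).
# The from_pretrained call and model name are illustrative assumptions.
from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained("distil-whisper/distil-medium.en")
model = model.half()  # cast all parameters (incl. LayerNorm weights) to float16
```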
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.