TensorRT-LLM
V100 GPU: Facing issue with model conversion from NeMo format to TRT format
System Info
I am using the nemo-inferencing container to convert a NeMo Llama-13B model to a TRT model with the following command.
python /opt/NeMo/scripts/export/export_to_trt.py --nemo_checkpoint /opt/checkpoints/results/Llama-2-13B-fp16.nemo --model_type="llama" --model_repository /opt/checkpoints/results/tmp_model_repository/
I am executing this on my V100 GPU machine and I get the following error.
[02/06/2024-07:57:28] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 13227, GPU 1012 (MiB)
[02/06/2024-07:57:29] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +482, GPU +78, now: CPU 13844, GPU 1090 (MiB)
[27] Feb 06 07:57:29 [WARNING] - TRT-LLM - [TRT-LLM] [W] Invalid timing cache, using freshly created one
[27] Feb 06 07:57:29 [ INFO] - TRT-LLM - [TRT-LLM] [I] Context FMHA Enabled
[27] Feb 06 07:57:29 [WARNING] - TRT-LLM - [TRT-LLM] [W] RMSnorm plugin is going to be deprecated, disable it for better performance.
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: Unsupported data type, pre SM 80 GPUs do not support bfloat16 (/home/jenkins/agent/workspace/LLM/release-_
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Spin up the nemo-inferencing container
- Execute the following from within the container:
python /opt/NeMo/scripts/export/export_to_trt.py --nemo_checkpoint /opt/checkpoints/results/Llama-2-13B-fp16.nemo --model_type="llama" --model_repository /opt/checkpoints/results/tmp_model_repository/
Expected behavior
The model should be converted to the expected format.
Actual behavior
Conversion Error
Additional notes
None
As the error message says: "Assertion failed: Unsupported data type, pre SM 80 GPUs do not support bfloat16". So this is expected behavior rather than a bug.
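For context, the V100 is SM 70 (compute capability 7.0), below the SM 80 (Ampere) minimum for bfloat16. A quick way to confirm the compute capability, assuming PyTorch is available in the container:
python -c "import torch; print(torch.cuda.get_device_capability())"   # a V100 reports (7, 0)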
Is there a way I could get tensorrt_llm to work with the LLaMA model on my V100 GPU?
You can set the dtype to fp16.
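For illustration only, here is a minimal sketch of an fp16 conversion done directly with the TensorRT-LLM LLaMA example instead of the NeMo export script. It assumes you have a Hugging Face checkpoint of the same model; the paths are placeholders, and the script name and flags can differ between TensorRT-LLM versions.
# Hypothetical paths; convert_checkpoint.py ships with the TensorRT-LLM llama example
python examples/llama/convert_checkpoint.py \
    --model_dir ./Llama-2-13b-hf \
    --output_dir ./llama-13b-trtllm-ckpt-fp16 \
    --dtype float16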
The export_to_trt.py script in the nemo-inference container from NGC has a check that any dtype other than bfloat16 is not supported, and as you said, bfloat16 itself is not supported on V100 GPUs.
Another workaround is to disable the GPT attention plugin via --gpt_attention_plugin=disable.
NeMo's export_to_trt.py does not support this argument, i.e. "--gpt_attention_plugin=disable". Any recommendation on how I can set the "gpt_attention_plugin" parameter in nemo-inference?
That option applies when building the engine via the trtllm-build interface. If the NeMo script does not support it, we don't have any suggestion for enabling model conversion on V100. In particular, the official NeMo documentation states that the export script only supports the A100 and H100 platforms for exporting NeMo models to TensorRT-LLM.
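For reference, this is roughly what the trtllm-build path looks like with the GPT attention plugin disabled. This is a sketch only: the checkpoint and output directories are placeholders, available flags vary across TensorRT-LLM releases, and whether the resulting engine actually runs on V100 (SM 70) is not guaranteed.
# Placeholder directories; the checkpoint dir would come from a prior convert_checkpoint.py step
trtllm-build \
    --checkpoint_dir ./llama-13b-trtllm-ckpt-fp16 \
    --output_dir ./llama-13b-engine-fp16 \
    --gpt_attention_plugin disable \
    --gemm_plugin float16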