TensorRT-LLM
V100 GPU: Facing issue with model conversion from NeMo format to TRT format
System Info
I am using the nemo-inferencing container to convert a NeMo Llama-13B model to a TRT model with the following command.
python /opt/NeMo/scripts/export/export_to_trt.py --nemo_checkpoint /opt/checkpoints/results/Llama-2-13B-fp16.nemo --model_type="llama" --model_repository /opt/checkpoints/results/tmp_model_repository/
I am executing this on my V100 GPU machine and I get the following error.
[02/06/2024-07:57:28] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 13227, GPU 1012 (MiB)
[02/06/2024-07:57:29] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +482, GPU +78, now: CPU 13844, GPU 1090 (MiB)
[27] Feb 06 07:57:29 [WARNING] - TRT-LLM - [TRT-LLM] [W] Invalid timing cache, using freshly created one
[27] Feb 06 07:57:29 [ INFO] - TRT-LLM - [TRT-LLM] [I] Context FMHA Enabled
[27] Feb 06 07:57:29 [WARNING] - TRT-LLM - [TRT-LLM] [W] RMSnorm plugin is going to be deprecated, disable it for better performance.
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: Unsupported data type, pre SM 80 GPUs do not support bfloat16 (/home/jenkins/agent/workspace/LLM/release-_
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Spin up the nemo-inferencing container
- Execute the following from within the container:
python /opt/NeMo/scripts/export/export_to_trt.py --nemo_checkpoint /opt/checkpoints/results/Llama-2-13B-fp16.nemo --model_type="llama" --model_repository /opt/checkpoints/results/tmp_model_repository/
Expected behavior
The model should be converted to the expected format.
Actual behavior
Conversion Error
Additional notes
None
As the error message says: "Assertion failed: Unsupported data type, pre SM 80 GPUs do not support bfloat16". So this is expected behavior rather than a bug.
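For context, the V100 is SM 70 (compute capability 7.0), below the SM 80 (Ampere) minimum for bfloat16. A quick way to confirm the compute capability, assuming PyTorch is available in the container:
python -c "import torch; print(torch.cuda.get_device_capability())"   # a V100 reports (7, 0)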
Is there a way I could get tensorrt_llm to work with the LLaMA model on my V100 GPU?
You can set the dtype to fp16.
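For illustration only, here is a minimal sketch of an fp16 conversion done directly with the TensorRT-LLM LLaMA example instead of the NeMo export script. It assumes you have a Hugging Face checkpoint of the same model; the paths are placeholders, and the script name and flags can differ between TensorRT-LLM versions.
# Hypothetical paths; convert_checkpoint.py ships with the TensorRT-LLM llama example
python examples/llama/convert_checkpoint.py \
    --model_dir ./Llama-2-13b-hf \
    --output_dir ./llama-13b-trtllm-ckpt-fp16 \
    --dtype float16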
The export_to_trt.py script in the nemo-inference container from NGC has a check that any dtype other than bfloat16 is not supported, and as you said, bfloat16 itself is not supported on V100 GPUs.
Another workaround is to disable the GPT attention plugin via --gpt_attention_plugin=disable.
NeMo's export_to_trt.py does not support this argument, i.e. "--gpt_attention_plugin=disable". Any recommendation on how I can set the "gpt_attention_plugin" parameter in nemo-inference?
That option applies when building the engine via the trtllm-build interface. If the NeMo script does not support it, we don't have any suggestion for enabling model conversion on V100. In particular, the official NeMo documentation states that the export script only supports the A100 and H100 platforms for exporting NeMo models to TensorRT-LLM.
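For reference, this is roughly what the trtllm-build path looks like with the GPT attention plugin disabled. This is a sketch only: the checkpoint and output directories are placeholders, available flags vary across TensorRT-LLM releases, and whether the resulting engine actually runs on V100 (SM 70) is not guaranteed.
# Placeholder directories; the checkpoint dir would come from a prior convert_checkpoint.py step
trtllm-build \
    --checkpoint_dir ./llama-13b-trtllm-ckpt-fp16 \
    --output_dir ./llama-13b-engine-fp16 \
    --gpt_attention_plugin disable \
    --gemm_plugin float16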