"High GPU usage leads to NaN values in the encoder output of the T5 model (float16).
System Info
GPU: NVIDIA A100
TensorRT-LLM version 0.9.0.dev2024031900
Who can help?
@symphonylyh @byshiue
Information
- [X] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
We use files from the example folder to transform the T5 model.
- Using the float32 type:
  - With initial GPU usage at 0%, inference runs correctly and the results are accurate.
  - With initial GPU usage at 100%, inference also runs normally.
- Using the float16 type:
  - With initial GPU usage at 0%, inference runs correctly and the results are accurate.
  - With initial GPU usage at 100%, inference produces abnormal results (the encoder outputs NaN values).
We truncated the model output. Why do different GPU usage rates lead to overflow in the model output? This is an interesting question.
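A diagnostic along these lines (a hypothetical sketch, not one of the example scripts; `encoder_output` stands for the fp16 tensor returned by the encoder) shows whether the output contains NaN/Inf and how large the activations get:

```python
import torch

# Hypothetical diagnostic (not from the example scripts): inspect the encoder
# output for overflow symptoms. `encoder_output` stands for the fp16 tensor
# returned by the encoder.
def report_overflow(encoder_output: torch.Tensor) -> None:
    x = encoder_output.float()
    print("max |x| :", x.abs().max().item())
    print("NaN count:", torch.isnan(x).sum().item())
    print("Inf count:", torch.isinf(x).sum().item())
```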
Expected behavior
With the float16 type, the initial GPU utilization should not cause the model's output to overflow.
Actual behavior
In practice, however, the results diverge from this expectation: with high initial GPU utilization, the float16 encoder produces NaN values.
Additional notes
We believe this might be a bug.
Could you share the end-to-end reproduction steps?
Also, how do you make the initial GPU usage 100%? By running another program at the same time?
Hi,
We had a somewhat similar issue when looping and sending the same input data many times to stress test the GPU.
We got NaNs after hundreds of thousands of inferences (again, the same data every time, and synchronizing everywhere it made sense)! The model follows the RoBERTa architecture.
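The stress loop was essentially of this shape (a minimal sketch with placeholder `model` and `inputs`, not our actual harness, assuming a Hugging Face-style model whose output exposes `last_hidden_state`):

```python
import torch

# Minimal sketch of the stress loop (placeholder `model` and `inputs`): run the
# exact same batch many times and stop as soon as the output contains NaN.
@torch.inference_mode()
def stress_test(model, inputs, n_iters=500_000):
    for i in range(n_iters):
        out = model(**inputs)
        torch.cuda.synchronize()  # make sure the kernels actually finished
        if torch.isnan(out.last_hidden_state).any():
            print(f"NaN detected at iteration {i}")
            return i
    return None
```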
We noticed that another RoBERTa-based model, compiled the same way, had no such issue, so we thought it was weight-related. We switched to BF16, keeping the fp16-optimized kernel for flash attention. During the build, TRT complains a lot that some nodes get both bf16 and fp16 inputs (I guess to let us know it adds casts in many places). Still, end-to-end performance is almost the same and output precision is only slightly lower (compared to the fp32 AMP reference model), but there are no more issues, and we have stress tested it quite a lot since then.
Maybe something to test, @0xd8b?
We converted the T5 model using the files in examples/enc_dec/. The data type used for the conversion is float16 (batch_size=1, strongly_typed=True, use_bert_plugin=True). Additionally, we truncated the output of hidden_states. However, under high GPU utilization, the encoder output becomes NaN.
Yes, we concurrently ran another model inference program to reach 100% GPU utilization, yet it used only about 1/10 of the total GPU memory.
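For illustration, a load generator as simple as the following sketch (hypothetical, not the exact program we ran) keeps the GPU near 100% utilization while allocating little memory:

```python
import torch

# Hypothetical load generator (not the exact program we ran): a stream of large
# matmuls keeps GPU utilization near 100% while allocating little memory.
def busy_loop(size=4096, dtype=torch.float16):
    a = torch.randn(size, size, device="cuda", dtype=dtype)
    b = torch.randn(size, size, device="cuda", dtype=dtype)
    i = 0
    while True:
        _ = a @ b
        i += 1
        if i % 1000 == 0:
            torch.cuda.synchronize()  # keep the launch queue bounded

if __name__ == "__main__":
    busy_loop()
```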
An intriguing observation is that NaNs do not occur when the model is converted with the float32 data type. Additionally, with float16, if the GPU's initial utilization is 0%, inference proceeds normally. From this observation, it seems that the issue is not solely related to data overflow.
@pommedeterresautee thanks for your reply! Are you referring to converting the model using the bfloat16 data type?
Yes, the conversion to bf16 is done during the conversion step. We had to modify the part where it binds weights to the TRT engine. Have a look at _utils of the TRT lib; there is some useful stuff for bf16 there.
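Conceptually, the weight casting is just this (a hedged sketch with a hypothetical helper name, not the actual TensorRT-LLM binding code):

```python
import torch

# Hypothetical helper (not the actual TensorRT-LLM binding code): cast the
# floating-point weights of a checkpoint to bfloat16 before binding them to the
# engine. numpy has no native bfloat16, so the cast is done on torch tensors.
def cast_checkpoint_to_bf16(state_dict: dict) -> dict:
    casted = {}
    for name, w in state_dict.items():
        if torch.is_floating_point(w):
            casted[name] = w.to(torch.bfloat16)
        else:
            casted[name] = w  # leave integer buffers untouched
    return casted
```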
FWIW, a while back I wrote a bunch of custom Triton (the language, not the server) kernels, and T5 weights were super hard to manage in fp16; the largest flavors produced NaN from time to time depending on the input. I haven't touched it in a long time, but from what I vaguely remember, Google trained it in bf16. I know that at first glance this isn't related to GPU occupancy, but it may be something to keep in mind.
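The numerics make the difference easy to see: fp16 tops out at 65504, while bf16 keeps roughly fp32's exponent range, so large T5 activations can overflow in fp16 but not in bf16. A quick illustration:

```python
import torch

# fp16 overflows to inf well before bf16: max finite fp16 is 65504, while
# bf16 shares fp32's exponent range (~3.4e38).
x = torch.tensor([70000.0])
print(x.to(torch.float16))              # tensor([inf], dtype=torch.float16)
print(x.to(torch.bfloat16))             # tensor([70144.], dtype=torch.bfloat16)
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e+38
```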
@pommedeterresautee OK, thanks for the suggestion, I will give it a try. However, I'm still curious why the model (float16) works fine at low GPU usage.
We attempted to convert the model to bfloat16 and run inference, yet the issue persists under high GPU utilization. It seems there is a problem in the computation of the RMSNorm layer.
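For reference, T5's LayerNorm is an RMSNorm (no mean subtraction, no bias), and one common fp16 failure mode is the sum-of-squares reduction overflowing. A standalone reference sketch that accumulates in fp32 (hypothetical code, not the TensorRT-LLM kernel) looks like this:

```python
import torch

# Reference RMSNorm as used in T5 (standalone sketch, not the TensorRT-LLM
# kernel): the sum of squares is reduced in fp32, then the result is cast back.
# Doing the reduction directly in fp16 can overflow and produce NaN.
def t5_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)  # fp32 reduction
    x_normed = x.float() * torch.rsqrt(variance + eps)
    return weight * x_normed.to(x.dtype)
```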
The issue also turns out to be caused by the encoder_input_length problem described in https://github.com/NVIDIA/TensorRT-LLM/issues/1847. This issue can be closed.