T5 model, large difference in results when `remove_input_padding` is enabled
System Info
Container: nvidia/cuda:12.4.1-devel-ubuntu22.04
GPU: L4
TensorRT-LLM version: 0.12.0.dev2024071600
Who can help?
No response
Information
- [X] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Follow the installation steps
- Download the weights of t5-small:
git clone https://huggingface.co/t5-small tmp/hf_models/t5-small
- Convert the weights from HF format to TRT-LLM format and build the TRT engines with the following parameters:
export MODEL_NAME="t5-small"
export MODEL_TYPE="t5"
export INFERENCE_PRECISION="bfloat16"
export TP_SIZE=1
export PP_SIZE=1
export WORLD_SIZE=1
export MAX_BEAM_WIDTH=1
export GPUS_PER_NODE=1
export PAGED_KV_CACHE="disable"
export MAX_BATCH_SIZE=32
export MAX_INPUT_LEN=512
export MAX_SEQ_LEN=201
export REMOVE_INPUT_PADDING="enable"
- Update input_text with the following content:
input_text = [
'Die Schale mit Murtlap-Essenz fiel zu Boden und zerbrach.',
'Harry wurde bewusst, dass er auf den Beinen war, obwohl er sich nicht erinnern konnte, aufgestanden zu sein.',
'Sie hatten das Ende der Reihe erreicht und traten wieder in trübes Kerzenlicht.',
'»Wir – das heißt deine Tante, Dudley und ich – wir gehen aus.«',
'Einen Moment lang geschah nichts, Filch starrte Cho zornfunkelnd an und Cho erwiderte seinen Blick nicht minder zornig, dann machte der Hausmeister kehrt und schlurfte zur Tür zurück.',
'Die Hand auf der Klinke, hielt er inne und blickte sich zu Harry um.',
'»Ja, er hat die Sache für mich rumgerissen«, bestätigte Harry.',
'Er wusste, es würde äußerst undankbar, wenn nicht gar kindisch klingen, wenn er sagte: »Hätte er doch nur mal mit mir gesprochen.',
'Oder mich wenigstens angesehen.«',
'Sie machte eine Miene, als hätte man sie gezwungen, Stinksaft zu schlucken, und ließ ihre Tasche wieder zuschnappen.',
'»Ich hab gesagt aufschlussreich, nicht gut«, sagte Hermine.', '»Sie hat vieles erklärt.«',
'»Gut!«, nuschelte Hagrid und stand auf, hielt sich mit der einen Hand seine blutende Nase zu und mit der anderen seine Armbrust umklammert.',
"»Also ... das war's dann ... ihr habt ihn kennen gelernt und – und er kennt euch jetzt, wenn ihr wieder kommt.", 'Jaah ... nun ...«',
'»Mit niemandem -«, sagte Harry und versuchte sich ihrem Griff zu entziehen.',
'»Von wem ist eigentlich der Brief?«, fragte Ron und nahm Harry das Blatt aus der Hand.',
'»Ich – ähm – hab gehört, sie geht mit jemand anderem«, sagte Hermine behutsam.',
'»Ja ... nun ...«, sagte Professor Umbridge, während Harry, Ron und Hermine sich so langsam wie möglich die Marmortreppe hochschleppten, »lassen Sie uns ins Lehrerzimmer gehen.',
'Ich möchte meinen, Sie könnten eine Tasse Tee vertragen nach Ihrer Reise.«', 'Okay, gehen wir.', 'Locomotor Koffer.«',
'»wie finden Sie als zeitweiliges Mitglied des Kollegiums - als neutrale Außenstehende, wie man vielleicht sagen könnte -, wie finden Sie Hogwarts?',
'»Sie haben es selbst gesehen, was soll das heißen?«, sagte Professor McGonagall und ihre dunklen Augenbrauen zogen sich zusammen.',
'Sie mussten im Laufschritt gehen, um mitzuhalten, während er über den Rasen marschierte und sich bei jedem zweiten Schritt umsah.',
'Als sie zu seiner Hütte gelangten, wandte sich Hermine wie selbstverständlich nach links zur Vordertür.',
'»Der Tornados-Hasser?«, sagte Cho ziemlich kühl.', '»Taugt er was?«',
'Als er sich Ron und Hermine gegenübersetzte, fiel sein Blick auf sein Spiegelbild im Fenster.',
'Er war sehr weiß und seine Narbe schien sich deutlicher abzuzeichnen als sonst.',
'Ron stand mit halb offenem Mund da, sichtlich bestürzt und vollkommen sprachlos, während Hermine den Tränen nahe schien.',
'Mit hämmerndem Herzen blickte er auf.'
]
- Run the Python runtime (examples/enc_dec/run.py)
Expected behavior
HF outputs and TRT-LLM outputs should be roughly the same. As a side note, this is the case with FasterTransformer.
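For reference, a minimal sketch of how the HF baseline outputs can be produced with transformers, using the input_text list above (assuming greedy decoding with a single beam; the exact decoding parameters are not spelled out in this issue, and the model path follows the clone step above):

```python
# Hedged sketch of the HF reference run, using the input_text list above.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_dir = "tmp/hf_models/t5-small"  # path from the git clone step above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = T5ForConditionalGeneration.from_pretrained(model_dir).eval()

# Tokenize the whole batch with padding, then decode greedily (num_beams=1,
# matching MAX_BEAM_WIDTH=1); max_new_tokens=200 roughly matches MAX_SEQ_LEN=201.
inputs = tokenizer(input_text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, num_beams=1, max_new_tokens=200)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```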
Actual behavior
TRT-LLM outputs differ significantly from HF outputs. Generation quality is clearly worse: repetitions, outputs consisting only of special tokens, etc.
Additional notes
I realized just before submitting this issue that the TRT-LLM and HF results are much closer to each other when remove_input_padding is disabled, so I changed the title accordingly. Additionally, similar to previously reported issues, I notice that the discrepancy between TRT-LLM and HF outputs increases as the batch size increases.
After the conversion of the T5 model is complete, with "remove input padding" set to false and a maximum batch size of 8, running inference at batch size 1, 4, or 8 in float32 gives slightly different results. In float16, the differences are even more pronounced. In theory, inference results should be consistent across batch sizes. Is this a normal phenomenon?
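The consistency check I have in mind is simple; a minimal sketch is below (generate_batch is a hypothetical wrapper around the TRT-LLM runtime that maps a list of input strings to a list of output strings; it is not an actual TRT-LLM API):

```python
# Hypothetical consistency check: run the same inputs at several batch sizes
# and count how many decoded outputs differ from the first configuration.
def check_batch_size_consistency(generate_batch, inputs, batch_sizes=(1, 4, 8)):
    reference = None
    for bs in batch_sizes:
        outputs = []
        for i in range(0, len(inputs), bs):
            outputs.extend(generate_batch(inputs[i:i + bs]))  # generate_batch is hypothetical
        if reference is None:
            reference = outputs  # outputs at the first batch size serve as the reference
        else:
            diffs = sum(a != b for a, b in zip(reference, outputs))
            print(f"batch_size={bs}: {diffs}/{len(inputs)} outputs differ from batch_size={batch_sizes[0]}")
```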
I faced the same issue when running with the latest version of the TensorRT-LLM Triton backend: the output is not consistent and does not match the HF output.
When I run a load test against the generate API, shuffling the inputs produces different outputs. Sometimes the output is an entirely pad-token string like <pad><pad><pad><pad><pad>...<pad>, while the output is normal with FasterTransformer.
@0xd8b @thanhlt998 Thanks for reporting this! It might be some bugs introduced in recent versions. We'll investigate this soon.
btw, if I remember correctly you have been actively using enc-dec TRT-LLM for a while, do you have some additional info on whether this is a regression, i.e. did everything work without any problem on your previous TRT-LLM version? If so, can you share which version that was?
@symphonylyh Yes! As I remember, it worked normally with TRT-LLM version 0.11.0.dev2024061800.
@symphonylyh Have you found the cause of the issue?
@thanhlt998 we found (1) the issue only appears for batch size >= 18, (2) the issue is token-specific, i.e. it only happens for specific tokens, and (3) it is due to some NaN elements after the 1st layer.
That's the progress so far -- we're still investigating the root cause. btw, we found the 0604 version also had the same issue (oh, I noticed you have changed the version to 0618, we can try this as well).
@symphonylyh Let me know if there are any new insights. I am very eager to apply TensorRT-LLM for enc-dec in production.
@thanhlt998 Just to share my own experience: when I use float32 and disable remove_input_padding, I see almost no discrepancy between HF outputs and TRT-LLM outputs with the Python runtime (max_batch_size=32).
@ogaloglu I know, but I am eager to use the in-flight batching feature, not the dynamic batching feature. With the parameters above, it is essentially the same as FasterTransformer with dynamic batching.
@thanhlt998 I see, I got your point! Just want to add that I can actually use bfloat16 in FasterTransformer without having this discrepancy in the outputs.
@symphonylyh Do you have any progress here?
@thanhlt998 @ogaloglu @0xd8b root caused and fixing internally now! It was due to the padding_offset setup in gptAttentionCommon. Basically, we need one padding_offset for the encoder input lengths and a separate one for the decoder output lengths. Otherwise it fails under certain edge cases when length(1st input seq in the batch) < batch size (also only under a certain config: dtype != fp32 && remove_padding mode)...
Expect a fix in next week's main branch update!
@symphonylyh Looking forward to your fix next week!
@symphonylyh When will the fixes land on the main branch this week?
@thanhlt998 @ogaloglu @0xd8b Updated on the main branch now! Closing the issue. Please verify.
@symphonylyh I am trying your fixes with the TensorRT-LLM Triton backend at the latest version on the main branch, but I cannot launch the engines with the Triton backend after serializing the encoder and decoder engines. I suspected Triton 24.07 was the cause, but when I build the Docker image with Triton 24.05 (which launched the model normally before your fixes), it still raises the same error:
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
I0821 15:04:54.552435 2226 pb_stub.cc:2121] Non-graceful termination detected.
I0821 15:04:54.552446 2227 pb_stub.cc:2121] Non-graceful termination detected.
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 2420ae08cfa8 exited on signal 8 (Floating point exception).
So I cannot verify your fixes now :(( The same problem is reported here.
@thanhlt998 you can always follow enc_dec/README.md and use examples/run.py or examples/enc_dec/run.py to verify by changing the input text.
The Triton backend error you saw should be a version or setup issue on your end, because we have tested end-to-end and didn't see such an error. Maybe you can verify without Triton first?
@symphonylyh Yes, I successfully ran the TensorRT engines with the C++ backend via examples/enc_dec/run.py in the built Docker container. I can follow enc_dec/README.md, but I cannot launch the engines with the Triton backend.
@thanhlt998 I see. It looks like a general Triton issue, not limited to enc-dec. We will need to wait for the backend team to investigate and fix it.
@symphonylyh thanks a lot; I can confirm that, for this input_text, there is no difference in the results depending on the remove_input_padding parameter (bfloat16 and the Python runtime are used)! I can also confirm that the difference between HF and TRT outputs is much smaller now. Still, I would like to share one observation in case it helps identify unexpected behavior.
The following experiments are still based on the input_text list I shared above, and I am using the same parameters:
export MODEL_NAME="t5-small"
export MODEL_TYPE="t5"
export INFERENCE_PRECISION="bfloat16"
export TP_SIZE=1
export PP_SIZE=1
export WORLD_SIZE=1
export MAX_BEAM_WIDTH=1
export GPUS_PER_NODE=1
export PAGED_KV_CACHE="disable"
export MAX_BATCH_SIZE=32
export MAX_INPUT_LEN=512
export MAX_SEQ_LEN=201
export REMOVE_INPUT_PADDING="enable"
When I pass a batch with the first two sentences of input_text, the HF and TRT results are exactly the same. But when I pass a batch with more than 4 sentences from input_text, the result for the second sentence is very different.
This is the respective sentence: 'Harry wurde bewusst, dass er auf den Beinen war, obwohl er sich nicht erinnern konnte, aufgestanden zu sein.'
HF output (basically the same as the input):
'Harry wurde bewusst, dass er auf den Beinen war, dass er sich nicht erinnern konnte, aufgestanden zu sein.'
TRT output:
'was a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man, a man'
Can there be still some token-specific problems? Thanks a lot in advance! Please let me know if you think I should rather create a new issue for that!
@ogaloglu thanks for the info. We can continue using this issue to investigate. Reopened now
@symphonylyh thank you! Looking forward to your updates!
@symphonylyh I've tested your fixes with the Triton backend; it still has the same issues as before, but with lower frequency. Looking forward to your insights on these issues!
@ogaloglu the "a man, a man" issue appears to occur only with bfloat16. When you switch to float16, it seems fine. This is not a fix yet, but just to share findings.
@symphonylyh In my experiments I only used an RTX 2080 Ti (which supports float16 only), so I think float16/bfloat16 is not the cause of this issue.
@symphonylyh thank you for the insight! Then, I will run some experiments early next week and share the outcomes.
@0xd8b @ogaloglu @thanhlt998 good news!
The main issue regarding remove padding has been fixed and the fix will be released next week. Explanation: when handling remove-padding mode and splitting a [num_tokens, hidden_size * 3] QKV tensor into [BS, seq_len, hidden_size] Q/K/V, the Q/K/V buffers are uninitialized, so those halo elements have random values. The fix is to memset them at the beginning, just like the GPTAttention plugin here.
If you want to verify locally first, you can search for invokeAddFusedQKVBiasTranspose in bertAttentionPlugin.cpp, copy the cudaMemset call (linked above), and place it before invokeAddFusedQKVBiasTranspose.
This should solve the failure seen with IFB + remove padding when the batch size is large.
The "a man, a man" issue is still somewhat unclear, but I suggest @ogaloglu switch to float16, because I saw the "a man" output with bfloat16 while switching to float16 gives a 100% match with HF. Its cause must be something else, and I would suggest filing another bug.
The more fundamental issue should be fixed by the solution in my comment here, and will be released next week
@symphonylyh It is great to hear you have a solution for this issue. I am looking forward to your update next week!
@symphonylyh I have tested the fixes following your instructions. After running the load test with 999 samples (randomly shuffled):
- TRT-LLM vs FasterTransformer: EM = 938/999 samples, avg_fuzzy_score = 98.65
- TRT-LLM (1st time) vs TRT-LLM (2nd time): EM = 985/999 samples, avg_fuzzy_score = 99.76
The results are much better than before the fixes, but there are still slight differences between TRT-LLM and FasterTransformer, and between two different runs.
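For context, the EM and fuzzy-score numbers above can be computed with a small comparison script; a minimal sketch, assuming the rapidfuzz package and two plain lists of output strings (the exact scoring I used may differ slightly):

```python
# Sketch: exact-match count and average fuzzy score between two output lists,
# e.g. TRT-LLM vs. FasterTransformer, or two TRT-LLM runs.
from rapidfuzz import fuzz  # assumed dependency for the fuzzy score

def compare_outputs(outputs_a, outputs_b):
    assert len(outputs_a) == len(outputs_b)
    exact = sum(a == b for a, b in zip(outputs_a, outputs_b))
    avg_fuzzy = sum(fuzz.ratio(a, b) for a, b in zip(outputs_a, outputs_b)) / len(outputs_a)
    print(f"EM = {exact}/{len(outputs_a)}, avg_fuzzy_score = {avg_fuzzy:.2f}")
```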
@thanhlt998 do you have a concrete input text to reproduce the difference? So far we think the TRT-LLM results are correct and a good match with HF, while an exact 100% match cannot be guaranteed among TRT-LLM, FT, and HF because the kernels they select are indeed different.
But if you find weird output, please post it here with a reproducer and we'll investigate.