BART Error: 'BARTTRTDecoder' object has no attribute 'trt_context_non_kv'
Description
Followed https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/notebooks/t5.ipynb to build a TensorRT example for the BART model.
Got a warning at `bart_trt_decoder = BARTTRTDecoder(bart_trt_decoder_engine, metadata, tfm_config)`:
`Cannot find binding of given name: past_key_values.0.decoder.key`
and an error at `outputs = bart_trt_decoder(input_ids, encoder_last_hidden_state)`:
`'BARTTRTDecoder' object has no attribute 'trt_context_non_kv'`
Environment
TensorRT Version: 8.4.1.5
NVIDIA GPU: A100-SXM4-40GB
NVIDIA Driver Version: 460.73.01
CUDA Version: 11.2
CUDNN Version: 8.0.5
Operating System: Debian GNU/Linux 10 (buster)
Python Version (if applicable): 3.7.12
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.11.0
Baremetal or Container (if so, version):
Relevant Files
Steps To Reproduce
```
      6 outputs = bart_trt_decoder(input_ids, encoder_last_hidden_state)

~/projects/bart/code/TensorRT/demo/HuggingFace/NNDF/tensorrt_utils.py in __call__(self, *args, **kwargs)
    166     def __call__(self, *args, **kwargs):
    167         self.trt_context.active_optimization_profile = self.profile_idx
--> 168         return self.forward(*args, **kwargs)
    169
    170 class PolygraphyOnnxRunner:

~/qinqing/projects/bart/code/TensorRT/demo/HuggingFace/BART/trt.py in forward(self, input_ids, encoder_hidden_states, *args, **kwargs)
    401
    402         # denote as variable to allow switch between non-kv and kv engines in kv cache mode
--> 403         trt_context = self.trt_context_non_kv if non_kv_flag else self.trt_context
    404         bindings = self.bindings_non_kv if non_kv_flag else self.bindings
    405         inputs = self.inputs_non_kv if non_kv_flag else self.inputs

AttributeError: 'BARTTRTDecoder' object has no attribute 'trt_context_non_kv'
```
Tried `python3 run.py compare BART --variant facebook/bart-base --working-dir temp` and also got an error:
```
Collecting Data for onnxrt
Traceback (most recent call last):
  File "run.py", line 297, in
```
@kevinch-nv
It works with `use_cache=False`. Are there any effects of running without the cache?
You can set `use_cache=False` for now. The KV cache feature is not fully supported in the notebooks yet. We'll add updated notebooks supporting this feature in one of our next releases.
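For reference, a minimal sketch of how that flag is passed in the notebook flow (hedged: it assumes the same `tfm_config` construction used later in this thread, with `model_name` set to the BART variant being exported):

```python
from transformers import BartConfig
from BART.BARTModelConfig import BARTModelTRTConfig  # assumed import path from the demo

model_name = "facebook/bart-base"  # assumed variant

# Disable the KV-cache path until the notebooks fully support it.
tfm_config = BartConfig(
    use_cache=False,
    num_layers=BARTModelTRTConfig.NUMBER_OF_LAYERS[model_name],
)
```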
Thank you for the info! It seems a large batch size is also not supported yet. Could you please confirm?
Tried `python3 run.py compare BART --variant facebook/bart-base --working-dir temp` and also got an error:
```
Collecting Data for onnxrt
Traceback (most recent call last):
  File "run.py", line 297, in <module>
    main()
  File "run.py", line 293, in main
    return action.execute(known_args)
  File "run.py", line 190, in execute
    results.append(module.RUN_CMD())
  File "/home/jupyter/projects/bart/code/TensorRT/demo/HuggingFace/NNDF/interface.py", line 406, in __call__
    super().__call__()
  File "/home/jupyter/projects/bart/code/TensorRT/demo/HuggingFace/NNDF/interface.py", line 99, in __call__
    self.metadata = self.args_to_network_metadata(self._args)
  File "/home/jupyter/projects/bart/code/TensorRT/demo/HuggingFace/BART/onnxrt.py", line 314, in args_to_network_metadata
    precision=Precision(fp16=args.fp16, tf32=args.tf32),
AttributeError: 'Namespace' object has no attribute 'tf32'
```
@Luckick for this issue, good observation. This came from a version mismatch during the demo development and will be fixed in the next update. Meanwhile, you can change the line to `precision=Precision(fp16=args.fp16)`, i.e. remove the `tf32` field at https://github.com/NVIDIA/TensorRT/blob/d90e0d1df80d7d50bd7603fa1dc30773046d36ae/demo/HuggingFace/BART/onnxrt.py#L314. By default it will run FP32/TF32.
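Concretely, the suggested edit looks roughly like this (a hedged fragment, not a full file: only the relevant keyword argument is shown, assuming `Precision` is the helper already imported in `BART/onnxrt.py`):

```python
# demo/HuggingFace/BART/onnxrt.py, inside args_to_network_metadata()
# Before (fails because the onnxrt arg parser defines no --tf32 flag):
#     precision=Precision(fp16=args.fp16, tf32=args.tf32),
# After (FP32/TF32 remains the default behavior at runtime):
precision = Precision(fp16=args.fp16)
```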
> Thank you for the info! It seems a large batch size is also not supported yet. Could you please confirm?
Can you add more information for this? Running with a batch should work, because the TRT engines that get built all support batching. However, the example Python commands run the inputs listed in `checkpoint.toml`, where only one input is provided as an example. This is true for both the T5 and BART demos. For the notebooks, the inputs can be modified to be batched sequences.
I created 32 duplicates of the input and specified the batch size, which should be passed through the profile. However, I get the error below.

```python
inputs = tokenizer(["translate English to German: That is good."] * 32, return_tensors="pt")
batch_size = 32
max_sequence_length = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[model_name]

decoder_profile = Profile()
decoder_profile.add(
    "input_ids",
    min=(batch_size, 1),
    opt=(batch_size, max_sequence_length // 2),
    max=(batch_size, max_sequence_length),
)
decoder_profile.add(
    "encoder_hidden_states",
    min=(batch_size, 1, max_sequence_length),
    opt=(batch_size, max_sequence_length // 2, max_sequence_length),
    max=(batch_size, max_sequence_length, max_sequence_length),
)

encoder_profile = Profile()
encoder_profile.add(
    "input_ids",
    min=(batch_size, 1),
    opt=(batch_size, max_sequence_length // 2),
    max=(batch_size, max_sequence_length),
)
```
```
[08/01/2022-18:25:22] [TRT] [E] 3: [executionContext.cpp::setBindingDimensions::965] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::965, condition: profileMaxDims.d[i] >= dimensions.d[i]. Supplied binding dimension [32,768] for bindings[0] exceed min ~ max range at index 0, maximum dimension in profile is 1, minimum dimension in profile is 1, but supplied dimension is 32.)
Traceback (most recent call last):
  File "bart.py", line 184, in
```
@Luckick I see. TensorRT inference includes two phases: (1) engine building and (2) execution. The above error shows it happens at step 2, and it's because your built engines still have a fixed batch_size = 1 as the valid input dimension. Note that the Profile only affects step 1, so if you built the engine with bs=1 in step 1 and only change the Profile parameters for step 2, that won't work.
Therefore, what you should do is update the profile based on your needs and then build new engines using that profile (you may need to delete the old engines, because by default engine building is skipped if the files already exist).
If your use case has a dynamic batch size, you can additionally specify the `min`, `opt`, and `max` above to cover, e.g., all dimensions from bs=1 to bs=xxx; this is a TensorRT feature called dynamic shapes, as sketched below.
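A hedged sketch of such a dynamic-batch profile, reusing the variable names from the notebook code above (the exact `min`/`opt`/`max` ranges are up to your workload; `hidden_dim` anticipates the encoder hidden-size fix discussed further down in this thread):

```python
from polygraphy.backend.trt import Profile
from BART.BARTModelConfig import BARTModelTRTConfig  # assumed import path from the demo

model_name = "facebook/bart-base"  # assumed variant
batch_size = 32
max_sequence_length = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[model_name]
hidden_dim = BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[model_name]

# min covers bs=1, max covers bs=32, so a single engine serves the whole range.
decoder_profile = Profile()
decoder_profile.add(
    "input_ids",
    min=(1, 1),
    opt=(batch_size // 2, max_sequence_length // 2),
    max=(batch_size, max_sequence_length),
)
decoder_profile.add(
    "encoder_hidden_states",
    min=(1, 1, hidden_dim),
    opt=(batch_size // 2, max_sequence_length // 2, hidden_dim),
    max=(batch_size, max_sequence_length, hidden_dim),
)

encoder_profile = Profile()
encoder_profile.add(
    "input_ids",
    min=(1, 1),
    opt=(batch_size // 2, max_sequence_length // 2),
    max=(batch_size, max_sequence_length),
)
```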
The batch size assignments happen before the TensorRT engines are built. I am following https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/notebooks/t5.ipynb, and the batch size is assigned at the very beginning of the TensorRT section.
The code below comes AFTER the batch assignment:

```python
bart_trt_encoder_engine = BARTEncoderONNXFile(
    os.path.join(onnx_model_path, encoder_onnx_model_fpath), metadata
).as_trt_engine(os.path.join(tensorrt_model_path, encoder_onnx_model_fpath) + ".engine", profiles=[encoder_profile])

bart_trt_decoder_engine = BARTDecoderONNXFile(
    os.path.join(onnx_model_path, decoder_onnx_model_fpath), metadata
).as_trt_engine(os.path.join(tensorrt_model_path, decoder_onnx_model_fpath) + ".engine", profiles=[decoder_profile])

from BART.trt import BARTTRTEncoder, BARTTRTDecoder

tfm_config = BartConfig(
    use_cache=False,
    num_layers=BARTModelTRTConfig.NUMBER_OF_LAYERS[model_name],
)

bart_trt_encoder = BARTTRTEncoder(
    bart_trt_encoder_engine, metadata, tfm_config, batch_size=batch_size
)
bart_trt_decoder = BARTTRTDecoder(
    bart_trt_decoder_engine, metadata, tfm_config, batch_size=batch_size
)
```
Yes, the `as_trt_engine` lines are where the engines actually get built. Did you see a log in the notebook showing that TRT was building the engine (engine building usually takes a while), or did it just pick up an existing `*.engine` file and go through the building step quickly? For the cleanest check, look at your save path, delete those `*.engine` files, and re-run the notebook steps.
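For example, a small cleanup snippet along these lines (assuming the engines were saved under `tensorrt_model_path`, as in the notebook cells quoted above):

```python
import glob
import os

# Delete previously built engines so as_trt_engine() rebuilds them with the
# new optimization profiles instead of silently reusing the cached files.
for engine_file in glob.glob(os.path.join(tensorrt_model_path, "*.engine")):
    print(f"Deleting cached engine: {engine_file}")
    os.remove(engine_file)
```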
I see. I was using the old engine built with batch_size=1.
Another modification: for the decoder profile, we should have `hidden_dim = BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[model_name]`:

```python
decoder_profile.add(
    "encoder_hidden_states",
    min=(batch_size, 1, hidden_dim),
    opt=(batch_size, max_sequence_length // 2, hidden_dim),
    max=(batch_size, max_sequence_length, hidden_dim),
)
```
It's good to hear the issue is solved by cleaning the engine cache.
For the modification you mentioned: yes, if you're starting from T5, such changes are recommended. Actually, if you run the Python commands, this is already fixed there. The story is that T5 has certain legacies, like mixing MAX_ENCODER_LENGTH and ENCODER_HIDDEN_SIZE; we fixed that in the BART demo and plan to back-port it to the T5 demo later too.
Meanwhile, for any other users who run into the same issue as @Luckick: until we officially release the BART notebook, you're advised to follow the discussion on this page to get a BART notebook working by modifying T5's notebook.
Is there a command or convenient way to set up the engine for a local checkpoint of a fine-tuned BART model, or a customized BART model?
The easiest way I can think of without making structural changes is to go into `frameworks.py`: `generate_and_download_framework()`, and simply replace `.from_pretrained(metadata.variant)` with your local checkpoint, `.from_pretrained(checkpoint_file)`, assuming you fine-tuned one of the bart-base, bart-large, or bart-large-cnn models. If not, you may need to modify `BARTModelConfig.py` first by adding your customized config and then apply the local checkpoint loading trick.
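A minimal sketch of that swap (the checkpoint path and the exact model/tokenizer classes are assumptions here; adjust to match what `frameworks.py` actually instantiates):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Hypothetical local directory holding a fine-tuned bart-base/large/large-cnn checkpoint.
checkpoint_file = "/path/to/finetuned-bart-checkpoint"

# was: BartForConditionalGeneration.from_pretrained(metadata.variant)
model = BartForConditionalGeneration.from_pretrained(checkpoint_file)
tokenizer = BartTokenizer.from_pretrained(checkpoint_file)
```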
I tried to set up a large batch size (e.g. 32) for the engine, but the actual input can sometimes be smaller (e.g. 1). The encoder returns an output for the full batch (32), and the decoder also requires that full batch size as input, otherwise it raises an error. At the same time, inference takes much longer than with the engine built for batch size = 1.
How can we use the engine for inference with a dynamic batch size more efficiently?
```
(Pdb) input_ids.shape
torch.Size([1, 5])
(Pdb) encoder_last_hidden_state = bart_trt_encoder(input_ids=input_ids)
(Pdb) encoder_last_hidden_state.shape
torch.Size([32, 5, 768])
(Pdb) outputs = bart_trt_decoder.greedy_search(decoder_input_ids, encoder_hidden_states = encoder_last_hidden_state)
(Pdb) outputs.shape
torch.Size([32, 7])
```

Trying to pass a smaller batch of data into the decoder:

```
(Pdb) decoder_input_ids = torch.full(
    (1, 1), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
).to("cuda:0")
(Pdb) encoder_last_hidden_state = encoder_last_hidden_state[:1]
(Pdb) outputs = bart_trt_decoder.greedy_search(decoder_input_ids, encoder_hidden_states = encoder_last_hidden_state)
*** RuntimeError: The expanded size of the tensor (122880) must match the existing size (3840) at non-singleton dimension 0. Target sizes: [122880]. Tensor sizes: [3840]
```
The profile is like:

```python
decoder_profile.add(
    "input_ids",
    min=(1, 1),
    opt=(batch_size // 2, max_sequence_length // 2),
    max=(batch_size, max_sequence_length),
)
```