BART Error: 'BARTTRTDecoder' object has no attribute 'trt_context_non_kv'

Open Luckick opened this issue 2 years ago • 16 comments

Description

I followed https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/notebooks/t5.ipynb to build the TensorRT example for the BART model. Constructing the decoder with bart_trt_decoder = BARTTRTDecoder(bart_trt_decoder_engine, metadata, tfm_config) prints the warning "Cannot find binding of given name: past_key_values.0.decoder.key", and calling outputs = bart_trt_decoder(input_ids, encoder_last_hidden_state) fails with "'BARTTRTDecoder' object has no attribute 'trt_context_non_kv'".

Environment

TensorRT Version: '8.4.1.5'
NVIDIA GPU: A100-SXM4-40GB
NVIDIA Driver Version: 460.73.01
CUDA Version: 11.2
CUDNN Version: 8.0.5
Operating System: Debian GNU/Linux 10 (buster)
Python Version (if applicable): 3.7.12
Tensorflow Version (if applicable):
PyTorch Version (if applicable): '1.11.0'
Baremetal or Container (if so, version):

Relevant Files

'facebook/bart-base'

Steps To Reproduce

      6 outputs = bart_trt_decoder(input_ids, encoder_last_hidden_state)

~/projects/bart/code/TensorRT/demo/HuggingFace/NNDF/tensorrt_utils.py in __call__(self, *args, **kwargs)
    166     def __call__(self, *args, **kwargs):
    167         self.trt_context.active_optimization_profile = self.profile_idx
--> 168         return self.forward(*args, **kwargs)
    169
    170 class PolygraphyOnnxRunner:

~/qinqing/projects/bart/code/TensorRT/demo/HuggingFace/BART/trt.py in forward(self, input_ids, encoder_hidden_states, *args, **kwargs)
    401
    402         # denote as variable to allow switch between non-kv and kv engines in kv cache mode
--> 403         trt_context = self.trt_context_non_kv if non_kv_flag else self.trt_context
    404         bindings = self.bindings_non_kv if non_kv_flag else self.bindings
    405         inputs = self.inputs_non_kv if non_kv_flag else self.inputs

AttributeError: 'BARTTRTDecoder' object has no attribute 'trt_context_non_kv'

Luckick avatar Jul 29 '22 14:07 Luckick

I also tried python3 run.py compare BART --variant facebook/bart-base --working-dir temp and got this error:

Collecting Data for onnxrt
Traceback (most recent call last):
  File "run.py", line 297, in <module>
    main()
  File "run.py", line 293, in main
    return action.execute(known_args)
  File "run.py", line 190, in execute
    results.append(module.RUN_CMD())
  File "/home/jupyter/projects/bart/code/TensorRT/demo/HuggingFace/NNDF/interface.py", line 406, in __call__
    super().__call__()
  File "/home/jupyter/projects/bart/code/TensorRT/demo/HuggingFace/NNDF/interface.py", line 99, in __call__
    self.metadata = self.args_to_network_metadata(self._args)
  File "/home/jupyter/projects/bart/code/TensorRT/demo/HuggingFace/BART/onnxrt.py", line 314, in args_to_network_metadata
    precision=Precision(fp16=args.fp16, tf32=args.tf32),
AttributeError: 'Namespace' object has no attribute 'tf32'

Luckick avatar Jul 29 '22 14:07 Luckick

@kevinch-nv ^ ^

zerollzeng avatar Jul 30 '22 03:07 zerollzeng

It works with use_cache=False. Are there any side effects of running without the cache?

Luckick avatar Jul 30 '22 04:07 Luckick

You can set use_cache=False for now. The kv cache feature is not fully supported in notebooks yet. We'll add updated notebooks supporting this feature in one of our next releases.

kevinch-nv avatar Aug 01 '22 17:08 kevinch-nv

Thank you for the info! It seems a large batch size is also not supported yet. Could you please confirm?

Luckick avatar Aug 01 '22 17:08 Luckick

@Luckick Regarding the "'Namespace' object has no attribute 'tf32'" error from run.py: good observation. It came from a version mismatch during development of this demo and will be fixed in the next update. Meanwhile, you can remove the tf32 field on line https://github.com/NVIDIA/TensorRT/blob/d90e0d1df80d7d50bd7603fa1dc30773046d36ae/demo/HuggingFace/BART/onnxrt.py#L314 so that it reads precision=Precision(fp16=args.fp16),. By default it will run FP32/TF32.
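
A minimal sketch of a more tolerant variant of that line, assuming the argument parser may or may not define a tf32 flag. The Precision dataclass below is a hypothetical stand-in for the demo's own class, not its real definition; the only point is that getattr with a default avoids the AttributeError when the attribute is missing.

```python
import argparse
from dataclasses import dataclass


@dataclass
class Precision:
    """Hypothetical stand-in for the demo's Precision metadata (assumption)."""
    fp16: bool
    tf32: bool = False


def precision_from_args(args: argparse.Namespace) -> Precision:
    # getattr with a default tolerates parsers that never registered a tf32 flag.
    return Precision(
        fp16=getattr(args, "fp16", False),
        tf32=getattr(args, "tf32", False),
    )


# A namespace like the one in the traceback: fp16 is present, tf32 is not.
args = argparse.Namespace(fp16=True)
precision = precision_from_args(args)
```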

symphonylyh avatar Aug 01 '22 18:08 symphonylyh

Regarding large batch sizes: can you add more information? Running with a batch should work, because the TRT engines that get built all support batching. Note, however, that the example Python commands run the inputs in checkpoint.toml, which contains only a single input as an example. This is true for both the T5 and BART demos. In the notebooks, the inputs can be modified to be batched sequences.

symphonylyh avatar Aug 01 '22 18:08 symphonylyh

I created 32 duplicates of the input and specified the batch size, which should be passed through the profile. However, I get the error below.

inputs = tokenizer(["translate English to German: That is good."] * 32, return_tensors="pt")
batch_size = 32
max_sequence_length = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[model_name]

decoder_profile = Profile()
decoder_profile.add(
    "input_ids",
    min=(batch_size, 1),
    opt=(batch_size, max_sequence_length // 2),
    max=(batch_size, max_sequence_length),
)
decoder_profile.add(
    "encoder_hidden_states",
    min=(batch_size, 1, max_sequence_length),
    opt=(batch_size, max_sequence_length // 2, max_sequence_length),
    max=(batch_size, max_sequence_length, max_sequence_length),
)

encoder_profile = Profile()
encoder_profile.add(
    "input_ids",
    min=(batch_size, 1),
    opt=(batch_size, max_sequence_length // 2),
    max=(batch_size, max_sequence_length),
)

[08/01/2022-18:25:22] [TRT] [E] 3: [executionContext.cpp::setBindingDimensions::965] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::965, condition: profileMaxDims.d[i] >= dimensions.d[i]. Supplied binding dimension [32,768] for bindings[0] exceed min ~ max range at index 0, maximum dimension in profile is 1, minimum dimension in profile is 1, but supplied dimension is 32. )
Traceback (most recent call last):
  File "bart.py", line 184, in <module>
    bart_trt_encoder_engine, metadata, tfm_config, batch_size = batch_size
  File "/home/jupyter/qinqing/projects/bart/code/TensorRT/demo/HuggingFace/BART/trt.py", line 162, in __init__
    self.bindings = self._allocate_memory(self.input_shapes, self.input_types, self.output_shapes, self.output_types)
  File "/home/jupyter/qinqing/projects/bart/code/TensorRT/demo/HuggingFace/BART/trt.py", line 105, in _allocate_memory
    assert self.trt_context.all_binding_shapes_specified
AssertionError

Luckick avatar Aug 01 '22 18:08 Luckick

@Luckick I see. TensorRT inference includes two phases: (1) engine building (2) execution. The above error shows it's at step 2, and it's because your built engines still have fixed batch_size = 1 as valid input dimensions. Note that the Profile only affects step 1, so if you have built the engine with bs=1 in step 1, and only change the Profile parameters for step 2, that wouldn't work.

Therefore, you should update the profile to match your needs and then build new engines using that profile. You may need to delete the old engines first, because by default engine building is skipped if the files already exist.

If your use case has a dynamic batch size, you can go further: set the min, opt, and max shapes above to cover a range of dimensions, e.g. from bs=1 to bs=xxx. This is a TensorRT feature called Dynamic Shape.
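
The parameter check that failed in the log above (profileMaxDims.d[i] >= dimensions.d[i]) can be illustrated without TensorRT. This is only a plain-Python sketch: shape_fits_profile is a hypothetical helper, not a TensorRT or Polygraphy API, and the second-dimension maximum of 768 is an assumed value chosen for illustration.

```python
from typing import Sequence


def shape_fits_profile(
    shape: Sequence[int], min_shape: Sequence[int], max_shape: Sequence[int]
) -> bool:
    """Return True if every dimension lies inside the profile's [min, max] range."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= d <= hi for d, lo, hi in zip(shape, min_shape, max_shape))


# Engine built with a fixed batch size of 1: a batch of 32 is rejected at runtime.
fixed_ok = shape_fits_profile((32, 768), min_shape=(1, 1), max_shape=(1, 768))

# Engine rebuilt with a batch range of 1..32: both batch sizes are accepted.
dynamic_ok_32 = shape_fits_profile((32, 768), min_shape=(1, 1), max_shape=(32, 768))
dynamic_ok_1 = shape_fits_profile((1, 5), min_shape=(1, 1), max_shape=(32, 768))
```

With a dynamic range the same engine accepts any batch size between the min and max, at the cost of being tuned for the opt shape.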

symphonylyh avatar Aug 01 '22 18:08 symphonylyh

The batch size is assigned before the TensorRT engines are built. I am following https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/notebooks/t5.ipynb, where the batch size is assigned at the very beginning of the TensorRT section.

The code below comes AFTER the batch size assignment:

bart_trt_encoder_engine = BARTEncoderONNXFile(
    os.path.join(onnx_model_path, encoder_onnx_model_fpath), metadata
).as_trt_engine(os.path.join(tensorrt_model_path, encoder_onnx_model_fpath) + ".engine", profiles=[encoder_profile])

bart_trt_decoder_engine = BARTDecoderONNXFile(
    os.path.join(onnx_model_path, decoder_onnx_model_fpath), metadata
).as_trt_engine(os.path.join(tensorrt_model_path, decoder_onnx_model_fpath) + ".engine", profiles=[decoder_profile])

from BART.trt import BARTTRTEncoder, BARTTRTDecoder

tfm_config = BartConfig(
    use_cache=False,
    num_layers=BARTModelTRTConfig.NUMBER_OF_LAYERS[model_name],
)

bart_trt_encoder = BARTTRTEncoder(
    bart_trt_encoder_engine, metadata, tfm_config, batch_size=batch_size
)
bart_trt_decoder = BARTTRTDecoder(
    bart_trt_decoder_engine, metadata, tfm_config, batch_size=batch_size
)

Luckick avatar Aug 01 '22 18:08 Luckick

Yes, the as_trt_engine lines are where the engines actually get built. Did you see a log in the notebook saying that TRT is building the engine (this usually takes a while), or did it pick up an existing *.engine file and pass through the build step quickly? For the cleanest check, look in the path where you save the engines, delete those *.engine files, and re-run the notebook steps.
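
A small sketch of that cleanup step, assuming the engines are saved as *.engine files directly inside one directory (for example the tensorrt_model_path used in the notebook). remove_stale_engines is a hypothetical helper name, not part of the demo.

```python
from pathlib import Path
from typing import List


def remove_stale_engines(engine_dir: str) -> List[str]:
    """Delete every *.engine file directly inside engine_dir and return the removed names."""
    removed = []
    for engine_file in sorted(Path(engine_dir).glob("*.engine")):
        engine_file.unlink()
        removed.append(engine_file.name)
    return removed
```

Calling remove_stale_engines(tensorrt_model_path) before the as_trt_engine cells should force the next run to rebuild with the current profiles.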

symphonylyh avatar Aug 01 '22 19:08 symphonylyh

I see. I was using the old engine built with batch_size=1.

Another modification: in the decoder profile, the last dimension of encoder_hidden_states should be the hidden size, not the sequence length:

hidden_dim = BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[model_name]

decoder_profile.add(
    "encoder_hidden_states",
    min=(batch_size, 1, hidden_dim),
    opt=(batch_size, max_sequence_length // 2, hidden_dim),
    max=(batch_size, max_sequence_length, hidden_dim),
)

Luckick avatar Aug 01 '22 20:08 Luckick

It's good to hear the issue is solved by cleaning engine cache.

As for the modification you mentioned: yes, if you're starting from the T5 notebook, that change is recommended. If you run the Python commands instead, this is already fixed here. The background is that T5 carries some legacy choices, such as mixing up MAX_ENCODER_LENGTH and ENCODER_HIDDEN_SIZE; we fixed that in the BART demo and plan to backport the fix to the T5 demo later.

Meanwhile, any other users who run into an issue similar to @Luckick's can follow the discussion on this page to get a BART notebook working by modifying the T5 notebook, until we officially release the BART notebook.

symphonylyh avatar Aug 01 '22 21:08 symphonylyh

Is there a command or a convenient way to set up the engine for a local checkpoint of a fine-tuned BART model, or for a customized BART model?

Luckick avatar Aug 02 '22 03:08 Luckick

The easiest way I can think of without making structural changes is to go into frameworks.py: generate_and_download_framework() and replace .from_pretrained(metadata.variant) with your local checkpoint, .from_pretrained(checkpoint_file), assuming you fine-tuned one of the bart-base, bart-large, or bart-large-cnn models. If not, you can first modify BARTModelConfig.py by adding your customized config, and then apply the same local checkpoint loading trick.
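
One way to keep that edit switchable is a tiny helper that decides what to pass to from_pretrained. This is only a sketch: resolve_model_source and its parameters are hypothetical names, and it assumes the local checkpoint is a directory of the kind Hugging Face from_pretrained accepts.

```python
import os
from typing import Optional


def resolve_model_source(variant: str, checkpoint_dir: Optional[str] = None) -> str:
    """Return the local checkpoint directory if it exists, else the hub variant name."""
    if checkpoint_dir and os.path.isdir(checkpoint_dir):
        return checkpoint_dir
    return variant


# Intended use inside generate_and_download_framework() (left as a comment because
# it needs the transformers package and the demo's own variables):
#   model = BartForConditionalGeneration.from_pretrained(
#       resolve_model_source(metadata.variant, checkpoint_dir)
#   )
```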

symphonylyh avatar Aug 02 '22 06:08 symphonylyh

I set up a large batch size (e.g. 32) for the engine, but the actual input can sometimes be smaller (e.g. 1). The encoder returns an output sized for the full batch (32), and the decoder also requires an input of that full batch size; otherwise it raises an error. Inference also takes much longer than with the engine built for batch size = 1.

How can we use the engine more efficiently for inference with a dynamic batch size?

(Pdb) input_ids.shape
torch.Size([1, 5])
(Pdb) encoder_last_hidden_state = bart_trt_encoder(input_ids=input_ids)
(Pdb) encoder_last_hidden_state.shape
torch.Size([32, 5, 768])
(Pdb) outputs = bart_trt_decoder.greedy_search(decoder_input_ids, encoder_hidden_states = encoder_last_hidden_state)
(Pdb) outputs.shape
torch.Size([32, 7])

Trying to feed a smaller batch into the decoder:

(Pdb) decoder_input_ids = torch.full(                                                     
    (1, 1), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
).to("cuda:0") 
(Pdb) encoder_last_hidden_state = encoder_last_hidden_state[:1]
(Pdb) outputs = bart_trt_decoder.greedy_search(decoder_input_ids, encoder_hidden_states = encoder_last_hidden_state)
*** RuntimeError: The expanded size of the tensor (122880) must match the existing size (3840) at non-singleton dimension 0.  Target sizes: [122880].  Tensor sizes: [3840]

The profile is like:

decoder_profile.add(
    "input_ids",
    min=(1, 1),
    opt=(batch_size//2, max_sequence_length // 2),
    max=(batch_size, max_sequence_length),
)
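
The two sizes in the RuntimeError above are consistent with a flattened encoder_hidden_states buffer sized for the full batch of 32 receiving a tensor for a batch of 1. This reading is an inference from the numbers, not something confirmed in the thread; the short check below only reproduces the arithmetic.

```python
from math import prod

engine_batch_size = 32  # batch size the decoder wrapper was constructed with
sequence_length = 5     # from input_ids.shape == (1, 5)
hidden_size = 768       # last dimension of encoder_last_hidden_state

# Element count of a buffer sized for the full batch vs. the sliced batch of 1.
expected_elements = prod((engine_batch_size, sequence_length, hidden_size))
supplied_elements = prod((1, sequence_length, hidden_size))
```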

Luckick avatar Aug 04 '22 18:08 Luckick