Results 29 comments of Kiran R

> Is there any way to use for example BERT to implement same solution?

This [paper](https://www.aclweb.org/anthology/D19-5821.pdf) explains how to use BERT for QG, but I haven't found any code...

@pommedeterresautee Thanks! Do you plan on adding support for the Triton server?

Great! I tried T5 with cache (i.e. with `past_key_values`) on the Triton server. For generating every single token, the Python backend was making lots of requests (`24 pkv + 1 logits`...
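As a hedged back-of-the-envelope for that `24 pkv + 1 logits` count (assuming a t5-small decoder, which is not stated in the comment): each decoder layer caches four tensors, so 6 layers yield 24 past-key-value tensors, plus one logits tensor per step.

```python
# Assumption: t5-small, whose decoder has 6 layers. Each layer caches
# 4 tensors: self-attention key/value + cross-attention key/value.
num_layers = 6
tensors_per_layer = 4  # self-attn K, self-attn V, cross-attn K, cross-attn V

pkv_tensors = num_layers * tensors_per_layer  # 24 past-key-value tensors
total_outputs = pkv_tensors + 1               # + 1 logits tensor = 25 per step
```

This is why a naive per-tensor request pattern in the Python backend gets expensive: every generated token moves 25 tensors across the backend boundary.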

Great find, and thanks for fixing the bug! Sorry for replying late on this. As mentioned above, I'm trying to serve the T5 model from the Triton server. I have an...

Thanks for the response and the tip. The execution of the ONNX model part is slow:

```python
inference_request = pb_utils.InferenceRequest(
    model_name=self.model_path,
    requested_output_names=["logits"] + self.output_pkv_names,
    inputs=[input_ids, encoder_attention_mask, encoder_hidden_states] + input_past_key_values,
)
inference_response ...
```

> just to be sure, you are using 2 decoders and 1 encoder, right?

Yes.

> why do you need tensor.cuda() in get_output_tensors? It should already be on GPU....

You are getting this error because you've set `max_length=100` but sent `input_ids` of length 18. That's why it stops at 18%. If you set `max_length` to 18 you'll get...
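To illustrate the stopping behavior, here is a minimal greedy-decoding sketch (an illustration, not the actual T5/Triton code): generation loops until either an EOS token appears or the sequence already holds `max_length` tokens, so an input of length 18 with `max_length=18` produces nothing new.

```python
# Hypothetical minimal greedy loop illustrating the max_length cutoff.
# next_token_fn stands in for a model forward pass + argmax.
def greedy_generate(input_ids, next_token_fn, max_length, eos_token_id=1):
    ids = list(input_ids)
    while len(ids) < max_length:  # stop once max_length tokens exist
        tok = next_token_fn(ids)
        ids.append(tok)
        if tok == eos_token_id:   # or stop early on EOS
            break
    return ids

# With 18 input ids and max_length=18, the loop body never runs:
# the output equals the input, so generation appears to stop immediately.
out = greedy_generate(list(range(18)), lambda ids: 0, max_length=18)
```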

Sorry for the late reply. You can get the `hidden states` of the encoder easily, just by sending the `input_ids` and `attention_mask` to the encoder as shown below...
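The referenced snippet is truncated above. As a sketch of the usual Hugging Face pattern (assuming a torch `t5-small` checkpoint; the model name and prompt are placeholders, not from the comment):

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Placeholder checkpoint for illustration.
tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

enc = tok("translate English to German: Hello", return_tensors="pt")
with torch.no_grad():
    # Run only the encoder once; its hidden states can be cached and
    # reused by the decoder at every generation step.
    encoder_outputs = model.get_encoder()(
        input_ids=enc.input_ids, attention_mask=enc.attention_mask
    )

hidden_states = encoder_outputs.last_hidden_state  # (batch, seq_len, d_model)
```

Caching this output is what lets the decoder loop avoid re-encoding the source sequence on every step.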

Yes, it is a bit confusing; it should be `model_name`, not `model_name_or_path`. I'll make this change in the next update. The purpose of `model_name` (the current `model_name_or_path`) is to select...

Can you provide the device specifications and the code you are using to test the speed?