Add usage in response like OpenAI?

Open npuichigo opened this issue 2 years ago • 8 comments

https://platform.openai.com/docs/api-reference/completions/object#completions/object-usage What about adding usage to the trt ensemble models to return the token usage like OpenAI? At least the prompt and output token lengths. It would make it easier to provide an OpenAI-compatible API.

npuichigo avatar Dec 10 '23 13:12 npuichigo

https://platform.openai.com/docs/api-reference/completions/object#completions/object-usage What about adding usage to the trt ensemble models to return the token usage like OpenAI? At least the prompt and output token lengths. It would make it easier to provide an OpenAI-compatible API.

Have you solved the problem?

shatealaboxiaowang avatar Jan 08 '24 10:01 shatealaboxiaowang

not yet

npuichigo avatar Jan 08 '24 10:01 npuichigo

not yet

Do you know how to do it? Any ideas?

shatealaboxiaowang avatar Jan 09 '24 09:01 shatealaboxiaowang

I think you could customize the logic in the postprocessing and preprocessing models to do the calculation.

npuichigo avatar Jan 09 '24 11:01 npuichigo

I think you could customize the logic in the postprocessing and preprocessing models to do the calculation.

Thank you. I tried it, but it didn't work.

shatealaboxiaowang avatar Jan 17 '24 09:01 shatealaboxiaowang

I managed to add the output_token_len to the output, but couldn't add the input_token_len, since that information is not passed down through the pipeline to the postprocessing model.

Here's how to do it:

We need to add a new output field to the postprocessing model and make small code changes to retrieve and return the token count.

The first step is to modify postprocessing/config.pbtxt and add the following content:


output [
  {
    name: "OUTPUT_TOKEN_LEN"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },

  ...

]

Then we need to change postprocessing/1/model.py to add the logic that produces the tensor for the new output field:


# numpy and pb_utils are already imported at the top of the existing model.py
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def initialize(self, args):

        ...

        # Parse model output configs
        output_names = ["OUTPUT", "OUTPUT_TOKEN_LEN"]
        for output_name in output_names:
            setattr(
                self,
                output_name.lower() + "_dtype",
                pb_utils.triton_string_to_numpy(
                    pb_utils.get_output_config_by_name(
                        model_config, output_name)['data_type']))

    def execute(self, requests):

        ...

        # Number of generated tokens for each output sequence
        output_token_len_tensor = pb_utils.Tensor(
            'OUTPUT_TOKEN_LEN',
            np.array(sequence_lengths).astype(self.output_token_len_dtype))
        outputs.append(output_token_len_tensor)
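
The snippet above assumes sequence_lengths is already available inside execute. In the stock postprocessing model it comes from the SEQUENCE_LENGTH input tensor that the tensorrt_llm step feeds into postprocessing; if your execute loop does not already read it, a minimal sketch (tensor names assumed from the default configs) looks like this:


        # Inside execute(), per request; assumes the default postprocessing
        # config declares TOKENS_BATCH and SEQUENCE_LENGTH inputs.
        for request in requests:
            tokens_batch = pb_utils.get_input_tensor_by_name(
                request, 'TOKENS_BATCH').as_numpy()
            sequence_lengths = pb_utils.get_input_tensor_by_name(
                request, 'SEQUENCE_LENGTH').as_numpy()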


Then we can modify ensemble/config.pbtxt, adding the new field to both the output section and the ensemble pipeline, as shown below:

output [
  {
    name: "output_token_len"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },

  ...

]


ensemble_scheduling {
  step [
    {
      model_name: "postprocessing"
      model_version: -1

      ...

      output_map {
        key: "OUTPUT_TOKEN_LEN"
        value: "output_token_len"
      }
    }
  ]
}
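
The prompt/input token length could likely be handled the same way on the preprocessing side, although I haven't verified it: preprocessing/1/model.py already computes the tokenized input length, so you could add a new output to preprocessing/config.pbtxt and model.py (the names INPUT_TOKEN_LEN and input_token_len below are just placeholders), declare input_token_len in the ensemble's output section like output_token_len above, and map it in the preprocessing step:


ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1

      ...

      output_map {
        key: "INPUT_TOKEN_LEN"
        value: "input_token_len"
      }
    },

    ...

  ]
}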


michaelnny avatar May 25 '24 04:05 michaelnny

You can use https://github.com/npuichigo/openai_trtllm, a wrapper that provides an OpenAI-compatible API for TensorRT-LLM.
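
For reference, once the ensemble exposes these fields, a wrapper or custom client can turn them into an OpenAI-style usage object. Below is only an illustrative sketch: it assumes the model is named "ensemble", uses the default text_input / max_tokens / text_output tensor names, and reads the input_token_len / output_token_len outputs discussed above (depending on your config, other inputs such as bad_words / stop_words may also be required).


import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a single-request batch; input names follow the default ensemble config.
text_input = httpclient.InferInput("text_input", [1, 1], "BYTES")
text_input.set_data_from_numpy(np.array([["Hello, world"]], dtype=object))
max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
max_tokens.set_data_from_numpy(np.array([[64]], dtype=np.int32))

result = client.infer(
    model_name="ensemble",
    inputs=[text_input, max_tokens],
    outputs=[
        httpclient.InferRequestedOutput("text_output"),
        httpclient.InferRequestedOutput("input_token_len"),
        httpclient.InferRequestedOutput("output_token_len"),
    ],
)

prompt_tokens = int(result.as_numpy("input_token_len").sum())
completion_tokens = int(result.as_numpy("output_token_len").sum())

# OpenAI-style usage block for a completions-compatible response.
usage = {
    "prompt_tokens": prompt_tokens,
    "completion_tokens": completion_tokens,
    "total_tokens": prompt_tokens + completion_tokens,
}
print(usage)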

MrD005 avatar Jul 29 '24 08:07 MrD005

Any progress on this issue?

cocodee avatar Dec 31 '24 04:12 cocodee