Add usage in response like OpenAI?

Open npuichigo opened this issue 2 years ago • 8 comments

https://platform.openai.com/docs/api-reference/completions/object#completions/object-usage What about adding usage to the trt ensemble models to return the token usage like OpenAI? At least the prompt and output token lengths. It would make it easier to provide an OpenAI-compatible API.

npuichigo avatar Dec 10 '23 13:12 npuichigo

https://platform.openai.com/docs/api-reference/completions/object#completions/object-usage What about adding usage to the trt ensemble models to return the token usage like OpenAI? At least the prompt and output token lengths. It would make it easier to provide an OpenAI-compatible API.

Have you solved the problem?

shatealaboxiaowang avatar Jan 08 '24 10:01 shatealaboxiaowang

not yet

npuichigo avatar Jan 08 '24 10:01 npuichigo

not yet

Do you know how to do it? Any ideas?

shatealaboxiaowang avatar Jan 09 '24 09:01 shatealaboxiaowang

I think you could customize the logic in the postprocessing and preprocessing models to do the calculation.

npuichigo avatar Jan 09 '24 11:01 npuichigo

I think you could customize the logic in the postprocessing and preprocessing models to do the calculation.

Thank you. I tried it, but it didn't work.

shatealaboxiaowang avatar Jan 17 '24 09:01 shatealaboxiaowang

I managed to add the output_token_len to the output, but couldn't add the input_token_len, since that information is not passed down through the pipeline to the postprocessing model.

Here's how to do it:

We need to add a new output field to the postprocessing model and make small code changes to retrieve and return the token count.

The first step is to modify postprocessing/config.pbtxt and add the following content:


output [
  {
    name: "OUTPUT_TOKEN_LEN"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },

  ...

]

Then we need to change postprocessing/1/model.py to add the logic that produces the tensor for the new output field:


# numpy and pb_utils are already imported at the top of the existing model.py
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:

    def initialize(self, args):

        ...

        # Parse model output configs
        output_names = ["OUTPUT", "OUTPUT_TOKEN_LEN"]
        for output_name in output_names:
            setattr(
                self,
                output_name.lower() + "_dtype",
                pb_utils.triton_string_to_numpy(
                    pb_utils.get_output_config_by_name(
                        model_config, output_name)['data_type']))

    def execute(self, requests):

        ...

        # Number of generated tokens for each output sequence
        output_token_len_tensor = pb_utils.Tensor(
            'OUTPUT_TOKEN_LEN',
            np.array(sequence_lengths).astype(self.output_token_len_dtype))
        outputs.append(output_token_len_tensor)
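
The snippet above assumes sequence_lengths is already available inside execute. In the stock postprocessing model it comes from the SEQUENCE_LENGTH input tensor that the tensorrt_llm step feeds into postprocessing; if your execute loop does not already read it, a minimal sketch (tensor names assumed from the default configs) looks like this:


        # Inside execute(), per request; assumes the default postprocessing
        # config declares TOKENS_BATCH and SEQUENCE_LENGTH inputs.
        for request in requests:
            tokens_batch = pb_utils.get_input_tensor_by_name(
                request, 'TOKENS_BATCH').as_numpy()
            sequence_lengths = pb_utils.get_input_tensor_by_name(
                request, 'SEQUENCE_LENGTH').as_numpy()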


Then we can modify ensemble/config.pbtxt, adding the new field to both the output section and the ensemble pipeline, as shown below:

output [
  {
    name: "output_token_len"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },

  ...

]


ensemble_scheduling {
  step [
    {
      model_name: "postprocessing"
      model_version: -1

      ...

      output_map {
        key: "OUTPUT_TOKEN_LEN"
        value: "output_token_len"
      }
    }
  ]
}
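
The prompt/input token length could likely be handled the same way on the preprocessing side, although I haven't verified it: preprocessing/1/model.py already computes the tokenized input length, so you could add a new output to preprocessing/config.pbtxt and model.py (the names INPUT_TOKEN_LEN and input_token_len below are just placeholders), declare input_token_len in the ensemble's output section like output_token_len above, and map it in the preprocessing step:


ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1

      ...

      output_map {
        key: "INPUT_TOKEN_LEN"
        value: "input_token_len"
      }
    },

    ...

  ]
}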


michaelnny avatar May 25 '24 04:05 michaelnny

You can use https://github.com/npuichigo/openai_trtllm, a wrapper that provides an OpenAI-compatible API for TensorRT-LLM.
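
For reference, once the ensemble exposes these fields, a wrapper or custom client can turn them into an OpenAI-style usage object. Below is only an illustrative sketch: it assumes the model is named "ensemble", uses the default text_input / max_tokens / text_output tensor names, and reads the input_token_len / output_token_len outputs discussed above (depending on your config, other inputs such as bad_words / stop_words may also be required).


import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a single-request batch; input names follow the default ensemble config.
text_input = httpclient.InferInput("text_input", [1, 1], "BYTES")
text_input.set_data_from_numpy(np.array([["Hello, world"]], dtype=object))
max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
max_tokens.set_data_from_numpy(np.array([[64]], dtype=np.int32))

result = client.infer(
    model_name="ensemble",
    inputs=[text_input, max_tokens],
    outputs=[
        httpclient.InferRequestedOutput("text_output"),
        httpclient.InferRequestedOutput("input_token_len"),
        httpclient.InferRequestedOutput("output_token_len"),
    ],
)

prompt_tokens = int(result.as_numpy("input_token_len").sum())
completion_tokens = int(result.as_numpy("output_token_len").sum())

# OpenAI-style usage block for a completions-compatible response.
usage = {
    "prompt_tokens": prompt_tokens,
    "completion_tokens": completion_tokens,
    "total_tokens": prompt_tokens + completion_tokens,
}
print(usage)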

MrD005 avatar Jul 29 '24 08:07 MrD005

Any progress on this issue?

cocodee avatar Dec 31 '24 04:12 cocodee