
Issue with FP16 Output for Half Precision Model

Open langong347 opened this issue 1 year ago • 10 comments

Description I am running a half-precision ONNX model inside Triton with the FP16 data type for both the input and output. I have a downstream component (in a separate Kubernetes pod from the one that runs the Triton container) that post-processes the output from Triton, but it does not support FP16. The downstream component is KServe following the open inference protocol, see proto specifications.

The communication between the two pods is through gRPC.

  1. Since I was able to send FP16 into Triton via gRPC and receive FP16 back (before hitting the downstream issue), how does Triton handle FP16 data? It doesn't seem to be implemented in your grpc_service.proto here.

  2. Given the downstream limitation, do you have any suggestions for workarounds?

config.pbtxt

name: "object-detectron"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "images"
    data_type: TYPE_FP16
    dims: [ 3, 800, 800 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
  }
]

Details of my setup:

  • Triton: 23.12
  • Using Triton through KServe
  • Service architecture: preprocessing and postprocessing are in one Kubernetes Pod, prediction is in a second pod which runs the Triton container. Communication between the pods is through gRPC.
  • Model: YOLOv8 in ONNX half-precision format.

@tanmayv25

Expected behavior Being able to receive FP16 model output from Triton via gRPC in a format that can be deserialized by the downstream.

Link #5960

langong347 avatar Mar 22 '24 04:03 langong347

CC @tanmayv25

indrajit96 avatar Mar 25 '24 19:03 indrajit96

Since I was able to send FP16 into Triton via gRPC and receive FP16 back (before hitting the downstream issue), how does Triton handle FP16 data?

Triton does not attempt to interpret the tensor data held within an inference request or the output response from the model for FP16. Triton's gRPC inference protocol directly matches the open inference protocol.

For the reason described in the comment section here, FP16 and BF16 data should be represented as raw input contents of type bytes.

The simple and direct solution would be to redirect the raw_output_contents from Triton to the raw_input_contents of the downstream service. However, as you mentioned, the downstream service does not support FP16, so you might have to parse the output tensors from raw_output_contents element by element and reinterpret them as FP32 or any other supported format.

This approach, however, will incur a performance penalty. My suggested solution would be to modify the post-process service to support the FP16 datatype. Since the datatype comes from the model, it makes sense for the post-process service to support FP16.
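
As a rough sketch of that element-by-element reinterpretation (assuming a numpy-based post-process step and a gRPC ModelInferResponse named response; the helper name is made up):

import numpy as np

# Sketch only: reinterpret the raw FP16 bytes Triton returns in
# raw_output_contents as FP32 for a downstream service without FP16 support.
def fp16_bytes_to_fp32(raw: bytes, shape) -> np.ndarray:
    # The bytes are a packed FP16 array; view it as float16, then upcast.
    fp16 = np.frombuffer(raw, dtype=np.float16).reshape(shape)
    return fp16.astype(np.float32)

# e.g. for the first output tensor of the response:
# out = response.outputs[0]
# fp32 = fp16_bytes_to_fp32(response.raw_output_contents[0], list(out.shape))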

tanmayv25 avatar Mar 27 '24 19:03 tanmayv25

Hi @tanmayv25 Thanks for your reply. Our post-process service (using KServe) uses the open inference protocol, so there's no way for it to accept FP16 output from Triton. However, after some consideration, I think it would be more performant to do the FP32 <> FP16 transformation inside the Triton container (running GPU inference), because the converted tensor(s) will already be on the same GPU.

So extending this thought a bit:

  1. For a half-precision model, we can consider absorbing the FP32 <> FP16 conversion as a custom preprocess (postprocess) step before (after) the inference model inside the Triton server (see the sketch after this list).
  2. I've found a similar example using the python backend here that preprocesses image data, although it does more than data type conversion.
  3. Ensemble might be another alternative, but the python backend looks like a more lightweight approach that addresses the problem.
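
A minimal sketch of what such a conversion model for the python backend might look like (following the structure of the preprocessing example; the tensor names "OUTPUT0_FP16" and "OUTPUT0_FP32" are placeholders that would need to match the model's config.pbtxt):

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # FP16 tensor produced by the ONNX model (name is a placeholder).
            fp16 = pb_utils.get_input_tensor_by_name(request, "OUTPUT0_FP16").as_numpy()
            # Upcast to FP32 so the open-inference-protocol downstream can consume it.
            fp32 = pb_utils.Tensor("OUTPUT0_FP32", fp16.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[fp32]))
        return responses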

Please let me know what you think.

langong347 avatar Apr 01 '24 20:04 langong347

Our post-process service (using KServe) uses the open inference protocol, so there's no way for it to accept FP16 output from Triton.

Triton implements the open inference protocol as well, and the protocol proposes using the raw_input_contents and raw_output_contents fields for communicating among gRPC services. Why not pipe Triton's raw_output_contents to the open inference protocol's raw_input_contents for the downstream service?

Note: These are just uint8_t raw bytes and can represent any byte data.

Adding these layers before and after the model will come with a performance impact, as you'd have to serialize and deserialize the FP16 tensor arrays. If the tensor sizes are not that large, then the impact might be minimal.

tanmayv25 avatar Apr 02 '24 00:04 tanmayv25

If you are okay with the additional overhead and tensor sizes are small, then you can write python backend models for FP32 <> FP16 conversions as described in the pre-process example above.

tanmayv25 avatar Apr 02 '24 00:04 tanmayv25

Hi @tanmayv25

What I meant earlier by "there's no way for it to accept FP16 output from Triton" is that the open inference protocol does not support the FP16 type, i.e., we cannot send the model output in FP16 directly to the post-process service. This was in response to your earlier suggestion, "My suggested solution would be to modify the post-process service to support FP16 datatype".

I agree that we can use raw_input_contents/raw_output_contents as the alternative. However, this approach will require me to serialize (deserialize) FP16 to (from) raw byte format, which can add some performance overhead between Triton and the post-processing service. Correct?

langong347 avatar Apr 02 '24 18:04 langong347

However, this approach will require me to serialize (deserialize) FP16 to (from) raw byte format, which can add some performance overhead between Triton and the post-processing service. Correct?

You don't have to serialize (deserialize) FP16 to raw byte format. Within Triton, the onnx backend will directly write the FP16 tensor data into the protobuf message's repeated bytes field - raw_output_contents. Then, when forming the protobuf request message for the post-process service, you can directly append the raw_output_contents buffer into the raw_input_contents field. See how we do this in our client here. None of this logic cares whether the result was in FP16; it just treats it as raw bytes. There will be a data copy, but that copy would have been performed anyway when marshaling the request protobuf message. Within the post-process service, you'd have the option of converting FP16 -> FP32 or using any other library that directly operates on the FP16 data type.
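
As a rough sketch (not the exact client code linked above; the downstream model name and the use of tritonclient's service_pb2 are just stand-ins for whatever stubs the post-process service generates from the open inference protocol proto):

from tritonclient.grpc import service_pb2

def build_downstream_request(triton_response, downstream_model_name):
    # Copy the FP16 output tensors of a Triton ModelInferResponse into the
    # inputs of a new ModelInferRequest for the post-process service,
    # without touching the FP16 payload itself.
    request = service_pb2.ModelInferRequest(model_name=downstream_model_name)
    for out, raw in zip(triton_response.outputs, triton_response.raw_output_contents):
        inp = request.inputs.add()
        inp.name = out.name
        inp.datatype = out.datatype            # remains the string "FP16"
        inp.shape.extend(out.shape)
        request.raw_input_contents.append(raw)  # raw bytes forwarded as-is
    return request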

tanmayv25 avatar Apr 02 '24 22:04 tanmayv25

Hi @tanmayv25

You don't have to serialize(deserialize) FP16 to raw byte format. Within Triton, onnx backend will directly write the FP16 tensor data into the protobuf message repeated bytes field - raw_output_contents

My current model config file uses TYPE_FP16 as the output format, but the data type was not recognized by the post-process service. Do I need to use a different output data type (in the model config file) to "trigger" the writing of FP16 into raw_output_contents?

According to your doc (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#datatypes), is TYPE_STRING the right format?

|Model Config  |TensorRT      |TensorFlow    |ONNX Runtime  |PyTorch  |API      |NumPy         |
|--------------|--------------|--------------|--------------|---------|---------|--------------|
|TYPE_STRING   |              |DT_STRING     |STRING        |         |BYTES    |dtype(object) |

langong347 avatar Apr 04 '24 14:04 langong347

TYPE_FP16 should be fine for getting FP16 data into raw_output_contents. You did mention that you were able to get the results from Triton already.

Since I was able to send FP16 into Triton via gRPC and receive FP16 back (before hitting the downstream issue)

I am quite positive that you were getting the FP16 data from raw_output_contents. As per my understanding, there is no issue in retrieving FP16 data from Triton; rather, you are unable to pass it to the post-process service because it doesn't handle the FP16 datatype. I am suggesting passing these raw_output_contents bytes from Triton into the raw_input_contents bytes of the service. Note that the datatype field is just the string "FP16", and assuming you own the service, you can add FP16 handling within it with a simple conditional check.

That being said, if you don't own the service and want to perform the element-by-element conversion within Triton itself, then you can do so in a separate custom model (python or C++) and connect it to the ORT model within an ensemble.

tanmayv25 avatar Apr 04 '24 20:04 tanmayv25

Hi @tanmayv25 Thank you for your confirmation. I need to check with the downstream service about supporting raw_output_contents as raw_input_contents when the dtype is FP16. For now, I have implemented a workaround by modifying the last layer of my ONNX model to convert the data type. If I don't get back in time, please feel free to close this issue.
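
For reference, a sketch of that kind of graph edit with the onnx python package (output names follow the config.pbtxt above; treat this as an illustration rather than the exact change I made):

import onnx
from onnx import helper, TensorProto

model = onnx.load("yolov8_fp16.onnx")   # placeholder path
graph = model.graph
orig_output = graph.output[0]            # "output0" (FP16) in the config above

# Append a Cast node that upcasts the final FP16 output to FP32.
graph.node.append(helper.make_node(
    "Cast", inputs=[orig_output.name], outputs=["output0_fp32"], to=TensorProto.FLOAT))

# Expose the FP32 tensor as the new graph output, keeping the original shape.
new_output = helper.make_tensor_value_info("output0_fp32", TensorProto.FLOAT, None)
new_output.type.tensor_type.shape.CopyFrom(orig_output.type.tensor_type.shape)
graph.output.remove(orig_output)
graph.output.append(new_output)

onnx.save(model, "yolov8_fp32_out.onnx")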

langong347 avatar Apr 09 '24 15:04 langong347

Closing the issue due to lack of activity.

tanmayv25 avatar May 16 '24 22:05 tanmayv25