Are FP8 models supported in Triton?
We have an encoder-based model that is currently deployed in FP16 in production, and we want to reduce the latency further.
Does Triton support FP8? I don't see FP8 listed in the datatypes documentation here: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#datatypes
We are using the trtexec CLI to convert the ONNX model to a TRT engine file, and I see an option --fp8 to generate FP8 serialized engines. Can anyone confirm whether we can deploy FP8 models in Triton?
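For reference, the conversion command we run looks roughly like this (file names are placeholders):

```shell
# Build an FP8-quantized serialized engine from the ONNX export (paths are placeholders)
trtexec --onnx=encoder.onnx \
        --fp8 \
        --saveEngine=encoder_fp8.plan
```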
@oandreeva-nv can you help with this ^^ ?
Hi @jayakommuru, let me verify it. I'll get back to you
The TRT backend does not support FP8 I/O for the TRT engine. However, weights and internal tensors can be FP8.
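In practice that means the model config keeps FP16 (or FP32) I/O types even though the engine itself was built with --fp8. A minimal config.pbtxt sketch, assuming the engine exposes FP16 inputs/outputs (tensor names, shapes, and batch size are placeholders):

```
name: "encoder_fp8"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input"          # placeholder tensor name
    data_type: TYPE_FP16   # FP8 is not a valid Triton I/O datatype
    dims: [ 512 ]
  }
]
output [
  {
    name: "output"         # placeholder tensor name
    data_type: TYPE_FP16
    dims: [ 768 ]
  }
]
```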
@oandreeva-nv OK. Can there be any throughput/performance benefit from running an FP8 TRT engine with FP16 I/O? Which Triton data type should be used with an FP8 TRT engine in the TRT backend?
@oandreeva-nv can you confirm whether using FP16 I/O Triton datatypes with an FP8 TRT engine gives any benefit? Thanks
Hi @jayakommuru, we have a perf_analyzer tool that can help you analyze the performance of your model.
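For example, something like this (model names and concurrency range are placeholders) will report latency and throughput for both engine variants so you can compare them directly:

```shell
# Compare the FP16 and FP8 engines under the same load (model names are placeholders)
perf_analyzer -m encoder_fp16 --concurrency-range 1:8 --percentile=95
perf_analyzer -m encoder_fp8  --concurrency-range 1:8 --percentile=95
```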
@oandreeva-nv Sure, will explore perf_analyzer. Any idea whether to use the FP32 or FP16 I/O datatype in Triton for TensorRT FP8 models?
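For context, if we go with FP16 I/O, the client call would look roughly like this (a sketch; model and tensor names are placeholders matching whatever is in config.pbtxt):

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Placeholder model/tensor names; the real ones come from the model's config.pbtxt
client = grpcclient.InferenceServerClient(url="localhost:8001")

batch = np.random.rand(1, 512).astype(np.float16)  # FP16 input to match TYPE_FP16 in the config
inp = grpcclient.InferInput("input", list(batch.shape), "FP16")
inp.set_data_from_numpy(batch)

out = grpcclient.InferRequestedOutput("output")
result = client.infer(model_name="encoder_fp8", inputs=[inp], outputs=[out])
embeddings = result.as_numpy("output")
print(embeddings.shape, embeddings.dtype)
```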