Are FP8 models supported in Triton?
We have an encoder-based model that is currently deployed in FP16 in production, and we want to reduce the latency further.
Does Triton support FP8? I don't see FP8 listed in the datatypes documentation here: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#datatypes
We are using the trtexec CLI to convert the ONNX model to a TRT engine file, and I see an option --fp8 to generate FP8 serialized engines. Can anyone confirm whether we can deploy FP8 models in Triton?
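For reference, the conversion command we run looks roughly like this (file names are placeholders):

```shell
# Build an FP8-quantized serialized engine from the ONNX export (paths are placeholders)
trtexec --onnx=encoder.onnx \
        --fp8 \
        --saveEngine=encoder_fp8.plan
```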
@oandreeva-nv can you help with this ^^ ?
Hi @jayakommuru, let me verify it. I'll get back to you
The TRT backend does not support FP8 I/O for the TRT engine. However, weights and internal tensors can be FP8.
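In practice that means the model config keeps FP16 (or FP32) I/O types even though the engine itself was built with --fp8. A minimal config.pbtxt sketch, assuming the engine exposes FP16 inputs/outputs (tensor names, shapes, and batch size are placeholders):

```
name: "encoder_fp8"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input"          # placeholder tensor name
    data_type: TYPE_FP16   # FP8 is not a valid Triton I/O datatype
    dims: [ 512 ]
  }
]
output [
  {
    name: "output"         # placeholder tensor name
    data_type: TYPE_FP16
    dims: [ 768 ]
  }
]
```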
@oandreeva-nv OK. Can there be any throughput/performance benefit from running an FP8 TRT engine with FP16 I/O? Which Triton data type should be used with an FP8 TRT engine in the TRT backend?
@oandreeva-nv can you confirm whether using FP16 I/O Triton datatypes with an FP8 TRT engine gives any benefit? Thanks
Hi @jayakommuru, we have a perf_analyzer tool that can help you analyze the performance of your model.
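For example, something like this (model names and concurrency range are placeholders) will report latency and throughput for both engine variants so you can compare them directly:

```shell
# Compare the FP16 and FP8 engines under the same load (model names are placeholders)
perf_analyzer -m encoder_fp16 --concurrency-range 1:8 --percentile=95
perf_analyzer -m encoder_fp8  --concurrency-range 1:8 --percentile=95
```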
@oandreeva-nv Sure, will explore perf_analyzer. Any idea whether to use the FP32 or FP16 I/O datatype in Triton for TensorRT FP8 models?
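For context, if we go with FP16 I/O, the client call would look roughly like this (a sketch; model and tensor names are placeholders matching whatever is in config.pbtxt):

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Placeholder model/tensor names; the real ones come from the model's config.pbtxt
client = grpcclient.InferenceServerClient(url="localhost:8001")

batch = np.random.rand(1, 512).astype(np.float16)  # FP16 input to match TYPE_FP16 in the config
inp = grpcclient.InferInput("input", list(batch.shape), "FP16")
inp.set_data_from_numpy(batch)

out = grpcclient.InferRequestedOutput("output")
result = client.infer(model_name="encoder_fp8", inputs=[inp], outputs=[out])
embeddings = result.as_numpy("output")
print(embeddings.shape, embeddings.dtype)
```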