fastertransformer_backend

Serving large models with the FT backend keeps the Triton server crashing and restarting

Open · RajeshThallam opened this issue 2 years ago · 0 comments

We are trying to run Triton with the FasterTransformer backend on a GKE cluster with A100 GPUs to serve models such as T5 and UL2, which are hosted in a Google Cloud Storage model repository. We are using the BigNLP container (nvcr.io/ea-bignlp/bignlp-inference:22.08-py3) to run Triton.
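For context, the server is brought up by pointing Triton at the GCS-hosted model repository. A minimal sketch of such a launch is below; the bucket name and repository path are placeholders, not the actual values from our setup (shown as a plain docker run for brevity; on GKE this maps to a pod spec):

```sh
# Illustrative launch of Triton inside the BigNLP container.
# <your-bucket>/triton_model_repo is a placeholder, not the real path.
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  nvcr.io/ea-bignlp/bignlp-inference:22.08-py3 \
  tritonserver --model-repository=gs://<your-bucket>/triton_model_repo
```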

Based on the instructions, we are able to bring up the Triton Inference Server for the T5-small model. However, when the repository contains a large model such as T5-XXL or UL2, the Triton server keeps crashing and restarting without any meaningful logs for troubleshooting.
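For reference, each FasterTransformer model in the repository is described by a config.pbtxt. A trimmed, illustrative sketch is below; the parameter values (parallelism degrees, checkpoint path) are examples only, not our actual configuration:

```
# Trimmed, illustrative config.pbtxt for a T5 model under the
# fastertransformer backend; values are examples, not the real setup.
name: "fastertransformer"
backend: "fastertransformer"
max_batch_size: 1024
parameters {
  key: "model_type"
  value: { string_value: "T5" }
}
parameters {
  key: "tensor_para_size"   # number of GPUs the weights are sharded across
  value: { string_value: "4" }
}
parameters {
  key: "pipeline_para_size"
  value: { string_value: "1" }
}
parameters {
  key: "model_checkpoint_path"
  value: { string_value: "/models/fastertransformer/1/4-gpu" }
}
```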

Logs when serving the T5-XXL model: [screenshot]

Logs when serving the T5-Small model: [screenshot]

— RajeshThallam, Jan 26 '23 17:01