Triton server keeps crashing and restarting when serving large models with the FT backend
We are trying to run Triton with the FasterTransformer backend on a GKE cluster with A100 GPUs to serve models such as T5 and UL2, which are hosted in a Google Cloud Storage model repository. We are using the BigNLP container (nvcr.io/ea-bignlp/bignlp-inference:22.08-py3) to run Triton.
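For context, we launch Triton inside the container roughly as follows; the bucket name and repository path below are placeholders rather than our actual values, and GOOGLE_APPLICATION_CREDENTIALS is set in the pod so Triton can read from the bucket:

```bash
# Launch Triton against the model repository hosted on GCS.
# <bucket> and the repo path are placeholders.
tritonserver \
  --model-repository=gs://<bucket>/triton-model-repo \
  --log-verbose=1
```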
Following the instructions, we are able to bring up the Triton inference server for the T5-small model. However, when the repository contains larger models such as T5-XXL or UL2, the Triton server keeps crashing and restarting, without any meaningful logs to troubleshoot from.
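In case it is relevant, the FT-specific part of our model configuration follows the T5 example in this repository. This is a sketch: the checkpoint path is a placeholder, and we are assuming that tensor_para_size must match the parallelism the checkpoint was converted with:

```
# config.pbtxt (sketch) -- values are illustrative, not our exact config
name: "fastertransformer"
backend: "fastertransformer"
max_batch_size: 1

parameters {
  key: "model_type"
  value: { string_value: "T5" }
}
parameters {
  key: "tensor_para_size"
  value: { string_value: "1" }
}
parameters {
  key: "pipeline_para_size"
  value: { string_value: "1" }
}
parameters {
  key: "model_checkpoint_path"
  value: { string_value: "/path/to/converted/checkpoint" }
}
```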
Logs when serving T5-XXL model
Logs when serving T5-Small model