Example of Kubernetes configuration for multi-node multi-GPU serving
How can we set up Kubernetes to request multiple nodes and multiple GPUs for serving with the model parallelism or tensor parallelism mentioned in the FasterTransformer backend, or with other model parallelism via PyTorch/TensorFlow? The current AWS Kubernetes example in Triton server is for a single node with a single GPU.
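For example, is requesting multiple GPUs for a single Triton pod, roughly like the sketch below, the right direction, and how would that extend to multiple nodes? (The image tag, model repository path, and GPU count here are just placeholders.)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-multigpu            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-multigpu
  template:
    metadata:
      labels:
        app: triton-multigpu
    spec:
      containers:
        - name: tritonserver
          # Image tag and model repository path are placeholders.
          image: nvcr.io/nvidia/tritonserver:22.07-py3
          args: ["tritonserver", "--model-repository=s3://my-bucket/model_repository"]
          resources:
            limits:
              nvidia.com/gpu: 4    # request 4 GPUs on one node; spanning nodes is the open question
          ports:
            - containerPort: 8000  # HTTP
            - containerPort: 8001  # gRPC
            - containerPort: 8002  # metrics
```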
@rmccorm4 Are you able to provide more context for this?
ping @rmccorm4 ?
Hi @rossbucky,
Multi-node inference is specific to the FasterTransformer backend for now, so please ask any multi-node or FasterTransformer-specific questions there instead: https://github.com/triton-inference-server/fastertransformer_backend/issues
Model parallelism (multiple copies of a model to spread inference load across multiple model instances concurrently) via multi-GPU serving is supported by Triton in general by exposing multiple/all GPUs in the necessary instance groups; see the sketch below.
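For example, a model's config.pbtxt could use an instance group like the following sketch (the GPU indices and count are just placeholders) to place one execution instance on each of two GPUs:

```
instance_group [
  {
    # One execution instance of this model on each of GPU 0 and GPU 1.
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
```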
Tensor parallelism (concurrently executing different parts of the same inference request across different GPUs) is not currently supported in Triton in general, as far as I'm aware. There may be some support specifically in the FasterTransformer backend, but please direct those questions to their repo. CC @byshiue
Closing due to inactivity. Please let us know if you need us to reopen the issue for follow-up.