
Example of a Kubernetes configuration for multi-node, multi-GPU serving


How can we set up Kubernetes to request multi-node, multi-GPU resources for serving with the model parallelism or tensor parallelism mentioned in the FasterTransformer backend, or other model parallelism via PyTorch/TensorFlow? The current AWS Kubernetes example for Triton server only covers a single node with a single GPU. A rough sketch of what I mean is below.
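For reference, here is a minimal sketch of the kind of deployment I have in mind (the image tag, model repository path, and GPU count are illustrative placeholders, not a working config). This only requests multiple GPUs on a single node; it doesn't answer the multi-node part of the question.

```yaml
# Illustrative sketch only: a single-pod, multi-GPU Triton deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-multi-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-multi-gpu
  template:
    metadata:
      labels:
        app: triton-multi-gpu
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:22.08-py3          # placeholder tag
          args: ["tritonserver", "--model-repository=s3://my-bucket/models"]  # placeholder repo
          resources:
            limits:
              nvidia.com/gpu: 4   # request 4 GPUs on one node; spanning nodes needs more than this
```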

gyin94 avatar Aug 31 '22 00:08 gyin94

@rmccorm4 Are you able to provide more context for this?

krishung5 avatar Sep 07 '22 02:09 krishung5

ping @rmccorm4 ?

jbkyang-nvi avatar Nov 22 '22 03:11 jbkyang-nvi

Hi @rossbucky,

Multi-node inference is specific to the FasterTransformer backend for now, so please ask any multi-node or FasterTransformer-specific questions there instead: https://github.com/triton-inference-server/fastertransformer_backend/issues

Model parallelism (running multiple copies of a model to spread inference load across them concurrently) via multi-GPU serving is supported by Triton in general: expose multiple/all GPUs in the necessary instance groups.
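For example, a minimal sketch of a model's `config.pbtxt` instance group (the GPU indices and count are illustrative) that exposes several GPUs so concurrent requests can be spread across them:

```
# Sketch only: instance group exposing GPUs 0-3 for one model.
instance_group [
  {
    # `count` instances of this model are created on each listed GPU,
    # so inference requests can be load-balanced across GPUs 0-3.
    count: 1
    kind: KIND_GPU
    gpus: [0, 1, 2, 3]
  }
]
```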

Tensor parallelism (executing different parts of the same inference request concurrently across different GPUs) is not currently supported in Triton generally, as far as I'm aware. There may be some support specifically in the FasterTransformer backend, but please direct those questions to its repo. CC @byshiue

rmccorm4 avatar Jan 28 '23 00:01 rmccorm4

Closing due to inactivity. Please let us know if you need us to reopen the issue for follow-up.

dyastremsky avatar Feb 13 '23 19:02 dyastremsky