
Example of a Kubernetes configuration for multi-node, multi-GPU serving


How can we set up Kubernetes to request multi-node, multi-GPU resources for serving with the model parallelism or tensor parallelism mentioned in the FasterTransformer backend, or other model parallelism via PyTorch/TensorFlow? The current AWS Kubernetes example for Triton server only covers a single node with a single GPU. A rough sketch of what I mean is below.
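For reference, here is a minimal sketch of the kind of deployment I have in mind (the image tag, model repository path, and GPU count are illustrative placeholders, not a working config). This only requests multiple GPUs on a single node; it doesn't answer the multi-node part of the question.

```yaml
# Illustrative sketch only: a single-pod, multi-GPU Triton deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-multi-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-multi-gpu
  template:
    metadata:
      labels:
        app: triton-multi-gpu
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:22.08-py3          # placeholder tag
          args: ["tritonserver", "--model-repository=s3://my-bucket/models"]  # placeholder repo
          resources:
            limits:
              nvidia.com/gpu: 4   # request 4 GPUs on one node; spanning nodes needs more than this
```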

gyin94 avatar Aug 31 '22 00:08 gyin94

@rmccorm4 Are you able to provide more context for this?

krishung5 avatar Sep 07 '22 02:09 krishung5

ping @rmccorm4 ?

jbkyang-nvi avatar Nov 22 '22 03:11 jbkyang-nvi

Hi @rossbucky,

Multi-node inference is specific to the FasterTransformer backend for now, so please ask any multi-node or FasterTransformer-specific questions there instead: https://github.com/triton-inference-server/fastertransformer_backend/issues

Model parallelism (running multiple copies of a model to spread inference load across them concurrently) via multi-GPU serving is supported by Triton in general: expose multiple/all GPUs in the necessary instance groups.
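For example, a minimal sketch of a model's `config.pbtxt` instance group (the GPU indices and count are illustrative) that exposes several GPUs so concurrent requests can be spread across them:

```
# Sketch only: instance group exposing GPUs 0-3 for one model.
instance_group [
  {
    # `count` instances of this model are created on each listed GPU,
    # so inference requests can be load-balanced across GPUs 0-3.
    count: 1
    kind: KIND_GPU
    gpus: [0, 1, 2, 3]
  }
]
```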

Tensor parallelism (executing different parts of the same inference request concurrently across different GPUs) is not currently supported in Triton generally, as far as I'm aware. There may be some support specifically in the FasterTransformer backend, but please direct those questions to its repo. CC @byshiue

rmccorm4 avatar Jan 28 '23 00:01 rmccorm4

Closing due to inactivity. Please let us know if you need us to reopen the issue for follow-up.

dyastremsky avatar Feb 13 '23 19:02 dyastremsky