
Multi-GPU inference support

AliJahan opened this issue 3 years ago

Is your feature request related to a problem? Please describe.

NVIDIA's Triton Inference Server provides a feature that lets the user load a model onto multiple GPUs for inference (NVIDIA's terminology for this feature is instance groups). I could not find such a feature in the TensorFlow Serving framework. I was wondering whether this feature exists or is a work in progress.
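For context, Triton configures instance groups in the model's `config.pbtxt`. A minimal sketch, for reference only (the instance count and GPU IDs are placeholders):

```
instance_group [
  {
    # one execution instance on each listed GPU
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
```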

Describe the solution you'd like

  • Adding multi-GPU support for inference (CUDA and HIP)
  • Adding a load balancer/request scheduler to distribute inference requests across the GPUs on which the model is loaded (see the sketch after this list)
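To make the second bullet concrete, here is a purely illustrative Python sketch, not an existing TensorFlow Serving API; the class name `ReplicatedModel` and its `predict` method are hypothetical. It loads one SavedModel replica per visible GPU and round-robins requests across them:

```python
# Hypothetical sketch only: illustrates the requested scheduler,
# not an existing TensorFlow Serving feature.
import itertools
import tensorflow as tf

class ReplicatedModel:
    """Loads one copy of a SavedModel per visible GPU and
    round-robins inference requests across the replicas."""

    def __init__(self, export_dir):
        gpus = tf.config.list_logical_devices("GPU")
        if not gpus:
            raise RuntimeError("no GPUs visible")
        self._replicas = []
        for gpu in gpus:
            # Loading under a device scope places the replica's
            # variables on that GPU (best-effort placement).
            with tf.device(gpu.name):
                self._replicas.append(tf.saved_model.load(export_dir))
        self._next = itertools.cycle(range(len(self._replicas)))

    def predict(self, inputs):
        # Naive round-robin scheduling; a production scheduler would
        # also track queue depth and in-flight requests per device.
        replica = self._replicas[next(self._next)]
        return replica.signatures["serving_default"](**inputs)
```

A production version would also need per-device batching and backpressure, which is why the request asks for this inside the serving framework itself.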

AliJahan · Dec 07 '21

@sanjoy could you help take a look at this FR?

yimingz-a · Dec 10 '21

@AliJahan,

Remote Predict Op is an experimental TensorFlow operator that enables users to make a Predict RPC from within a TensorFlow graph executing on machine A to another graph hosted by TensorFlow Serving on machine B.
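Under the hood this is the same Predict RPC that any TensorFlow Serving client can issue over gRPC; Remote Predict Op makes it callable from inside a graph. For reference, a plain client-side Predict call looks roughly like the following (the address `machine-b:8500`, model name `my_model`, and input key `x` are placeholders):

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# Connect to the TensorFlow Serving gRPC endpoint on machine B.
channel = grpc.insecure_channel("machine-b:8500")   # placeholder address
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Build the Predict request for the remotely hosted model.
request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"                # placeholder model name
request.model_spec.signature_name = "serving_default"
request.inputs["x"].CopyFrom(                       # placeholder input key
    tf.make_tensor_proto([[1.0, 2.0]], dtype=tf.float32))

response = stub.Predict(request, timeout=10.0)
print(response.outputs)
```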

Hope this answers your query. Thank you!

Warning: This is an experimental feature that is not yet supported by the TensorFlow Serving team and may change or be removed at any point.

singhniraj08 · Feb 03 '23

Closing this due to inactivity. Please take a look at the answers provided above, and feel free to reopen and post your comments if you still have queries. Thank you!

singhniraj08 · Feb 20 '23