Multi-GPU inference support
Is your feature request related to a problem? Please describe.
NVIDIA's Triton Inference Server provides a feature that lets the user load a model onto multiple GPUs for inference (NVIDIA's terminology for this feature is instance groups). I could not find such a feature in the PyTorch Serve framework. I was wondering whether this feature exists or is work in progress.
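For reference, the Triton feature looks roughly like the following in a model's config.pbtxt (a sketch based on Triton's model configuration docs; the GPU IDs are placeholders):

```
# config.pbtxt (Triton): run one instance of the model on each of GPU 0 and GPU 1
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]
```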
Describe the solution you'd like
- Adding multi-GPU support for inference (CUDA and HIP)
- Adding a load balancer/request scheduler that distributes inference requests across the GPUs on which the model is loaded (see the sketch after this list)
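To make the second point more concrete, here is a minimal sketch of what such a scheduler could look like, written in plain PyTorch with made-up names (MultiGPUScheduler is not an existing TorchServe or TensorFlow Serving API): one replica of the model is loaded per GPU, and requests are dispatched round-robin across the replicas.

```python
import itertools
import torch

class MultiGPUScheduler:
    """Round-robin dispatcher over per-GPU model replicas (illustrative only)."""

    def __init__(self, model_fn, device_ids):
        # Load one replica of the model on each requested GPU.
        # (Works for CUDA devices, and for HIP devices in ROCm builds of PyTorch,
        # which also expose them under the "cuda" device type.)
        self.replicas = [model_fn().to(f"cuda:{i}").eval() for i in device_ids]
        self._next = itertools.cycle(range(len(self.replicas)))

    @torch.no_grad()
    def infer(self, batch):
        # Pick the next replica in round-robin order and run the request on its GPU.
        model = self.replicas[next(self._next)]
        device = next(model.parameters()).device
        return model(batch.to(device)).cpu()

# Hypothetical usage:
# sched = MultiGPUScheduler(lambda: torchvision.models.resnet50(), device_ids=[0, 1])
# out = sched.infer(torch.randn(8, 3, 224, 224))
```

Round-robin is only the simplest policy; a production scheduler would more likely track per-GPU queue depth and batch requests per device.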
@sanjoy, could you help take a look at this FR?
@AliJahan,
Remote Predict Op is an experimental TensorFlow operator that enables users to make a Predict RPC from within a TensorFlow graph executing on machine A to another graph hosted by TensorFlow Serving on machine B.
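For context, the op essentially issues the same Predict RPC that a regular client would make over gRPC. Below is a minimal sketch of that RPC made directly against a TensorFlow Serving instance on machine B, using the public gRPC API rather than the op itself; the host, model name, and input alias are placeholders:

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Connect to the TensorFlow Serving gRPC endpoint on "machine B".
channel = grpc.insecure_channel("machine-b:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Build the Predict request (model name, signature, and input alias are placeholders).
request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"
request.model_spec.signature_name = "serving_default"
request.inputs["inputs"].CopyFrom(
    tf.make_tensor_proto([[1.0, 2.0, 3.0]], dtype=tf.float32)
)

# Remote Predict Op performs this same call from inside a graph running on machine A.
response = stub.Predict(request, timeout=10.0)
print(response.outputs)
```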
Warning: this is an experimental feature, not yet supported by the TensorFlow Serving team, and it may change or be removed at any point.
Hope this answers your query. Thank you!
Closing this due to inactivity. Please take a look at the answers provided above; feel free to reopen and post your comments if you still have queries on this. Thank you!