Multi-GPU inference support
Is your feature request related to a problem? Please describe.
NVIDIA's Triton Inference Server provides a feature that lets the user load a model onto multiple GPUs for inference (NVIDIA's terminology for this feature is instance groups). I could not find such a feature in the PyTorch Serve framework. I was wondering whether this feature exists or is work in progress.
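For reference, the Triton feature looks roughly like the following in a model's config.pbtxt (a sketch based on Triton's model configuration docs; the GPU IDs are placeholders):

```
# config.pbtxt (Triton): run one instance of the model on each of GPU 0 and GPU 1
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]
```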
Describe the solution you'd like
- Adding multi-GPU support for inference (CUDA and HIP)
- Adding a load balancer/request scheduler that distributes inference requests across the GPUs on which the model is loaded (see the sketch after this list)
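To make the second point more concrete, here is a minimal sketch of what such a scheduler could look like, written in plain PyTorch with made-up names (MultiGPUScheduler is not an existing TorchServe or TensorFlow Serving API): one replica of the model is loaded per GPU, and requests are dispatched round-robin across the replicas.

```python
import itertools
import torch

class MultiGPUScheduler:
    """Round-robin dispatcher over per-GPU model replicas (illustrative only)."""

    def __init__(self, model_fn, device_ids):
        # Load one replica of the model on each requested GPU.
        # (Works for CUDA devices, and for HIP devices in ROCm builds of PyTorch,
        # which also expose them under the "cuda" device type.)
        self.replicas = [model_fn().to(f"cuda:{i}").eval() for i in device_ids]
        self._next = itertools.cycle(range(len(self.replicas)))

    @torch.no_grad()
    def infer(self, batch):
        # Pick the next replica in round-robin order and run the request on its GPU.
        model = self.replicas[next(self._next)]
        device = next(model.parameters()).device
        return model(batch.to(device)).cpu()

# Hypothetical usage:
# sched = MultiGPUScheduler(lambda: torchvision.models.resnet50(), device_ids=[0, 1])
# out = sched.infer(torch.randn(8, 3, 224, 224))
```

Round-robin is only the simplest policy; a production scheduler would more likely track per-GPU queue depth and batch requests per device.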
@sanjoy, could you help take a look at this FR?
@AliJahan,
Remote Predict Op is an experimental TensorFlow operator that enables users to make a Predict RPC from within a TensorFlow graph executing on machine A to another graph hosted by TensorFlow Serving on machine B.
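For context, the op essentially issues the same Predict RPC that a regular client would make over gRPC. Below is a minimal sketch of that RPC made directly against a TensorFlow Serving instance on machine B, using the public gRPC API rather than the op itself; the host, model name, and input alias are placeholders:

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Connect to the TensorFlow Serving gRPC endpoint on "machine B".
channel = grpc.insecure_channel("machine-b:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Build the Predict request (model name, signature, and input alias are placeholders).
request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"
request.model_spec.signature_name = "serving_default"
request.inputs["inputs"].CopyFrom(
    tf.make_tensor_proto([[1.0, 2.0, 3.0]], dtype=tf.float32)
)

# Remote Predict Op performs this same call from inside a graph running on machine A.
response = stub.Predict(request, timeout=10.0)
print(response.outputs)
```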
Warning: this is an experimental feature, not yet supported by the TensorFlow Serving team, and it may change or be removed at any point.
Hope this answers your query. Thank you!
Closing this due to inactivity. Please take a look at the answers provided above; feel free to reopen and post your comments if you still have queries on this. Thank you!