
GPU Isolation and flexible deployment strategies [FEA]


Is your feature request related to a problem? Please describe.
Consider a few scenarios where we need to:

  • deploy multiple models for a single application.
  • deploy multiple models on the same machine across different GPU architectures.
  • lock in resources for deployment so that training can use the remaining resources.

In all these scenarios, we want to assign a GPU to a model and do not want the inference service to take up the entire system. If we can isolate a GPU and pin it to a particular deployment, it will be really useful. It will also future-proof our deployments: imagine we get new GPUs with a new architecture, and the deployment, the model, or the PyTorch version does not work with it. In such a case, we can add the new GPUs without disturbing the existing deployments.
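For instance, pinning a process to a specific GPU is commonly done with the CUDA_VISIBLE_DEVICES environment variable. A minimal sketch (not specific to this SDK; the device index 1 is just an example, and the variable must be set before CUDA is initialized):

```python
import os

# Expose only physical GPU 1 to this process; must happen before any
# CUDA-using library (e.g. torch) initializes the driver.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# Within this process the single visible GPU is re-indexed as cuda:0.
device = torch.device("cuda:0")
model = torch.nn.Linear(4, 2).to(device)  # placeholder for a real model
print(torch.cuda.device_count())          # -> 1
```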

Describe alternatives you've considered
@slbryson has tried GPU isolation using the Clara CLI tools.

Additional context

vikashg (Jan 21 '22 20:01)

This also ties in loosely to what @MMelQin was mentioning about trying to have multiple models deployed in a MAP.

vikashg (Jan 21 '22 21:01)

This is definitely a good request for a much-needed capability, though it is more relevant to a deployment platform. For example, Clara inference operators/applications use the remote Triton Inference Service, which supports model-to-GPU affinity, the number of instances per model, etc., so Triton configuration can be used to distribute model instances across GPUs.
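As an illustration, Triton's per-model config.pbtxt exposes this through its instance_group setting. A minimal sketch (the model name, platform, instance count, and GPU ID below are example values):

```
name: "example_model"
platform: "pytorch_libtorch"

# Run a single instance of this model, pinned to GPU 0 only.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```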

The App SDK does have an issue for utilizing a remote Triton inference service: #212

As for multi-model support (#244), when all the inference operators use in-proc inference, it is possible to:

  • link the operators in the app (via application.add_flow()) in such a way that only one inference operator can run at any given time, so that the GPU is not overloaded (see the sketch after this list).
  • potentially enhance the model loading logic in the App SDK base Application to make use of a specific GPU if so configured, though this becomes moot if remote Triton is used.
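A minimal sketch of the first option, chaining two inference operators serially with Application.add_flow(). The operator classes and their compute bodies are hypothetical placeholders, not actual SDK operators:

```python
import monai.deploy.core as md
from monai.deploy.core import Application, Image, IOType, Operator

# Hypothetical in-proc inference operator; a real one would load its
# model once and run inference in compute().
@md.input("image", Image, IOType.IN_MEMORY)
@md.output("image", Image, IOType.IN_MEMORY)
class InferenceOpA(Operator):
    def compute(self, op_input, op_output, context):
        image = op_input.get("image")
        # ... run model A on image ...
        op_output.set(image, "image")

@md.input("image", Image, IOType.IN_MEMORY)
@md.output("image", Image, IOType.IN_MEMORY)
class InferenceOpB(Operator):
    def compute(self, op_input, op_output, context):
        image = op_input.get("image")
        # ... run model B on image ...
        op_output.set(image, "image")

class MultiModelApp(Application):
    def compose(self):
        op_a = InferenceOpA()
        op_b = InferenceOpB()
        # Chaining B after A serializes execution, so only one model
        # occupies the GPU at any given time.
        self.add_flow(op_a, op_b, {"image": "image"})
```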

MMelQin (Jan 22 '22 00:01)