Model orchestration with heterogeneous hardware
We have seen a few cases where a single deployment needs to span different chips because of quota or resource shortages. In Kubernetes, however, a Deployment usually manages a group of pods that all use one GPU type; if we remove the GPU type constraint, it becomes hard to control the ratio between types. Technically we can work around the problem with multiple Deployments, but rolling upgrades then require additional control, and the same applies to HPA. The RoleSet CRD cannot manage such cases either.
- We may need a different orchestrator for instances running on heterogeneous hardware; HPA and rolling upgrades need to be revised as well.
- We need more advanced traffic routing to handle the differences between hardware types.
- It also brings many challenges for monitoring at the service level.
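To make the ratio-control problem above concrete, here is a toy sketch (all GPU names and throughput numbers are hypothetical, not measured) of sizing per-GPU-type replica pools for a target request rate, which is roughly what an orchestrator for heterogeneous hardware would have to decide:

```python
import math

# Hypothetical per-replica throughput (requests/s) for each GPU type.
THROUGHPUT = {"nvidia-a100": 12.0, "nvidia-l4": 4.0}

def replicas_for(target_rps: float, share: dict[str, float]) -> dict[str, int]:
    """Split target_rps across GPU types by the given traffic share,
    then size each pool by its per-replica throughput (ceil to be safe)."""
    return {
        gpu: math.ceil(target_rps * frac / THROUGHPUT[gpu])
        for gpu, frac in share.items()
    }

# 100 req/s, with 70% served by A100s and 30% by L4s.
print(replicas_for(100.0, {"nvidia-a100": 0.7, "nvidia-l4": 0.3}))
# → {'nvidia-a100': 6, 'nvidia-l4': 8}
```

With a single GPU-type Deployment this calculation is trivial; the point is that once pods of several types sit behind one Deployment without type constraints, there is no built-in knob to hold this ratio.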
I am considering building a Model abstraction that hides the deployment details from users. It should span GPU devices, clouds, etc., which leaves us enough room for cost/performance optimization.
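One possible shape for such a Model abstraction, purely as a sketch (the `Model`/`HardwarePool` names and fields are hypothetical, not a committed API): the user declares the model once plus per-hardware pools, and a controller would own the underlying Deployments and routing weights:

```python
from dataclasses import dataclass, field

@dataclass
class HardwarePool:
    gpu_type: str   # e.g. "nvidia-a100"; could also carry cloud/region
    replicas: int
    weight: float   # traffic share used by the router

@dataclass
class Model:
    """Hypothetical user-facing spec; deployment details stay hidden."""
    name: str
    image: str
    pools: list[HardwarePool] = field(default_factory=list)

    def total_replicas(self) -> int:
        # Service-level capacity across all hardware types.
        return sum(p.replicas for p in self.pools)

m = Model("llama-7b", "vllm/vllm-openai:latest",
          [HardwarePool("nvidia-a100", 6, 0.7), HardwarePool("nvidia-l4", 8, 0.3)])
print(m.total_replicas())  # → 14
```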
Related paper: https://arxiv.org/abs/2404.14527
We do not plan to change the orchestration part in v0.2.0. Let's first resolve the cost-efficient serving issue using multiple Deployments with some common labels; that's enough for now. I will convert this issue into a feature and make it part of the heterogeneity section of the RFC.
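A minimal sketch of the "multiple Deployments with common labels" idea (pod and label names are hypothetical): each GPU-specific Deployment stamps its pods with a shared `model` label plus a distinguishing `gpu-type` label, so one selector aggregates the group at the service level while narrower selectors still address each hardware slice:

```python
# Hypothetical pod metadata as it might appear across two Deployments
# that share a common "model" label but differ in "gpu-type".
pods = [
    {"name": "llama-a100-0", "labels": {"model": "llama-7b", "gpu-type": "a100"}},
    {"name": "llama-a100-1", "labels": {"model": "llama-7b", "gpu-type": "a100"}},
    {"name": "llama-l4-0",   "labels": {"model": "llama-7b", "gpu-type": "l4"}},
    {"name": "other-0",      "labels": {"model": "mistral-7b", "gpu-type": "l4"}},
]

def select(pods, selector):
    """Mimic an equality-based label selector: keep pods matching every key/value."""
    return [p["name"] for p in pods
            if all(p["labels"].get(k) == v for k, v in selector.items())]

# Service-level view: everything serving llama-7b, regardless of GPU type.
print(select(pods, {"model": "llama-7b"}))
# → ['llama-a100-0', 'llama-a100-1', 'llama-l4-0']

# Per-hardware view: only the L4 slice of that group.
print(select(pods, {"model": "llama-7b", "gpu-type": "l4"}))
# → ['llama-l4-0']
```

This is exactly the loose coupling labels give us: grouping works without a new controller, but rolling upgrades and HPA still act per Deployment, which is why the proper Model abstraction is deferred.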
This is a sub-story of #425. We may use a loose mechanism like labels to orchestrate the workload in v0.2.0, and orchestrate such workloads properly in v0.3.0 with the Model API. Postponing to v0.3.0.