seldon-core
seldon core v2: improved autoscaling
What is the behavior of seldon core v2 in the following scenario?
- A single server with HPA based on 50% CPU usage which scales between 1 and 5 instances. Single model serving only.
- A single model with requirements compatible with the server with SELDON_MODEL_INFERENCE_LAG set to 100ms for scaling up and SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD set to 1 second for scaling down.
- Many requests are coming in, so CPU usage spikes and the server scales out to 5 replicas; requests are also being handled slowly, so there are also 5 model replicas, 1 per server.
- Requests then go down, so CPU usage drops below the 50% threshold and the servers should scale down. But requests keep arriving at a rate such that no model replica is actually inactive for a full second, even though inference lag is also fairly low.
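For concreteness, here is a minimal sketch of the setup described above. The resource names, storage URI, CPU target wiring, and whether an HPA can target the `Server` CRD directly in your version are all assumptions on my part; the two `SELDON_MODEL_*` settings are the ones mentioned above:

```yaml
# Hypothetical Server scaled 1..5 by a standard CPU-based HPA
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
spec:
  serverConfig: mlserver
  replicas: 1
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mlserver-hpa
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    name: mlserver
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
---
# Model whose replicas can scale up to the number of server replicas
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: my-model
spec:
  storageUri: gs://example-bucket/my-model   # hypothetical
  requirements: ["sklearn"]                  # hypothetical
  minReplicas: 1
  maxReplicas: 5
```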
Related feature requests:
- Have models (not servers) scale up and down based on a single metric. Inactivity as the scale-down trigger in particular seems inappropriate when a constant baseline number of requests per second is arriving. I might be misunderstanding how the scaling works and this is already possible. I am also assuming requests are spread evenly between model replicas; perhaps the strategy is instead to saturate one replica before routing to a second, in which case even in my scenario one model would eventually go inactive.
- The ability to scale servers up and down based on the number of model replicas needed. This may be impossible while the mapping between the two is done via server capabilities and model requirements.
- The ability to deploy models to specific servers, i.e., to reference the named k8s object. Today we do this by giving servers very specific names that we mirror in their capabilities. This is perhaps not in line with the Seldon Core v2 philosophy, or maybe it is already possible and I'm simply unaware of how to do it?
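To illustrate the workaround from the last point, here is a hedged sketch of pinning a model to one server by advertising a server-specific capability and requiring it from the model. All names are our own convention, not a Seldon API:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: churn-server
spec:
  serverConfig: mlserver
  # Advertise a capability that doubles as this server's identity
  extraCapabilities: ["server.churn-server"]
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: churn-model
spec:
  storageUri: gs://example-bucket/churn-model  # hypothetical
  # Requiring the server-specific capability effectively pins the
  # model to that one server, since no other server advertises it
  requirements: ["server.churn-server"]
```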
Thanks for your help!
Maybe one solution is to allow model autoscaling to be switched off, plus the ability to lock a model to all server replicas, so that if the server autoscales, the model is added to every new replica. Scale-down scenarios are handled the same way. Essentially that delegates autoscaling to server HPA/KEDA, which is more akin to Seldon Core v1, except that multi-model serving can also be scaled this way. Newly joining models would need to be added to all replicas. @sakoush
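A hypothetical shape for that proposal, purely for discussion. The two commented fields are invented here to illustrate the idea and do not exist in the current API:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: my-model
spec:
  storageUri: gs://example-bucket/my-model  # hypothetical
  requirements: ["sklearn"]                 # hypothetical
  # Invented fields sketching the proposal:
  autoscaling: disabled           # switch off lag/inactivity-based model autoscaling
  followServerReplicas: true      # load the model onto every server replica,
                                  # so HPA/KEDA on the server drives scaling
```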