Model & Instance scaling
📚 The doc issue
Hi, I have some questions regarding model scaling.
We're currently running one TorchServe instance on a Kubernetes cluster, serving a number of models whose loads vary throughout the day. After some digging I've worked out that TorchServe has no built-in auto-scaling feature (the minWorkers and maxWorkers settings are a bit misleading in this respect). It therefore seems our only option is horizontal scaling on Kubernetes, as documented at https://github.com/pytorch/serve/blob/master/kubernetes/autoscale.md. However, since our models are under differing loads at any one time, we don't really want to scale them all together.
- Are we doing something fundamentally wrong with our setup? Should we perhaps run one TorchServe instance per "group" of models?
- If our setup isn't flawed, would it be worth creating a sidecar container (or a separate application) that monitors the queue time for each model and scales its workers up or down via the management API?
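To make the second idea concrete, here is a minimal sketch of what such a sidecar's control loop could look like. It assumes TorchServe's management API on its default port 8081 and uses the documented `PUT /models/{model_name}?min_worker=N` call to adjust worker counts; the threshold values, model name, and the `target_workers` policy function are illustrative placeholders, and where the per-model queue latency comes from (e.g. the metrics endpoint) is left out.

```python
import urllib.request

# Assumption: TorchServe management API on its default port.
MANAGEMENT_API = "http://localhost:8081"


def target_workers(queue_latency_ms, current,
                   min_workers=1, max_workers=8,
                   scale_up_ms=500.0, scale_down_ms=50.0):
    """Toy threshold policy: add a worker when the observed queue
    latency is high, remove one when it is low, otherwise hold.
    All thresholds are illustrative, not recommendations."""
    if queue_latency_ms > scale_up_ms and current < max_workers:
        return current + 1
    if queue_latency_ms < scale_down_ms and current > min_workers:
        return current - 1
    return current


def scale_model(model_name, workers):
    """Ask TorchServe to scale a model's workers via the management API
    (PUT /models/{name}?min_worker=N; synchronous=true blocks until done)."""
    url = (f"{MANAGEMENT_API}/models/{model_name}"
           f"?min_worker={workers}&synchronous=true")
    req = urllib.request.Request(url, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Example control-loop step (model name and latency are made up):
# current = 2
# desired = target_workers(queue_latency_ms=800.0, current=current)
# if desired != current:
#     scale_model("my_model", desired)
```

Each model would get its own thresholds and loop iteration, so lightly loaded models stay at their floor while busy ones grow, independently of the others.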
Thanks
Suggest a potential alternative/fix
No response