Preferred / cheapest way to host TF Serving in the cloud?
I'm wondering if you have recommendations on how to host this in the cloud in a cost-effective manner. I've tried a Vertex AI model endpoint on Google Cloud; it's nice that it's easy to set up and offers good GPU acceleration, but it seems to require one instance and one GPU to always be allocated, which quickly adds up to around $250 a month because it runs 24/7.
Can you recommend a setup / service / platform that still runs fast (it probably needs GPU acceleration) but can spin down when not actively used? For my use case, the endpoint might serve predictions every 5-15 minutes, and sometimes it may not be needed at all for 12 hours.
Thanks!
So I figured out that AI Platform has very basic support for running some instance types at the global endpoint only when requested, but those machine types are so limited in resources that my 246 MB model runs out of resources on them, and the n-xxx machine types do not support scaling down to 0. I still haven't found a good, cheap way to run this in the cloud; just leaving this info here for other interested parties.
Kubernetes with the smallest machine type that can handle your model, combined with Spot instances, should do the job.
Does what Pi suggested work for your use case? Here is the link for Spot VMs: https://cloud.google.com/kubernetes-engine/docs/concepts/spot-vms
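To make the suggestion above concrete, here is a rough sketch of adding a Spot VM node pool to an existing GKE cluster that can autoscale down to zero nodes when no predictions are being served. The cluster name, zone, machine type, and GPU type are placeholders, not values from this thread; adjust them to whatever your model actually needs.

```shell
# Add a Spot VM node pool with a GPU to an existing GKE cluster.
# --spot uses cheap preemptible capacity; --min-nodes=0 lets the
# autoscaler remove all nodes in this pool when nothing is scheduled.
# All names and types below are illustrative placeholders.
gcloud container node-pools create serving-spot-pool \
  --cluster=my-cluster \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --spot \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=1
```

Note that Spot VMs can be reclaimed at any time, so this only fits workloads that tolerate brief interruptions, and scaling up from zero adds cold-start latency to the first request.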
I haven't tested Spot VMs yet, but Amazon launched Serverless Inference as a preview service and it works very well so far.
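For anyone curious about the SageMaker Serverless Inference route mentioned above, a minimal sketch looks like the following. The endpoint, config, and model names are placeholders (you'd first register your SavedModel as a SageMaker model); one caveat is that Serverless Inference runs on CPU only, with no GPU support.

```shell
# Create an endpoint config with a serverless variant: capacity is
# provisioned per request and billed per inference, so it effectively
# scales to zero between calls. Names below are placeholders.
# MemorySizeInMB must be between 1024 and 6144.
aws sagemaker create-endpoint-config \
  --endpoint-config-name tf-serverless-config \
  --production-variants '[{
      "VariantName": "AllTraffic",
      "ModelName": "my-tf-model",
      "ServerlessConfig": {"MemorySizeInMB": 4096, "MaxConcurrency": 5}
  }]'

# Deploy the serverless endpoint from that config.
aws sagemaker create-endpoint \
  --endpoint-name tf-serverless-endpoint \
  --endpoint-config-name tf-serverless-config
```

The trade-off versus an always-on GPU instance is cold-start latency on the first request after an idle period, which may be acceptable for a prediction every 5-15 minutes.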
@supermoos,
As Amazon SageMaker Serverless Inference works for you, please let us know if this issue can be closed.
You can also try a cluster or node pool with Spot VMs on GCP as a cheap solution. Ref: Spot VMs
Thank you!