BentoML
On-demand loading with limited GPU resource
Hi. Is BentoML planning to support on-demand model loading? My use case: I have multiple models that can't all be loaded at the same time, so I want to load some of them and, when a request demands a different model, swap models in and out to process the request.
We have a similar situation. The underlying problem is that our GPU resources are not enough to load all the models at the same time, so we hope the service can auto-switch between models: load only the selected model and run inference on it.
I checked https://docs.bentoml.com/en/latest/guides/scheduling.html, which doesn't mention what happens when there aren't enough resources for multiple runners.
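In the meantime, the swap-on-demand behavior described above can be approximated outside of BentoML with an LRU-style cache that keeps only a bounded number of models resident and unloads the least recently used one before loading the next. The sketch below is a minimal illustration of that pattern, not a BentoML API: `ModelSwapper`, the `loaders` dict, and `capacity` are all hypothetical names, and the loader callables stand in for real model-loading code (in practice the eviction step would also need to free GPU memory, e.g. by deleting the model and clearing the framework's cache).

```python
from collections import OrderedDict

class ModelSwapper:
    """Keep at most `capacity` models resident; evict the least
    recently used model when a new one must be loaded."""

    def __init__(self, loaders, capacity=1):
        self.loaders = loaders          # name -> callable that loads the model
        self.capacity = capacity
        self.resident = OrderedDict()   # name -> loaded model (LRU order)

    def get(self, name):
        if name in self.resident:
            # Already loaded: mark as most recently used and return it.
            self.resident.move_to_end(name)
            return self.resident[name]
        if len(self.resident) >= self.capacity:
            # Evict the least recently used model to make room.
            _evicted_name, evicted_model = self.resident.popitem(last=False)
            del evicted_model           # real code would free GPU memory here
        self.resident[name] = self.loaders[name]()
        return self.resident[name]

# Usage with stand-in "models"; capacity=1 forces a swap on every switch.
swapper = ModelSwapper(
    {"a": lambda: "model-a", "b": lambda: "model-b"},
    capacity=1,
)
print(swapper.get("a"))  # loads and returns model-a
print(swapper.get("b"))  # evicts model-a, loads and returns model-b
```

The trade-off is that every swap pays the full model-load latency, so this only helps when requests arrive in bursts per model rather than interleaved across all of them.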