BentoML
On-demand loading with limited GPU resource
Hi. Is BentoML planning to support on-demand model loading? My use case: I have multiple models that can't all be loaded at the same time, so I want to load some of them and, when a request demands a different model, swap models in and out to process the request.
We have a similar situation. The underlying problem is that our GPU resources are not enough to load all the models at the same time, so we hope the service can auto-switch between models: load only the selected model and run inference on it.
I checked https://docs.bentoml.com/en/latest/guides/scheduling.html, which doesn't mention what happens when there aren't enough resources for multiple runners.
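In the meantime, the swap-on-demand behavior described above can be approximated outside of BentoML with an LRU-style cache that keeps only a bounded number of models resident and unloads the least recently used one before loading the next. The sketch below is a minimal illustration of that pattern, not a BentoML API: `ModelSwapper`, the `loaders` dict, and `capacity` are all hypothetical names, and the loader callables stand in for real model-loading code (in practice the eviction step would also need to free GPU memory, e.g. by deleting the model and clearing the framework's cache).

```python
from collections import OrderedDict

class ModelSwapper:
    """Keep at most `capacity` models resident; evict the least
    recently used model when a new one must be loaded."""

    def __init__(self, loaders, capacity=1):
        self.loaders = loaders          # name -> callable that loads the model
        self.capacity = capacity
        self.resident = OrderedDict()   # name -> loaded model (LRU order)

    def get(self, name):
        if name in self.resident:
            # Already loaded: mark as most recently used and return it.
            self.resident.move_to_end(name)
            return self.resident[name]
        if len(self.resident) >= self.capacity:
            # Evict the least recently used model to make room.
            _evicted_name, evicted_model = self.resident.popitem(last=False)
            del evicted_model           # real code would free GPU memory here
        self.resident[name] = self.loaders[name]()
        return self.resident[name]

# Usage with stand-in "models"; capacity=1 forces a swap on every switch.
swapper = ModelSwapper(
    {"a": lambda: "model-a", "b": lambda: "model-b"},
    capacity=1,
)
print(swapper.get("a"))  # loads and returns model-a
print(swapper.get("b"))  # evicts model-a, loads and returns model-b
```

The trade-off is that every swap pays the full model-load latency, so this only helps when requests arrive in bursts per model rather than interleaved across all of them.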