DeepSpeed-MII
Limit VRAM usage when serving the model
Is it possible to limit the maximum GPU memory ("max_memory") while serving the model, both for standard and OpenAI-style serving?

The problem is that I have tried to serve the model on two different cards, a 3090 and an RTX 6000 Ada Generation. In both cases, serving the model consumed all of the available VRAM. I want to run an embedding model on the same GPU, but serving leaves no space for it.
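For reference, here is a minimal sketch of the kind of cap I am hoping for, using the non-persistent `mii.pipeline` together with PyTorch's per-process memory fraction. This is a generic PyTorch workaround rather than an MII option, and the model name and fraction below are only placeholders:

```python
# Sketch of a possible workaround, assuming the in-process (non-persistent)
# pipeline. torch.cuda.set_per_process_memory_fraction caps PyTorch's caching
# allocator for this process only; a persistent mii.serve() deployment spawns
# separate worker processes, so the cap would likely not carry over to them.
import torch
import mii

# Assumption: let the LLM use at most ~70% of GPU 0, leaving the rest
# free for the embedding model.
torch.cuda.set_per_process_memory_fraction(0.7, device=0)

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
print(pipe(["DeepSpeed is"], max_new_tokens=32))
```

Even with a cap like this, the serving engine may still try to pre-allocate its KV cache up to (or beyond) the limit and fail if the cap is too tight, so what I am really asking for is a supported `max_memory`-style option in the serving config itself.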