OpenLLM
Can't pass workers_per_resource to the bentoml container
Describe the bug
I have a machine with two GPUs. When I run the model with the openllm start command, everything works:
CUDA_VISIBLE_DEVICES=0,1 TRANSFORMERS_OFFLINE=1 openllm start mistral --model-id mymodel --dtype float16 --gpu-memory-utilization 0.95 --workers-per-resource 0.5
In this case two processes appear, one on each GPU: one for the service and another for the Ray instance.
When I run the start command without
--gpu-memory-utilization 0.95 --workers-per-resource 0.5, only one GPU runs the service and a CUDA out-of-memory error occurs.
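For context, BentoML's workers_per_resource scales the number of spawned workers by the number of visible GPUs: with two GPUs and a value of 0.5, a single worker owns both GPUs, which is what allows vLLM to shard the model across them (one Ray process per GPU). A minimal sketch of that arithmetic (illustrative only, not BentoML's actual implementation):

```python
import math

def expected_workers(num_gpus: int, workers_per_resource: float) -> int:
    """Illustrative: how many workers a workers_per_resource-style
    scheduler would spawn for a given GPU count."""
    # 2 GPUs * 0.5 -> 1 worker holding both GPUs (tensor parallelism);
    # 2 GPUs * 1.0 -> 2 workers, each pinned to a single GPU.
    return max(1, math.floor(num_gpus * workers_per_resource))

print(expected_workers(2, 0.5))  # single worker spanning both GPUs
print(expected_workers(2, 1.0))  # one worker per GPU
```

This is why omitting --workers-per-resource 0.5 leaves the model confined to one GPU and triggers the out-of-memory error.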
However, when I build the image, follow the steps to create the container, and run the Docker image, it fails with a CUDA out-of-memory error, just like the second case above where --gpu-memory-utilization 0.95 --workers-per-resource 0.5 were not passed.
Steps:
openllm build mymodel --backend vllm --serialization safetensors
bentoml containerize mymodel-service:12345 --opt progress=plain
docker run --rm --gpus all -p 3000:3000 -it mymodel-service:12345
To reproduce
No response
Logs
No response
Environment
$ bentoml -v
bentoml, version 1.1.11
$ openllm -v
openllm, 0.4.45.dev2 (compiled: False)
Python (CPython) 3.11.7
System information (Optional)
No response
@aarnphm What is the difference between the two cases, such that the first one (with --gpu-memory-utilization 0.95 --workers-per-resource 0.5) launches two processes, one for the Ray worker and the other for the BentoML service?
Same issue: https://github.com/bentoml/OpenLLM/issues/872
Closing in favor of OpenLLM 0.6.