OpenLLM

Can't pass workers_per_resource to the bentoml container

Open hahmad2008 opened this issue 1 year ago • 2 comments

Describe the bug

I have a machine with two GPUs. I run the model with the openllm start command and everything works as expected: CUDA_VISIBLE_DEVICES=0,1 TRANSFORMERS_OFFLINE=1 openllm start mistral --model-id mymodel --dtype float16 --gpu-memory-utilization 0.95 --workers-per-resource 0.5

  • In this case two processes appear, one on each GPU: one for the service and another for the Ray instance.

When I run the start command without --gpu-memory-utilization 0.95 --workers-per-resource 0.5, only one GPU runs the service and a CUDA out-of-memory error occurs.

However, when I build the image, follow the steps below to create the container, and then run the Docker image, it fails with a CUDA out-of-memory error, just like the second case above where these args are not passed: --gpu-memory-utilization 0.95 --workers-per-resource 0.5 (see the note after the steps).

steps:

  • openllm build mymodel --backend vllm --serialization safetensors
  • bentoml containerize mymodel-service:12345 --opt progress=plain
  • docker run --rm --gpus all -p 3000:3000 -it mymodel-service:12345
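Note: one possible way to inject these options into the container at runtime might be BentoML's runtime configuration file (BENTOML_CONFIG). The following is a minimal, untested sketch; the runner-configuration keys and the runner name llm-mistral-runner are assumptions and should be checked against the Bento's actual runner list and the BentoML configuration docs:

  # Untested sketch: the config keys and runner name below are assumptions.
  cat > bentoml_configuration.yaml <<'EOF'
  runners:
    llm-mistral-runner:
      workers_per_resource: 0.5
      resources:
        nvidia.com/gpu: 2
  EOF

  # Mount the config into the container and point BENTOML_CONFIG at it.
  docker run --rm --gpus all -p 3000:3000 \
    -v "$(pwd)/bentoml_configuration.yaml:/home/bentoml/configuration.yaml" \
    -e BENTOML_CONFIG=/home/bentoml/configuration.yaml \
    -it mymodel-service:12345

This would not cover --gpu-memory-utilization, which is a vLLM engine argument rather than a BentoML runner option.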

To reproduce

No response

Logs

No response

Environment

$ bentoml -v
bentoml, version 1.1.11

$ openllm -v
openllm, 0.4.45.dev2 (compiled: False)
Python (CPython) 3.11.7

System information (Optional)

No response

hahmad2008 avatar Feb 12 '24 14:02 hahmad2008

@aarnphm What is the difference between the previous two cases, such that the first case (when using --gpu-memory-utilization 0.95 --workers-per-resource 0.5) launches two processes, one for the Ray worker and the other for the BentoML service?

hahmad2008 avatar Feb 12 '24 19:02 hahmad2008

Same issue: https://github.com/bentoml/OpenLLM/issues/872

jeremyadamsfisher avatar Feb 13 '24 00:02 jeremyadamsfisher

Closing for OpenLLM 0.6.

bojiang avatar Jul 12 '24 01:07 bojiang