[Bug]: ModuleNotFoundError: No module named 'ray'
Your current environment
N/A
🐛 Describe the bug
Hello, I'm running the provided quickstart Docker command and getting the following error:
INFO: Multiprocessing frontend to use
ipc:///tmp/3f2ae52b-cfde-4764-ad60-361c1c2ced18 for RPC Path.
INFO: Started engine process with PID 57
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/aphrodite/executor/ray_utils.py", line 13, in <module>
import ray
ModuleNotFoundError: No module named 'ray'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/aphrodite/endpoints/openai/rpc/server.py", line 214, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, rpc_path)
File "/usr/local/lib/python3.10/dist-packages/aphrodite/endpoints/openai/rpc/server.py", line 29, in __init__
self.engine = AsyncAphrodite.from_engine_args(async_engine_args)
File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 703, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/args_tools.py", line 936, in create_engine_config
parallel_config = ParallelConfig(
File "/usr/local/lib/python3.10/dist-packages/aphrodite/common/config.py", line 963, in __init__
raise ValueError("Unable to load Ray which is "
ValueError: Unable to load Ray which is required for multi-node inference, please install Ray with `pip install ray`.
No module named 'ray'
^C
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/aphrodite/endpoints/openai/api_server.py", line 802, in <module>
Can you share your Docker command? We should not use Ray unless you launch the engine with --worker-use-ray or --distributed-executor-backend=ray.
Hey, thanks for the quick reply. I was trying the command from the Docker section of the README.md: "Additionally, we provide a Docker image for easy deployment. Here's a basic command to get you started:"
#--env "CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7"
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 2242:2242 \
--ipc=host \
alpindale/aphrodite-openai:latest \
--model NousResearch/Meta-Llama-3.1-8B-Instruct \
--tensor-parallel-size 8 \
--api-keys "sk-empty"
Can you add --distributed-executor-backend=mp to the launch flags?
With the default flags I get the same error:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7" \
-p 2242:2242 \
--ipc=host \
alpindale/aphrodite-openai:latest \
--model NousResearch/Meta-Llama-3.1-8B-Instruct \
--tensor-parallel-size 8 \
--api-keys "sk-empty"
Unable to find image 'alpindale/aphrodite-openai:latest' locally
latest: Pulling from alpindale/aphrodite-openai
3c645031de29: Pull complete
0d6448aff889: Pull complete
0a7674e3e8fe: Pull complete
b71b637b97c5: Pull complete
56dc85502937: Pull complete
c1c890480c74: Pull complete
93929e83ed21: Pull complete
0ead3d2f76c1: Pull complete
60cdee2e316d: Pull complete
518f3d7cac80: Pull complete
336c5995c4b2: Pull complete
Digest: sha256:8bac4170be255c19d29d84ffbdeabdc1b0a09ee511bec7ed0026e349db430357
Status: Downloaded newer image for alpindale/aphrodite-openai:latest
INFO: Multiprocessing frontend to use
ipc:///tmp/535bb624-82bb-42e8-bbc7-5ea63814857e for RPC Path.
INFO: Started engine process with PID 46
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/aphrodite/executor/ray_utils.py", line 13, in <module>
import ray
ModuleNotFoundError: No module named 'ray'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/aphrodite/endpoints/openai/rpc/server.py", line 214, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, rpc_path)
File "/usr/local/lib/python3.10/dist-packages/aphrodite/endpoints/openai/rpc/server.py", line 29, in __init__
self.engine = AsyncAphrodite.from_engine_args(async_engine_args)
File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 703, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/args_tools.py", line 936, in create_engine_config
parallel_config = ParallelConfig(
File "/usr/local/lib/python3.10/dist-packages/aphrodite/common/config.py", line 963, in __init__
raise ValueError("Unable to load Ray which is "
ValueError: Unable to load Ray which is required for multi-node inference, please install Ray with `pip install ray`.
With --distributed-executor-backend=mp it seems to work, after I set CUDA_VISIBLE_DEVICES=0 and --tensor-parallel-size 1:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "CUDA_VISIBLE_DEVICES=0" \
-p 2242:2242 \
--ipc=host \
alpindale/aphrodite-openai:latest \
--model NousResearch/Meta-Llama-3.1-8B-Instruct \
--tensor-parallel-size 1 \
--api-keys "sk-empty" \
--distributed-executor-backend=mp
INFO: Multiprocessing frontend to use
ipc:///tmp/6613166f-863d-42db-98cc-5c78ae5f00a4 for RPC Path.
INFO: Started engine process with PID 44
WARNING: The model has a long context length (131072). This may cause OOM
errors during the initial memory profiling phase, or result in low performance
due to small KV cache space. Consider setting --max-model-len to a smaller
value.
INFO:
--------------------------------------------------------------------------------
-----
INFO: Initializing Aphrodite Engine (v0.6.4.post1 commit 20f11fd0) with the
following config:
INFO: Model = 'NousResearch/Meta-Llama-3.1-8B-Instruct'
INFO: DataType = torch.bfloat16
INFO: Tensor Parallel Size = 1
INFO: Pipeline Parallel Size = 1
INFO: Disable Custom All-Reduce = False
INFO: Context Length = 131072
INFO: Enforce Eager Mode = False
INFO: Prefix Caching = False
INFO: Device = device(type='cuda')
INFO: Guided Decoding Backend =
DecodingConfig(guided_decoding_backend='lm-format-enforcer')
INFO:
--------------------------------------------------------------------------------
-----
WARNING: Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary
CPU contention. Set OMP_NUM_THREADS in the external environment to tune this
value as needed.
INFO: Loading model NousResearch/Meta-Llama-3.1-8B-Instruct...
INFO: Using model weights format ['*.safetensors']
The problem is that the Docker example has --tensor-parallel-size 8 (so 8 GPUs, from my understanding), while the docs say:
--distributed-executor-backend {ray,mp} Category: Parallel Options Backend to use for distributed serving. When more than 1 GPU is used, will be automatically set to "ray" if installed or "mp" (multiprocessing) otherwise.
But in the code, if Ray is not found while multiple GPUs are requested, it raises the ValueError instead of switching to "mp".
One simple "fix" would be to adjust the Docker example on the main page to use only one GPU, with a note that Ray is required for multiple GPUs. Another would be to actually fall back to "mp" when Ray is not found, as the docs describe.
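The documented fallback could be sketched roughly like this. This is a minimal sketch, not Aphrodite's actual code: `resolve_executor_backend` is a hypothetical helper, and per the traceback the real check lives in `ParallelConfig.__init__` in aphrodite/common/config.py.

```python
import importlib.util


def resolve_executor_backend(requested_backend, world_size):
    """Hypothetical sketch of the documented behavior: auto-select
    "ray" if installed, otherwise fall back to "mp", and only raise
    when Ray was explicitly requested but is missing."""
    ray_available = importlib.util.find_spec("ray") is not None

    if requested_backend is not None:
        if requested_backend == "ray" and not ray_available:
            raise ValueError(
                "Ray backend was explicitly requested but Ray is not "
                "installed; install it with `pip install ray`.")
        return requested_backend

    if world_size > 1:
        # Docs: "will be automatically set to 'ray' if installed
        # or 'mp' (multiprocessing) otherwise."
        return "ray" if ray_available else "mp"
    return "mp"
```

With this logic, `--tensor-parallel-size 8` on an image without Ray would quietly select "mp" rather than crashing, while `--distributed-executor-backend=ray` would still fail loudly.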