[Installation]: VLLM on ARM machine with GH200
Your current environment
(I cannot run collect_env since it requires vLLM to be installed.)
$ pip freeze
certifi==2022.12.7
charset-normalizer==2.1.1
filelock==3.16.1
fsspec==2024.10.0
idna==3.4
Jinja2==3.1.4
MarkupSafe==3.0.2
mpmath==1.3.0
networkx==3.4.2
numpy==2.1.3
pillow==10.2.0
pynvml==11.5.3
requests==2.28.1
sympy==1.13.1
torch==2.5.1
typing_extensions==4.12.2
urllib3==1.26.13
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
I have an ARM CPU and an NVIDIA GH200 (Driver Version: 550.90.07, CUDA Version: 12.4).
How you are installing vllm
pip install torch numpy
pip install vllm
I get this error:
pip install vllm
Collecting vllm
Using cached vllm-0.6.4.post1.tar.gz (3.1 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
/tmp/pip-build-env-8t3z_6ag/overlay/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
cpu = _conversion_method_template(device=torch.device("cpu"))
Traceback (most recent call last):
File "/hpi/fs00/home/philipp.hildebrandt/armpython/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/hpi/fs00/home/philipp.hildebrandt/armpython/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/hpi/fs00/home/philipp.hildebrandt/armpython/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/tmp/pip-build-env-8t3z_6ag/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 334, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=[])
File "/tmp/pip-build-env-8t3z_6ag/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 304, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-8t3z_6ag/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 320, in run_setup
exec(code, locals())
File "<string>", line 526, in <module>
File "<string>", line 433, in get_vllm_version
RuntimeError: Unknown runtime environment
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
I thought numpy was missing or there was some problem with torch, which is why I manually installed numpy and torch in a fresh venv before trying again. Torch reports CUDA as available, but the error looks like vLLM might be trying to use a CPU backend. I also tried manually installing pynvml, but it did not change anything.
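For context, one likely explanation (hedged, inferred from the traceback above, which shows torch being imported from a /tmp/pip-build-env-.../overlay directory): pip builds vLLM in an isolated environment with its own torch pulled from PyPI, and on aarch64 the stable PyPI wheel is CPU-only, so vLLM's setup.py detects neither a CUDA build nor a backend it recognizes and raises "Unknown runtime environment". A minimal sketch of how to check this and work around it, assuming a venv that already has a CUDA-enabled torch:
# The stable aarch64 torch wheel from PyPI reports no CUDA build ("None").
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# Workaround sketch: build vLLM against the torch already in the venv instead
# of letting pip create an isolated build environment (full recipe further
# down in this thread).
pip install -e . --no-build-isolation   # run inside a vLLM source checkout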
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@Phimanu PyTorch doesn't support Arm64+CUDA in the stable release, but you can now run it with the nightly version.
I just submitted a PR today (https://github.com/vllm-project/vllm/pull/10499) that updates the Dockerfile and adds a new requirements file specifically to fix this and allow building an Arm64/GH200 version with CUDA from the main repo.
Side note: I've been maintaining a GH200-specific Docker container of vLLM until the PR is merged, if you want to try that (I haven't exhaustively tested everything, but I tried a couple of different models and options to confirm general functionality): https://hub.docker.com/r/drikster80/vllm-gh200-openai/tags
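For anyone who wants to try that container, a hedged example invocation (this assumes the image uses the standard vLLM OpenAI-server entrypoint; pick whatever tag is current on Docker Hub, and the model below is only an example):
# Run the GH200-specific build; the container is assumed to expose the usual
# vLLM OpenAI-compatible server on port 8000.
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 --ipc=host \
  drikster80/vllm-gh200-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct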
Hey, I tried it with the nightly PyTorch version and also your branch, but I still got the same error.
(vllm-arm) philipp.hildebrandt@ga01:~$ python
Python 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.6.0.dev20241120+cu124'
>>> print(torch.cuda.is_available())
True
>>> print(torch.version.cuda)
12.4
(vllm-arm) philipp.hildebrandt@ga01:~$ git clone https://github.com/drikster80/vllm.git
Cloning into 'vllm'...
remote: Enumerating objects: 41671, done.
remote: Counting objects: 100% (7682/7682), done.
remote: Compressing objects: 100% (487/487), done.
remote: Total 41671 (delta 7427), reused 7195 (delta 7195), pack-reused 33989 (from 1)
Receiving objects: 100% (41671/41671), 32.58 MiB | 21.70 MiB/s, done.
Resolving deltas: 100% (32302/32302), done.
(vllm-arm) philipp.hildebrandt@ga01:~$ cd vllm
(vllm-arm) philipp.hildebrandt@ga01:~/vllm$ pip install -e .
Obtaining file:///hpi/fs00/home/philipp.hildebrandt/vllm
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... error
error: subprocess-exited-with-error
× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
/tmp/pip-build-env-s6eeaoxg/overlay/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
cpu = _conversion_method_template(device=torch.device("cpu"))
Traceback (most recent call last):
File "/hpi/fs00/home/philipp.hildebrandt/vllm-arm/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
main()
File "/hpi/fs00/home/philipp.hildebrandt/vllm-arm/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/hpi/fs00/home/philipp.hildebrandt/vllm-arm/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 144, in get_requires_for_build_editable
return hook(config_settings)
File "/tmp/pip-build-env-s6eeaoxg/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 483, in get_requires_for_build_editable
return self.get_requires_for_build_wheel(config_settings)
File "/tmp/pip-build-env-s6eeaoxg/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 334, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=[])
File "/tmp/pip-build-env-s6eeaoxg/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 304, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-s6eeaoxg/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 320, in run_setup
exec(code, locals())
File "<string>", line 526, in <module>
File "<string>", line 433, in get_vllm_version
RuntimeError: Unknown runtime environment
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
I am not sure, but is there some incorrect environment variable that makes vLLM try to use numpy (a CPU backend?)? My full environment dump is below; see also the note after it.
SHELL=/bin/bash
CONDA_EXE=<HOME_DIR>/miniconda3/bin/conda
_CE_M=
LMOD_arch=x86_64
TMUX=/tmp/tmux-9798/default,476541,2
LMOD_DIR=/usr/share/lmod/lmod/libexec
PWD=<HOME_DIR>
SLURM_GTIDS=0
LOGNAME=<USER>
XDG_SESSION_TYPE=tty
CONDA_PREFIX=<HOME_DIR>/miniconda3
SLURM_JOB_PARTITION=sorcery
MODULESHOME=/usr/share/lmod/lmod
MANPATH=/usr/share/lmod/lmod/share/man::
LMOD_PREPEND_BLOCK=normal
MOTD_SHOWN=pam
LANG=C.UTF-8
VIRTUAL_ENV=<HOME_DIR>/vllm-arm
CONDA_PROMPT_MODIFIER=(base)
TMPDIR=/tmp
LMOD_VERSION=6.6
MODULEPATH_ROOT=/usr/modulefiles
CUDA_VISIBLE_DEVICES=0
XDG_SESSION_CLASS=user
LMOD_PKG=/usr/share/lmod/lmod
TERM=screen
_CE_CONDA=
USER=<USER>
TMUX_PANE=%2
CONDA_SHLVL=1
LMOD_SETTARG_CMD=:
SHLVL=3
BASH_ENV=/usr/share/lmod/lmod/init/bash
LMOD_FULL_SETTARG_SUPPORT=no
LMOD_sys=Linux
XDG_SESSION_ID=4503
CONDA_PYTHON_EXE=<HOME_DIR>/miniconda3/bin/python
LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:
LMOD_COLORIZE=yes
XDG_RUNTIME_DIR=/run/user/9798
PS1=(vllm-arm) ${debian_chroot:+($debian_chroot)}\u@\h:\w\$
CONDA_DEFAULT_ENV=base
CUDA_HOME=/usr/local/cuda-12.4
PATH=<HOME_DIR>/vllm-arm/bin:/usr/local/cuda-12.4/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:<HOME_DIR>/.local/bin:<HOME_DIR>/bin:<HOME_DIR>/.local/bin:<HOME_DIR>/bin
MODULEPATH=/etc/lmod/modules:/usr/share/lmod/lmod/modulefiles/
LMOD_CMD=/usr/share/lmod/lmod/libexec/lmod
SSH_TTY=/dev/pts/13
OLDPWD=<HOME_DIR>/vllm
SLURM_JOB_NODELIST=ga01
BASH_FUNC_ml%%=() { eval $($LMOD_DIR/ml_cmd "$@")
}
BASH_FUNC_module%%=() { eval $($LMOD_CMD bash "$@");
[ $? = 0 ] && eval $(${LMOD_SETTARG_CMD:-:} -s sh)
}
_=/usr/bin/env
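For reference, one vLLM-specific variable that steers backend selection at build time is VLLM_TARGET_DEVICE, which is not set in the dump above (as far as I know it defaults to cuda). A hypothetical check, in case a stray value is the culprit:
echo "${VLLM_TARGET_DEVICE:-<unset>}"   # confirm no stray value is set
export VLLM_TARGET_DEVICE=cuda
pip install -e . --no-build-isolation   # build against the nightly CUDA torch already installed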
To successfully run vLLM on the GH200, we followed these steps:
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
# Inside the container
$ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124 # Currently, only the PyTorch nightly has wheels for aarch64 with CUDA.
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py # remove all of vLLM's PyTorch dependency specifications
$ pip install -r requirements-build.txt # install the remaining build-time dependencies
$ pip install -vvv -e . --no-build-isolation # use --no-build-isolation to build against the current PyTorch
# Install Triton, otherwise vLLM throws a "Triton module not found" error
$ git clone https://github.com/triton-lang/triton.git
$ cd triton
$ pip install ninja cmake wheel pybind11 # build-time dependencies
$ pip install -e python
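As a quick sanity check after the source build, something like the following sketch can be used (facebook/opt-125m is just an example model, not part of the steps above):
# Confirm the editable install imports and sees the GH200's GPU.
python -c "import torch, vllm; print(vllm.__version__, torch.cuda.is_available())"
# Optionally start the OpenAI-compatible server with a small example model.
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m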
You can use the same scripts as jetson-containers; just use the Docker image for SBSA. For NVIDIA, ARM means Jetson and future ARM laptops, while SBSA means Grace: https://github.com/dusty-nv/jetson-containers
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
May I ask what version of vLLM you compiled? I haven't been able to compile the latest version on my end.
PyTorch now officially supports aarch64 wheels; if not, you can build it with GitHub ARM runners and CUDA binaries. I ported it to use GitHub ARM runners and SBSA CUDA: https://github.com/Jimver/cuda-toolkit/releases/tag/v0.2.21
@drikster80 when do you plan to release v0.8.1 for GH200?
Working on it today. There are a lot of changes in this version, and I want to keep parity with the x86_64 version, so I need to do a lot of testing before sending it up to Docker Hub. I'm trying to get it pushed up today, but if problems are encountered, it might be a day or so.
Update: The patches for building v0.8.1 on GH200 are up on my branch, but I'm having issues running on pytorch-nightly right now.
Good news: Lambda Labs is now building and publishing a GH200-specific vLLM image, as well as the associated Dockerfile. I did a quick test and it works as expected.
Of note, the Lambda Labs container uses Ubuntu 24.04 and CUDA 12.6.3 and has a couple of packages removed. This is a deviation from upstream (in a good way, IMHO), but the differences could lead to some edge-case problems. The version I typically publish attempts to maintain 1:1 package compatibility with upstream. I'll still push it up once I get the PyTorch problem sorted, but I do recommend using the Lambda Labs version if you are on a GH200.
Thanks @drikster80 much appreciated
Update: I spoke too early. It works for some configs, but I'm having issues with xformers and a missing flashinfer. Still working on fixing that.
I was able to get it to work by building triton 3.2.x from source.
An almost-working image of vLLM 0.8.1 for GH200:
substratusai/vllm-gh200:v0.8.1
Example docker run:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
substratusai/vllm-gh200:v0.8.1 \
--model Qwen/Qwen2.5-Coder-32B-Instruct
Code changes made on top of v0.8.1 are here: https://github.com/substratusai/vllm/tree/v0.8.1-gh200
I will try to make a PR if main still has this issue too.
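Once the container above is running, a quick way to sanity-check it against the standard vLLM OpenAI-compatible endpoints (the model name must match the one passed to the container):
# List the served model(s).
curl http://localhost:8000/v1/models
# Minimal chat completion request.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-Coder-32B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'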
I got 0.8.2 working with flash-infer as well:
substratusai/vllm-gh200:v0.8.2
Example docker runs:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
substratusai/vllm-gh200:v0.8.1 \
--model Qwen/Qwen2.5-Coder-32B-Instruct
# Test same model with flashinfer
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 -d \
--name vllm-test \
--ipc=host \
-e VLLM_WORKER_MULTIPROC_METHOD=spawn \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
${IMAGE} \
--model Qwen/Qwen2.5-Coder-32B-Instruct
I will do some more runs on KubeAI with various models for further testing. Please give this image a try as well. Edit: the KubeAI tests were fine; I'm publishing the new image for KubeAI users.
The source for my image is here: https://github.com/substratusai/vllm-docker/blob/gh200/Dockerfile.cuda-arm. It's based on the Lambda Labs Dockerfile, as mentioned in the source of my Dockerfile.
It seems the upstream Dockerfile is quite broken for GH200. I've spent the last few days messing around with it.
I was able to build a working vLLM 0.8.1 image based on the Lambda Labs Dockerfile as well, and I was further able to include LMCache for CPU offloading support on the GH200. The corresponding Dockerfile.
Hey @rajesh-s, thank you so much for providing this. In your Dockerfile, you need to pip install "numpy<2.0.0", as it installs 2.2+ by default. Making this change leads to a successful run.
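One small detail when applying that pin on a shell line or in a Dockerfile RUN instruction: the "<" has to be quoted, or the shell treats it as a redirect.
# Quote the specifier so "<" is not interpreted as shell redirection.
pip install "numpy<2.0.0"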
Great. Rajesh, let me know how the offloading experience goes for you. I saw lots of "no memory available for offload" errors even though there was plenty of free memory.
I built a new image and did basic tests: substratusai/vllm-gh200:v0.8.3
Please give it a try.
I now officially support SBSA on jetson-containers; wheels will be here: https://pypi.jetson-ai-lab.dev/sbsa/cu128
Solved:
pip3 install xgrammar vllm --index-url https://pypi.jetson-ai-lab.dev/sbsa/cu128/
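A quick hedged check after installing from that index, to confirm the wheel imports and sees the GH200:
# Print the vLLM version, CUDA availability, and the detected GPU name.
python3 -c "import vllm, torch; print(vllm.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"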
@johnnynunez, thanks for the wheels. Using
pip3 install xgrammar vllm --index-url https://pypi.jetson-ai-lab.dev/sbsa/cu128/
I am getting this error:
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/home1/09883/parikshitb52/.local/lib/python3.10/site-packages/vllm/__init__.py", line 10, in <module>
    import vllm.env_override  # isort:skip  # noqa: F401
  File "/home1/09883/parikshitb52/.local/lib/python3.10/site-packages/vllm/env_override.py", line 21, in <module>
    torch._inductor.config.compile_threads = 1
AttributeError: module 'torch._inductor' has no attribute 'config'
I am using the docker image nvcr.io/nvidia/pytorch:23.10-py3. Is this expected?
Thanks!
I don't know. You have to use a Docker image compatible with SBSA.
Is it possible to install vLLM on an ARM machine with a GH200 without Docker (since I have no sudo access)?
Yes.
Thank you for the reply. Does it support CUDA 12.6? I created a new environment using conda with python==3.10, then directly ran the command you provided:
pip3 install xgrammar vllm --index-url https://pypi.jetson-ai-lab.dev/sbsa/cu128/
After that, I installed libstdc++.so.6 via conda using:
conda install -n vllm -c conda-forge libstdcxx-ng
Now I’m able to import both vllm and triton. However, when I try to initialize the model, I still encounter this error:
......
# File: /scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/layernorm.py:121 in forward_native, code: residual = x.to(orig_dtype)
to_6: "bf16[s0, 2048][2048, 1]" = add_3.to(torch.bfloat16); to_6 = None
# File: /scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/layernorm.py:138 in forward_native, code: variance = x_var.pow(2).mean(dim=-1, keepdim=True)
pow_2: "f32[s0, 2048][2048, 1]" = add_3.pow(2)
mean_1: "f32[s0, 1][1, 1]" = pow_2.mean(dim = -1, keepdim = True); pow_2 = None
# File: /scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/layernorm.py:140 in forward_native, code: x = x * torch.rsqrt(variance + self.variance_epsilon)
add_4: "f32[s0, 1][1, 1]" = mean_1 + 1e-06; mean_1 = None
rsqrt_1: "f32[s0, 1][1, 1]" = torch.rsqrt(add_4); add_4 = None
mul_4: "f32[s0, 2048][2048, 1]" = add_3 * rsqrt_1; add_3 = rsqrt_1 = None
# File: /scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/layernorm.py:141 in forward_native, code: x = x.to(orig_dtype)
to_7: "bf16[s0, 2048][2048, 1]" = mul_4.to(torch.bfloat16); mul_4 = None
# File: /scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/layernorm.py:143 in forward_native, code: x = x * self.weight
mul_5: "bf16[s0, 2048][2048, 1]" = to_7 * l_self_modules_norm_parameters_weight_; to_7 = l_self_modules_norm_parameters_weight_ = None
return mul_5
Original traceback:
None
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
Traceback (most recent call last):
File "/scratch/10436/jl77863/LLaMA-Factory/myscripts/eval.py", line 374, in <module>
main()
File "/scratch/10436/jl77863/LLaMA-Factory/myscripts/eval.py", line 255, in main
model = LLM(
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/utils.py", line 1149, in inner
return fn(*args, **kwargs)
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 248, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 522, in from_engine_args
return engine_cls.from_vllm_config(
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 115, in from_vllm_config
return cls(vllm_config=vllm_config,
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 90, in __init__
self.engine_core = EngineCoreClient.make_client(
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 71, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 477, in __init__
super().__init__(
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 395, in __init__
self._wait_for_engine_startup()
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 421, in _wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
Wheels are built with cu128:
- Ubuntu 22.04 / Python 3.10
- Ubuntu 24.04 / Python 3.12
- Ubuntu 24.04 / Python 3.13
The PyTorch on https://pypi.jetson-ai-lab.dev/sbsa/cu128/ isn't built with CUDA, which necessitates first installing PyTorch from the index URL https://download.pytorch.org/whl/cu128 and then installing vLLM. Is this a mistake?
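In other words, the working order seems to be something like the following sketch (index URLs taken from the comments above; treat this as a workaround, not an official recipe):
# 1) Install a CUDA-enabled aarch64 PyTorch from the official cu128 index.
pip3 install torch --index-url https://download.pytorch.org/whl/cu128
# 2) Then install vLLM (and xgrammar) from the SBSA index.
pip3 install xgrammar vllm --index-url https://pypi.jetson-ai-lab.dev/sbsa/cu128/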
Has this issue already been fixed in the current latest vLLM version?
vLLM 0.9.0.1 with latest LMCache 0.3.0 integrated for GH200
docker pull rajesh550/gh200-vllm:0.9.0.1
Dockerfile: https://github.com/rajesh-s/containers-for-gh200/blob/main/vllm/Dockerfile
Hello All,
Is there a target date for releasing aarch64 wheels via official PyPI releases? As far as I can see, we only provide x86_64 wheels today. Thanks!
How can I install vLLM on an ARM machine with an RTX 4090?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!