[Installation]: VLLM on ARM machine with GH200
Your current environment
(I cannot run collect_env since it requires vLLM to be installed.)
$ pip freeze
certifi==2022.12.7
charset-normalizer==2.1.1
filelock==3.16.1
fsspec==2024.10.0
idna==3.4
Jinja2==3.1.4
MarkupSafe==3.0.2
mpmath==1.3.0
networkx==3.4.2
numpy==2.1.3
pillow==10.2.0
pynvml==11.5.3
requests==2.28.1
sympy==1.13.1
torch==2.5.1
typing_extensions==4.12.2
urllib3==1.26.13
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
I have an ARM CPU and an NVIDIA GH200 (Driver Version: 550.90.07, CUDA Version: 12.4).
How you are installing vllm
pip install torch numpy
pip install vllm
I get this error:
pip install vllm
Collecting vllm
Using cached vllm-0.6.4.post1.tar.gz (3.1 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
/tmp/pip-build-env-8t3z_6ag/overlay/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
cpu = _conversion_method_template(device=torch.device("cpu"))
Traceback (most recent call last):
File "/hpi/fs00/home/philipp.hildebrandt/armpython/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/hpi/fs00/home/philipp.hildebrandt/armpython/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/hpi/fs00/home/philipp.hildebrandt/armpython/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/tmp/pip-build-env-8t3z_6ag/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 334, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=[])
File "/tmp/pip-build-env-8t3z_6ag/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 304, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-8t3z_6ag/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 320, in run_setup
exec(code, locals())
File "<string>", line 526, in <module>
File "<string>", line 433, in get_vllm_version
RuntimeError: Unknown runtime environment
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
I thought numpy was missing or there was some problem with torch, which is why I manually installed numpy and torch in a fresh venv before trying again. Torch reports CUDA as available, but the error looks like vLLM might be trying to use a CPU backend. I also tried manually installing pynvml, but it did not change anything.
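For context, one likely explanation (hedged, inferred from the traceback above, which shows torch being imported from a /tmp/pip-build-env-.../overlay directory): pip builds vLLM in an isolated environment with its own torch pulled from PyPI, and on aarch64 the stable PyPI wheel is CPU-only, so vLLM's setup.py detects neither a CUDA build nor a backend it recognizes and raises "Unknown runtime environment". A minimal sketch of how to check this and work around it, assuming a venv that already has a CUDA-enabled torch:
# The stable aarch64 torch wheel from PyPI reports no CUDA build ("None").
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# Workaround sketch: build vLLM against the torch already in the venv instead
# of letting pip create an isolated build environment (full recipe further
# down in this thread).
pip install -e . --no-build-isolation   # run inside a vLLM source checkout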
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@Phimanu PyTorch doesn't support Arm64+CUDA in the stable release, but you can now run it with the nightly version.
I just submitted a PR today (https://github.com/vllm-project/vllm/pull/10499) that updates the Dockerfile and adds a new requirements file specifically to fix this and allow building an Arm64/GH200 version with CUDA from the main repo.
Side note: I've been maintaining a GH200-specific Docker container of vLLM until the PR is merged, if you want to try that (I haven't exhaustively tested everything, but I tried a couple of different models and options to confirm general functionality): https://hub.docker.com/r/drikster80/vllm-gh200-openai/tags
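For anyone who wants to try that container, a hedged example invocation (this assumes the image uses the standard vLLM OpenAI-server entrypoint; pick whatever tag is current on Docker Hub, and the model below is only an example):
# Run the GH200-specific build; the container is assumed to expose the usual
# vLLM OpenAI-compatible server on port 8000.
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 --ipc=host \
  drikster80/vllm-gh200-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct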
Hey, I tried it with the nightly PyTorch version and also your branch, but I still got the same error.
(vllm-arm) philipp.hildebrandt@ga01:~$ python
Python 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.6.0.dev20241120+cu124'
>>> print(torch.cuda.is_available())
True
>>> print(torch.version.cuda)
12.4
(vllm-arm) philipp.hildebrandt@ga01:~$ git clone https://github.com/drikster80/vllm.git
Cloning into 'vllm'...
remote: Enumerating objects: 41671, done.
remote: Counting objects: 100% (7682/7682), done.
remote: Compressing objects: 100% (487/487), done.
remote: Total 41671 (delta 7427), reused 7195 (delta 7195), pack-reused 33989 (from 1)
Receiving objects: 100% (41671/41671), 32.58 MiB | 21.70 MiB/s, done.
Resolving deltas: 100% (32302/32302), done.
(vllm-arm) philipp.hildebrandt@ga01:~$ cd vllm
(vllm-arm) philipp.hildebrandt@ga01:~/vllm$ pip install -e .
Obtaining file:///hpi/fs00/home/philipp.hildebrandt/vllm
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... error
error: subprocess-exited-with-error
× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
/tmp/pip-build-env-s6eeaoxg/overlay/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
cpu = _conversion_method_template(device=torch.device("cpu"))
Traceback (most recent call last):
File "/hpi/fs00/home/philipp.hildebrandt/vllm-arm/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
main()
File "/hpi/fs00/home/philipp.hildebrandt/vllm-arm/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/hpi/fs00/home/philipp.hildebrandt/vllm-arm/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 144, in get_requires_for_build_editable
return hook(config_settings)
File "/tmp/pip-build-env-s6eeaoxg/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 483, in get_requires_for_build_editable
return self.get_requires_for_build_wheel(config_settings)
File "/tmp/pip-build-env-s6eeaoxg/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 334, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=[])
File "/tmp/pip-build-env-s6eeaoxg/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 304, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-s6eeaoxg/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 320, in run_setup
exec(code, locals())
File "<string>", line 526, in <module>
File "<string>", line 433, in get_vllm_version
RuntimeError: Unknown runtime environment
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
I am not sure, but is there some incorrect environment variable that makes vLLM try to use numpy (a CPU backend?)? My full environment dump is below; see also the note after it.
SHELL=/bin/bash
CONDA_EXE=<HOME_DIR>/miniconda3/bin/conda
_CE_M=
LMOD_arch=x86_64
TMUX=/tmp/tmux-9798/default,476541,2
LMOD_DIR=/usr/share/lmod/lmod/libexec
PWD=<HOME_DIR>
SLURM_GTIDS=0
LOGNAME=<USER>
XDG_SESSION_TYPE=tty
CONDA_PREFIX=<HOME_DIR>/miniconda3
SLURM_JOB_PARTITION=sorcery
MODULESHOME=/usr/share/lmod/lmod
MANPATH=/usr/share/lmod/lmod/share/man::
LMOD_PREPEND_BLOCK=normal
MOTD_SHOWN=pam
LANG=C.UTF-8
VIRTUAL_ENV=<HOME_DIR>/vllm-arm
CONDA_PROMPT_MODIFIER=(base)
TMPDIR=/tmp
LMOD_VERSION=6.6
MODULEPATH_ROOT=/usr/modulefiles
CUDA_VISIBLE_DEVICES=0
XDG_SESSION_CLASS=user
LMOD_PKG=/usr/share/lmod/lmod
TERM=screen
_CE_CONDA=
USER=<USER>
TMUX_PANE=%2
CONDA_SHLVL=1
LMOD_SETTARG_CMD=:
SHLVL=3
BASH_ENV=/usr/share/lmod/lmod/init/bash
LMOD_FULL_SETTARG_SUPPORT=no
LMOD_sys=Linux
XDG_SESSION_ID=4503
CONDA_PYTHON_EXE=<HOME_DIR>/miniconda3/bin/python
LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:
LMOD_COLORIZE=yes
XDG_RUNTIME_DIR=/run/user/9798
PS1=(vllm-arm) ${debian_chroot:+($debian_chroot)}\u@\h:\w\$
CONDA_DEFAULT_ENV=base
CUDA_HOME=/usr/local/cuda-12.4
PATH=<HOME_DIR>/vllm-arm/bin:/usr/local/cuda-12.4/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:<HOME_DIR>/.local/bin:<HOME_DIR>/bin:<HOME_DIR>/.local/bin:<HOME_DIR>/bin
MODULEPATH=/etc/lmod/modules:/usr/share/lmod/lmod/modulefiles/
LMOD_CMD=/usr/share/lmod/lmod/libexec/lmod
SSH_TTY=/dev/pts/13
OLDPWD=<HOME_DIR>/vllm
SLURM_JOB_NODELIST=ga01
BASH_FUNC_ml%%=() { eval $($LMOD_DIR/ml_cmd "$@")
}
BASH_FUNC_module%%=() { eval $($LMOD_CMD bash "$@");
[ $? = 0 ] && eval $(${LMOD_SETTARG_CMD:-:} -s sh)
}
_=/usr/bin/env
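For reference, one vLLM-specific variable that steers backend selection at build time is VLLM_TARGET_DEVICE, which is not set in the dump above (as far as I know it defaults to cuda). A hypothetical check, in case a stray value is the culprit:
echo "${VLLM_TARGET_DEVICE:-<unset>}"   # confirm no stray value is set
export VLLM_TARGET_DEVICE=cuda
pip install -e . --no-build-isolation   # build against the nightly CUDA torch already installed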
To successfully run vLLM on the GH200, we followed these steps:
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
# Inside the container
$ pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124 # Currently, only the PyTorch nightly has wheels for aarch64 with CUDA.
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py # remove all of vLLM's PyTorch dependency specifications
$ pip install -r requirements-build.txt # install the remaining build-time dependencies
$ pip install -vvv -e . --no-build-isolation # use --no-build-isolation to build against the current PyTorch
# Install Triton, otherwise vLLM throws a "Triton module not found" error
$ git clone https://github.com/triton-lang/triton.git
$ cd triton
$ pip install ninja cmake wheel pybind11 # build-time dependencies
$ pip install -e python
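As a quick sanity check after the source build, something like the following sketch can be used (facebook/opt-125m is just an example model, not part of the steps above):
# Confirm the editable install imports and sees the GH200's GPU.
python -c "import torch, vllm; print(vllm.__version__, torch.cuda.is_available())"
# Optionally start the OpenAI-compatible server with a small example model.
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m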
You can use the same scripts as jetson-containers; just use the Docker image for SBSA. For NVIDIA, ARM means Jetson and future ARM laptops, while SBSA means Grace: https://github.com/dusty-nv/jetson-containers
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
May I ask what version of vLLM you compiled? I haven't been able to compile the latest version on my end.
PyTorch now officially supports aarch64 wheels; if not, you can build it with GitHub ARM runners and CUDA binaries. I ported it to use GitHub ARM runners and SBSA CUDA: https://github.com/Jimver/cuda-toolkit/releases/tag/v0.2.21
@drikster80 when do you plan to release v0.8.1 for GH200?
Working on it today. There are a lot of changes in this version, and I want to keep parity with the x86_64 version, so I need to do a lot of testing before sending it up to Docker Hub. I'm trying to get it pushed up today, but if problems are encountered, it might be a day or so.
Update: The patches for building v0.8.1 on GH200 are up on my branch, but I'm having issues running on pytorch-nightly right now.
Good news: Lambda Labs is now building and publishing a GH200-specific vLLM image, as well as the associated Dockerfile. I did a quick test and it works as expected.
Of note, the Lambda Labs container uses Ubuntu 24.04 and CUDA 12.6.3 and has a couple of packages removed. This is a deviation from upstream (in a good way, IMHO), but the differences could lead to some edge-case problems. The version I typically publish attempts to maintain 1:1 package compatibility with upstream. I'll still push it up once I get the PyTorch problem sorted, but I do recommend using the Lambda Labs version if you are on a GH200.
Thanks @drikster80 much appreciated
Update: I spoke too early. It works for some configs, but I'm having issues with xformers and a missing flashinfer. Still working on fixing that.
I was able to get it to work by building triton 3.2.x from source.
An almost-working image of vLLM 0.8.1 for GH200:
substratusai/vllm-gh200:v0.8.1
Example docker run:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
substratusai/vllm-gh200:v0.8.1 \
--model Qwen/Qwen2.5-Coder-32B-Instruct
Code changes made on top of v0.8.1 are here: https://github.com/substratusai/vllm/tree/v0.8.1-gh200
I will try to make a PR if main still has this issue too.
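Once the container above is running, a quick way to sanity-check it against the standard vLLM OpenAI-compatible endpoints (the model name must match the one passed to the container):
# List the served model(s).
curl http://localhost:8000/v1/models
# Minimal chat completion request.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-Coder-32B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'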
I got 0.8.2 working with flash-infer as well:
substratusai/vllm-gh200:v0.8.2
Example docker runs:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
substratusai/vllm-gh200:v0.8.1 \
--model Qwen/Qwen2.5-Coder-32B-Instruct
# Test same model with flashinfer
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 -d \
--name vllm-test \
--ipc=host \
-e VLLM_WORKER_MULTIPROC_METHOD=spawn \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
${IMAGE} \
--model Qwen/Qwen2.5-Coder-32B-Instruct
I will do some more runs on KubeAI with various models for further testing. Please give this image a try as well. Edit: the KubeAI tests were fine; I'm publishing the new image for KubeAI users.
The source for my image is here: https://github.com/substratusai/vllm-docker/blob/gh200/Dockerfile.cuda-arm. It's based on the Lambda Labs Dockerfile, as mentioned in the source of my Dockerfile.
It seems the upstream Dockerfile is quite broken for GH200. I've spent the last few days messing around with it.
I was able to build a working vLLM 0.8.1 image based on the Lambda Labs Dockerfile as well, and I was further able to include LMCache for CPU offloading support on the GH200. The corresponding Dockerfile.
Hey @rajesh-s, thank you so much for providing this. In your Dockerfile, you need to pip install "numpy<2.0.0", as it installs 2.2+ by default. Making this change leads to a successful run.
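One small detail when applying that pin on a shell line or in a Dockerfile RUN instruction: the "<" has to be quoted, or the shell treats it as a redirect.
# Quote the specifier so "<" is not interpreted as shell redirection.
pip install "numpy<2.0.0"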
Great. Rajesh, let me know how the offloading experience goes for you. I saw lots of "no memory available for offload" errors even though there was plenty of free memory.
I built a new image and did basic tests: substratusai/vllm-gh200:v0.8.3
Please give it a try.
I now officially support SBSA on jetson-containers; wheels will be here: https://pypi.jetson-ai-lab.dev/sbsa/cu128
Solved:
pip3 install xgrammar vllm --index-url https://pypi.jetson-ai-lab.dev/sbsa/cu128/
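A quick hedged check after installing from that index, to confirm the wheel imports and sees the GH200:
# Print the vLLM version, CUDA availability, and the detected GPU name.
python3 -c "import vllm, torch; print(vllm.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"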
@johnnynunez, thanks for the wheels. Using
pip3 install xgrammar vllm --index-url https://pypi.jetson-ai-lab.dev/sbsa/cu128/
I am getting this error:
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/home1/09883/parikshitb52/.local/lib/python3.10/site-packages/vllm/__init__.py", line 10, in <module>
    import vllm.env_override  # isort:skip  # noqa: F401
  File "/home1/09883/parikshitb52/.local/lib/python3.10/site-packages/vllm/env_override.py", line 21, in <module>
    torch._inductor.config.compile_threads = 1
AttributeError: module 'torch._inductor' has no attribute 'config'
I am using the docker image nvcr.io/nvidia/pytorch:23.10-py3. Is this expected?
Thanks!
I don't know. You have to use a Docker image compatible with SBSA.
Is it possible to install vLLM on an ARM machine with a GH200 without Docker (since I have no sudo access)?
Yes.
Thank you for the reply. Does it support CUDA 12.6? I created a new environment using conda with python==3.10, then directly ran the command you provided:
pip3 install xgrammar vllm --index-url https://pypi.jetson-ai-lab.dev/sbsa/cu128/
After that, I installed libstdc++.so.6 via conda using:
conda install -n vllm -c conda-forge libstdcxx-ng
Now I’m able to import both vllm and triton. However, when I try to initialize the model, I still encounter this error:
......
# File: /scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/layernorm.py:121 in forward_native, code: residual = x.to(orig_dtype)
to_6: "bf16[s0, 2048][2048, 1]" = add_3.to(torch.bfloat16); to_6 = None
# File: /scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/layernorm.py:138 in forward_native, code: variance = x_var.pow(2).mean(dim=-1, keepdim=True)
pow_2: "f32[s0, 2048][2048, 1]" = add_3.pow(2)
mean_1: "f32[s0, 1][1, 1]" = pow_2.mean(dim = -1, keepdim = True); pow_2 = None
# File: /scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/layernorm.py:140 in forward_native, code: x = x * torch.rsqrt(variance + self.variance_epsilon)
add_4: "f32[s0, 1][1, 1]" = mean_1 + 1e-06; mean_1 = None
rsqrt_1: "f32[s0, 1][1, 1]" = torch.rsqrt(add_4); add_4 = None
mul_4: "f32[s0, 2048][2048, 1]" = add_3 * rsqrt_1; add_3 = rsqrt_1 = None
# File: /scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/layernorm.py:141 in forward_native, code: x = x.to(orig_dtype)
to_7: "bf16[s0, 2048][2048, 1]" = mul_4.to(torch.bfloat16); mul_4 = None
# File: /scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/layernorm.py:143 in forward_native, code: x = x * self.weight
mul_5: "bf16[s0, 2048][2048, 1]" = to_7 * l_self_modules_norm_parameters_weight_; to_7 = l_self_modules_norm_parameters_weight_ = None
return mul_5
Original traceback:
None
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
Traceback (most recent call last):
File "/scratch/10436/jl77863/LLaMA-Factory/myscripts/eval.py", line 374, in <module>
main()
File "/scratch/10436/jl77863/LLaMA-Factory/myscripts/eval.py", line 255, in main
model = LLM(
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/utils.py", line 1149, in inner
return fn(*args, **kwargs)
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 248, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 522, in from_engine_args
return engine_cls.from_vllm_config(
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 115, in from_vllm_config
return cls(vllm_config=vllm_config,
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 90, in __init__
self.engine_core = EngineCoreClient.make_client(
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 71, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 477, in __init__
super().__init__(
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 395, in __init__
self._wait_for_engine_startup()
File "/scratch/10436/jl77863/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 421, in _wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
Wheels are built with cu128:
- Ubuntu 22.04 / Python 3.10
- Ubuntu 24.04 / Python 3.12
- Ubuntu 24.04 / Python 3.13
The PyTorch on https://pypi.jetson-ai-lab.dev/sbsa/cu128/ isn't built with CUDA, which necessitates first installing PyTorch from the index URL https://download.pytorch.org/whl/cu128 and then installing vLLM. Is this a mistake?
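In other words, the working order seems to be something like the following sketch (index URLs taken from the comments above; treat this as a workaround, not an official recipe):
# 1) Install a CUDA-enabled aarch64 PyTorch from the official cu128 index.
pip3 install torch --index-url https://download.pytorch.org/whl/cu128
# 2) Then install vLLM (and xgrammar) from the SBSA index.
pip3 install xgrammar vllm --index-url https://pypi.jetson-ai-lab.dev/sbsa/cu128/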
Has this issue already been fixed in the current latest vLLM version?
vLLM 0.9.0.1 with latest LMCache 0.3.0 integrated for GH200
docker pull rajesh550/gh200-vllm:0.9.0.1
Dockerfile: https://github.com/rajesh-s/containers-for-gh200/blob/main/vllm/Dockerfile
Hello All,
Is there a target date for releasing aarch64 wheels via official PyPI releases? As far as I can see, we only provide x86_64 wheels today. Thanks!
How can I install vLLM on an ARM machine with an RTX 4090?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!