[Hardware][Nvidia] Enable support for Pascal GPUs (sm_60, sm_61)
FIX: https://github.com/vllm-project/vllm/issues/963 https://github.com/vllm-project/vllm/issues/1284
Related: https://github.com/vllm-project/vllm/pull/4290 https://github.com/vllm-project/vllm/pull/2635
--
This is a new PR, created as a placeholder in the hope that the wheel size >100MB request is someday granted. It only adds compute capabilities 6.0 and 6.1. Note: PyTorch now only ships sm_60 among the Pascal architectures in its prebuilt wheels:
>>> torch.__version__
'2.2.1+cu121'
>>> torch.cuda.get_arch_list()
['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
>>>
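As a hedged illustration (pure Python, no GPU required; the real torch calls appear only in comments), checking whether a wheel's arch list covers a given compute capability can be sketched like this:

```python
# Check whether kernels for a given compute capability were compiled
# into a wheel. With torch installed, the real values would come from:
#   archs = torch.cuda.get_arch_list()
#   major, minor = torch.cuda.get_device_capability(0)
# Here the arch list printed above is hard-coded for illustration.

archs = ['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']

def compiled_for(major: int, minor: int, arch_list) -> bool:
    """True if sm_{major}{minor} kernels are present in the wheel."""
    return f"sm_{major}{minor}" in arch_list

print(compiled_for(6, 0, archs))  # P100 (sm_60): True
print(compiled_for(6, 1, archs))  # P40/P4 (sm_61): False
```

(Note that sm_61 devices can generally run same-major sm_60 binaries, so a missing sm_61 entry is not necessarily fatal for PyTorch itself.)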
Pascal Architecture
- (+) SM60 or SM_60, compute_60 – Quadro GP100, Tesla P100, DGX-1 (Generic Pascal)
- (+) SM61 or SM_61, compute_61 – GTX 1080, GTX 1070, GTX 1060, GTX 1050, GT 1030 (GP108), GT 1010 (GP108), Titan Xp, Tesla P40, Tesla P4, Discrete GPU on the NVIDIA Drive PX2
- (-) SM62 or SM_62, compute_62 – Integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2
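To make the coverage concrete, here is a small sketch (the parts and capabilities are taken from the list above) that checks whether a given Pascal part is covered by the archs this PR adds:

```python
# Compute capabilities of the Pascal parts listed above.
PASCAL_CC = {
    "Quadro GP100": (6, 0), "Tesla P100": (6, 0),
    "GTX 1080": (6, 1), "GTX 1070": (6, 1), "GTX 1060": (6, 1),
    "GTX 1050": (6, 1), "Titan Xp": (6, 1),
    "Tesla P40": (6, 1), "Tesla P4": (6, 1),
    "Jetson TX2": (6, 2),  # integrated GPU; deliberately not added by this PR
}

# Archs this PR adds: compute capability 6.0 and 6.1.
ADDED = {(6, 0), (6, 1)}

def covered(name: str) -> bool:
    """True if the part would get native kernels from this PR."""
    return PASCAL_CC[name] in ADDED

print(covered("Tesla P4"))    # True
print(covered("Jetson TX2"))  # False
```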
Example test on 4 x P100 GPUs on a CUDA 12.2 system:
# build
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm-openai --no-cache
# run
docker run -d \
--shm-size=10.24gb \
--gpus '"device=0,1,2,3"' \
-v /data/models:/root/.cache/huggingface \
--env "HF_TOKEN=xyz" \
-p 8000:8000 \
--restart unless-stopped \
--name vllm-openai \
vllm-openai \
--host 0.0.0.0 \
--model=mistralai/Mistral-7B-Instruct-v0.1 \
--enforce-eager \
--dtype=float \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size=4
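Once the container above is up, a minimal sanity check against its OpenAI-compatible endpoint might look like this (a sketch, not part of the PR: the host/port match the docker run flags above and the model name must match --model; the actual network call is commented out so the snippet stands alone):

```python
import json
import urllib.request

# Endpoint and model taken from the docker run command above.
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "prompt": "San Francisco is a",
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment with the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
print(req.full_url)
```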
@youkaichao As I see it, pypi/support#3792 has been approved. Is it possible to merge this PR now?
(From Release Tracker)
https://github.com/vllm-project/vllm/pull/4409 might need a little bit more discussion given what features are supported for Pascal GPUs and whether building from source might be a better option.
I've been using vLLM on my P40s every day for almost a month now, and everything works fine. Triton didn't accept one of my patches (they said they dropped support for pre-A100 GPUs, so I expect there will soon be problems with other older architectures as well), so things that depend on Triton and use the tl.dot operation won't work (prefix caching, for example). However, there is a patched Triton (sasha0552/triton), and installing just the patched Triton is easier than installing both a patched Triton and a patched vLLM, especially since the basic functionality works fine without Triton.
Maybe the patched Triton could be shipped like nccl (although not installed by default)? The patch is very simple, and I don't think it would be hard to maintain. I can maintain support for Pascal GPUs if needed (I'm not going to move on from these GPUs until better options become available for the price per GB of VRAM).
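For illustration, the feature gating described above (tl.dot-based features unavailable on stock Triton for older GPUs) could be sketched like this; the function names are hypothetical, and the >= sm_80 cutoff is an assumption based on the "pre-A100" remark:

```python
# Hypothetical sketch of gating Triton-dependent features on compute
# capability. Per the comment above, upstream Triton dropped pre-A100
# support, so tl.dot-based features are assumed here to need >= sm_80
# (unless a patched Triton such as sasha0552/triton is installed).

def triton_dot_available(major: int, minor: int,
                         patched_triton: bool = False) -> bool:
    return patched_triton or (major, minor) >= (8, 0)

def prefix_caching_available(major: int, minor: int,
                             patched_triton: bool = False) -> bool:
    # Prefix caching relies on tl.dot, so it inherits the same gate.
    return triton_dot_available(major, minor, patched_triton)

print(prefix_caching_available(6, 1))                       # P40: False
print(prefix_caching_available(6, 1, patched_triton=True))  # True
print(prefix_caching_available(8, 0))                       # A100: True
```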
P.S. Whoever is reading this, you might want to check out my project, which has pre-built vllm and triton wheels for Pascal GPUs (and also patches & build scripts).
Does this mean I can't run vllm on a Tesla P4, even a small model?
@AslanEZ I believe the P4 has a compute capability of 6.1, which this PR adds. Have you tested it?
I have tested it by installing with pip. It didn't work.
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/xformers.py", line 323, in forward
[rank0]:     output[num_prefill_tokens:] = PagedAttention.forward_decode(
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
I intend to try your code now.
Oh, it works! Thank you!
Could we get an update on the status of this PR? I've been eagerly awaiting it, as I can't use vllm until it supports my hardware.
@dirkson it was answered here https://github.com/vllm-project/vllm/issues/6434#issuecomment-2231636764
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Not stale. Also, this PR only increases the wheel size by 10 MB, so please consider it.
Wanted to express interest in pascal support too - thank you all for all your work on these projects
Tested the Docker image provided, and this finally made it work with some of my older GPUs.
+1
Yes still relevant - ~~Where are these pre-built docker images that are mentioned?~~ They are here if people missed it (like me): https://github.com/sasha0552/pascal-pkgs-ci
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jasonacox.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
I really have no idea how to fix this. Any suggestions?
Don't worry about this; a committer can fix it directly.
Closing based on the stance taken in https://github.com/vllm-project/vllm/issues/6434#issuecomment-2231636764