
[Hardware][Nvidia] Enable support for Pascal GPUs

jasonacox opened this pull request 1 year ago • 15 comments

[Hardware][Nvidia] Enable support for Pascal GPUs (sm_60, sm_61)

FIX: https://github.com/vllm-project/vllm/issues/963 https://github.com/vllm-project/vllm/issues/1284

Related: https://github.com/vllm-project/vllm/pull/4290 https://github.com/vllm-project/vllm/pull/2635

--

This is a new PR, serving as a placeholder in the hope that the request to raise the PyPI wheel size limit above 100 MB is someday granted. It only adds compute capabilities 6.0 and 6.1. Note: stock PyTorch wheels currently ship only sm_60 for Pascal:

>>> torch.__version__
'2.2.1+cu121'
>>> torch.cuda.get_arch_list()
['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
>>> 

Pascal Architecture

  • (+) SM60 or SM_60, compute_60 – Quadro GP100, Tesla P100, DGX-1 (Generic Pascal)
  • (+) SM61 or SM_61, compute_61 – GTX 1080, GTX 1070, GTX 1060, GTX 1050, GT 1030 (GP108), GT 1010 (GP108), Titan Xp, Tesla P40, Tesla P4, discrete GPU on the NVIDIA Drive PX2
  • (-) SM62 or SM_62, compute_62 – integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2
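
The practical consequence of that arch list can be sketched as a simple membership check. This is a hedged illustration, not vLLM's or PyTorch's actual build logic; `has_kernel_image` and `arch_for_capability` are hypothetical names:

```python
# Hedged sketch: decide whether a wheel built for a given set of CUDA
# architectures contains a binary for a device with a given compute
# capability. BUILT_ARCHS mirrors the torch.cuda.get_arch_list() output
# shown above, where sm_61 (P40/P4, GTX 10xx) is missing.

BUILT_ARCHS = ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]

def arch_for_capability(major, minor):
    """Format a (major, minor) compute capability as an sm_XY string."""
    return f"sm_{major}{minor}"

def has_kernel_image(major, minor, built=BUILT_ARCHS):
    """True if a binary for the exact architecture is present.

    Simplification: a real CUDA fatbin can also serve newer devices via
    forward-compatible PTX, so an exact match is not strictly required.
    """
    return arch_for_capability(major, minor) in built

print(has_kernel_image(6, 0))  # P100 (sm_60) -> True
print(has_kernel_image(6, 1))  # P40/P4 (sm_61) -> False: "no kernel image"
```

This is why a P100 may work with stock wheels while a P40 or P4 fails at runtime until sm_61 is added to the build.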

Example test on 4 x P100 GPUs on a CUDA 12.2 system:

# build
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm-openai --no-cache

# run
docker run -d \
    --shm-size=10.24gb \
    --gpus '"device=0,1,2,3"' \
    -v /data/models:/root/.cache/huggingface \
    --env "HF_TOKEN=xyz" \
    -p 8000:8000 \
    --restart unless-stopped \
    --name vllm-openai \
    vllm-openai \
    --host 0.0.0.0 \
    --model=mistralai/Mistral-7B-Instruct-v0.1 \
    --enforce-eager \
    --dtype=float \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size=4
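
Once the container is up, a quick smoke test against the OpenAI-compatible endpoint can be sketched as below. The `/v1/completions` path is vLLM's standard OpenAI-compatible route; the host, port, and model name are taken from the docker run command above and should be adjusted for your setup:

```python
# Hedged sketch: build (and optionally send) a completion request to the
# server the container above exposes on port 8000.
import json
import urllib.request

def build_completion_request(host="http://localhost:8000",
                             model="mistralai/Mistral-7B-Instruct-v0.1",
                             prompt="Say hello."):
    """Construct a request for vLLM's OpenAI-compatible completions API."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 16}
    return urllib.request.Request(
        f"{host}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_completion_request()
print(req.full_url)  # http://localhost:8000/v1/completions
# With the container running: urllib.request.urlopen(req).read()
```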

jasonacox avatar Apr 27 '24 05:04 jasonacox

@youkaichao As I see it, pypi/support#3792 has been approved. Is it possible to merge this PR now?

sasha0552 avatar May 10 '24 01:05 sasha0552

(From Release Tracker)

https://github.com/vllm-project/vllm/pull/4409 might need a little bit more discussion given what features are supported for Pascal GPUs and whether building from source might be a better option.

I've been using vLLM on my P40s every day for almost a month now, and everything works fine. triton didn't accept one of my patches (they said they had dropped support for pre-A100 GPUs, so I expect problems with other older architectures soon as well), which means anything that depends on triton and uses the tl.dot operation won't work (prefix caching, for example). However, there is a patched triton (sasha0552/triton), and installing just the patched triton is easier than installing both a patched triton and a patched vLLM, especially since the basic functionality works fine without triton.

Maybe the patched triton could be shipped like nccl (although not installed by default)? The patch is very simple, and I don't think it would be hard to maintain. I can maintain Pascal support if needed (I'm not going to move on from these GPUs until better options become available for the price per GB of VRAM).
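
The trade-off described above could be handled with a capability gate. This is a hedged sketch, not vLLM's actual API: the function names are illustrative, and the sm_80 cutoff is an assumption based on the comment that upstream triton dropped pre-A100 support:

```python
# Hedged sketch: gate triton-dependent features (e.g. prefix caching,
# which relies on tl.dot) on the device's compute capability, falling
# back gracefully instead of crashing on older GPUs.

TRITON_MIN_CAPABILITY = (8, 0)  # assumed cutoff: A100 (Ampere) and newer

def triton_supported(capability):
    """True if stock triton kernels are expected to run on this device."""
    return tuple(capability) >= TRITON_MIN_CAPABILITY

def resolve_prefix_caching(requested, capability):
    """Disable prefix caching when the stock triton kernels cannot run."""
    if requested and not triton_supported(capability):
        major, minor = capability
        print(f"warning: disabling prefix caching on sm_{major}{minor}")
        return False
    return requested

print(resolve_prefix_caching(True, (6, 1)))  # P40 -> False (with warning)
print(resolve_prefix_caching(True, (9, 0)))  # H100 -> True
```

With a patched triton installed, such a gate would be lifted or relaxed; the sketch only covers the stock-triton case.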

P.S. Whoever is reading this, you might want to check out my project, which has pre-built vllm and triton wheels for Pascal GPUs (and also patches & build scripts).

sasha0552 avatar May 19 '24 06:05 sasha0552

Does this mean I can't run vllm on a Tesla P4, even a small model?

AslanEZ avatar Jul 02 '24 13:07 AslanEZ

Does this mean I can't run vllm on a Tesla P4, even a small model?

@AslanEZ I believe the P4 has a compute capability of 6.1. This PR requests to add that. Have you tested?

jasonacox avatar Jul 03 '24 05:07 jasonacox

Does this mean I can't run vllm on a Tesla P4, even a small model?

@AslanEZ I believe the P4 has a compute capability of 6.1. This PR requests to add that. Have you tested?

I have tested it by installing with pip. It didn't work.

[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/xformers.py", line 323, in forward
[rank0]:     output[num_prefill_tokens:] = PagedAttention.forward_decode(
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I intend to try your code now.

AslanEZ avatar Jul 03 '24 06:07 AslanEZ

Does this mean I can't run vllm on a Tesla P4, even a small model?

@AslanEZ I believe the P4 has a compute capability of 6.1. This PR requests to add that. Have you tested?

Oh, it works! Thank you!

AslanEZ avatar Jul 03 '24 08:07 AslanEZ

Could we get an update on the status of this PR? I've been eagerly awaiting it, as I can't use vllm until it supports my hardware.

dirkson avatar Aug 12 '24 17:08 dirkson

@dirkson it was answered here https://github.com/vllm-project/vllm/issues/6434#issuecomment-2231636764

sasha0552 avatar Aug 12 '24 19:08 sasha0552

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions[bot] avatar Nov 11 '24 01:11 github-actions[bot]

Not stale. Also, this PR only increases the wheel size by 10 MB, so please consider merging it.

sasha0552 avatar Nov 11 '24 09:11 sasha0552

Wanted to express interest in Pascal support too - thank you all for your work on these projects

aaron-asdf avatar Dec 16 '24 19:12 aaron-asdf

Tested the Docker image provided, and this finally made it work with some of my older GPUs.

torsteinelv avatar Jan 09 '25 18:01 torsteinelv

+1

j0yk1ll avatar Jan 20 '25 07:01 j0yk1ll

Yes still relevant - ~~Where are these pre-built docker images that are mentioned?~~ They are here if people missed it (like me): https://github.com/sasha0552/pascal-pkgs-ci

hs-ye avatar Jan 30 '25 12:01 hs-ye

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jasonacox.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Feb 05 '25 19:02 mergify[bot]

I really have no idea how to fix this. Any suggestions?

[screenshot of the failing check]

jasonacox avatar Feb 19 '25 05:02 jasonacox

I really have no idea how to fix this. Any suggestions?

[screenshot of the failing check]

Don't worry about this; a committer can fix it directly.

jeejeelee avatar Feb 19 '25 06:02 jeejeelee

Closing based on the stance taken in https://github.com/vllm-project/vllm/issues/6434#issuecomment-2231636764

hmellor avatar Feb 28 '25 13:02 hmellor