[Hardware][Nvidia] Enable support for Pascal GPUs (sm_60, sm_61)
FIX: https://github.com/vllm-project/vllm/issues/963 https://github.com/vllm-project/vllm/issues/1284
Related: https://github.com/vllm-project/vllm/pull/4290 https://github.com/vllm-project/vllm/pull/2635
--
This is a new PR, created as a placeholder in the hope that the wheel size >100MB request is someday granted. It only adds compute capabilities 6.0 and 6.1. Note: PyTorch now only ships sm_60 among the Pascal architectures in its prebuilt wheels:
>>> torch.__version__
'2.2.1+cu121'
>>> torch.cuda.get_arch_list()
['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
>>>
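As a hedged illustration (pure Python, no GPU required; the real torch calls appear only in comments), checking whether a wheel's arch list covers a given compute capability can be sketched like this:

```python
# Check whether kernels for a given compute capability were compiled
# into a wheel. With torch installed, the real values would come from:
#   archs = torch.cuda.get_arch_list()
#   major, minor = torch.cuda.get_device_capability(0)
# Here the arch list printed above is hard-coded for illustration.

archs = ['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']

def compiled_for(major: int, minor: int, arch_list) -> bool:
    """True if sm_{major}{minor} kernels are present in the wheel."""
    return f"sm_{major}{minor}" in arch_list

print(compiled_for(6, 0, archs))  # P100 (sm_60): True
print(compiled_for(6, 1, archs))  # P40/P4 (sm_61): False
```

(Note that sm_61 devices can generally run same-major sm_60 binaries, so a missing sm_61 entry is not necessarily fatal for PyTorch itself.)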
Pascal Architecture
- (+) SM60 or SM_60, compute_60 – Quadro GP100, Tesla P100, DGX-1 (Generic Pascal)
- (+) SM61 or SM_61, compute_61 – GTX 1080, GTX 1070, GTX 1060, GTX 1050, GT 1030 (GP108), GT 1010 (GP108), Titan Xp, Tesla P40, Tesla P4, Discrete GPU on the NVIDIA Drive PX2
- (-) SM62 or SM_62, compute_62 – Integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2
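To make the coverage concrete, here is a small sketch (the parts and capabilities are taken from the list above) that checks whether a given Pascal part is covered by the archs this PR adds:

```python
# Compute capabilities of the Pascal parts listed above.
PASCAL_CC = {
    "Quadro GP100": (6, 0), "Tesla P100": (6, 0),
    "GTX 1080": (6, 1), "GTX 1070": (6, 1), "GTX 1060": (6, 1),
    "GTX 1050": (6, 1), "Titan Xp": (6, 1),
    "Tesla P40": (6, 1), "Tesla P4": (6, 1),
    "Jetson TX2": (6, 2),  # integrated GPU; deliberately not added by this PR
}

# Archs this PR adds: compute capability 6.0 and 6.1.
ADDED = {(6, 0), (6, 1)}

def covered(name: str) -> bool:
    """True if the part would get native kernels from this PR."""
    return PASCAL_CC[name] in ADDED

print(covered("Tesla P4"))    # True
print(covered("Jetson TX2"))  # False
```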
Example test on 4 x P100 GPUs on a CUDA 12.2 system:
# build
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm-openai --no-cache
# run
docker run -d \
--shm-size=10.24gb \
--gpus '"device=0,1,2,3"' \
-v /data/models:/root/.cache/huggingface \
--env "HF_TOKEN=xyz" \
-p 8000:8000 \
--restart unless-stopped \
--name vllm-openai \
vllm-openai \
--host 0.0.0.0 \
--model=mistralai/Mistral-7B-Instruct-v0.1 \
--enforce-eager \
--dtype=float \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size=4
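Once the container above is up, a minimal sanity check against its OpenAI-compatible endpoint might look like this (a sketch, not part of the PR: the host/port match the docker run flags above and the model name must match --model; the actual network call is commented out so the snippet stands alone):

```python
import json
import urllib.request

# Endpoint and model taken from the docker run command above.
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "prompt": "San Francisco is a",
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment with the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
print(req.full_url)
```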
@youkaichao As I see it, pypi/support#3792 has been approved. Is it possible to merge this PR now?
(From Release Tracker)
https://github.com/vllm-project/vllm/pull/4409 might need a little bit more discussion given what features are supported for Pascal GPUs and whether building from source might be a better option.
I've been using vLLM on my P40s every day for almost a month now, and everything works fine. Triton didn't accept one of my patches (they said they dropped support for pre-A100 GPUs, so I expect there will soon be problems with other older architectures as well), so things that depend on Triton and use the tl.dot operation won't work (prefix caching, for example). However, there is a patched Triton (sasha0552/triton), and installing just the patched Triton is easier than installing both a patched Triton and a patched vLLM, especially since the basic functionality works fine without Triton.
Maybe the patched Triton could be shipped like nccl (although not installed by default)? The patch is very simple, and I don't think it would be hard to maintain. I can maintain support for Pascal GPUs if needed (I'm not going to move on from these GPUs until better options become available for the price per GB of VRAM).
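For illustration, the feature gating described above (tl.dot-based features unavailable on stock Triton for older GPUs) could be sketched like this; the function names are hypothetical, and the >= sm_80 cutoff is an assumption based on the "pre-A100" remark:

```python
# Hypothetical sketch of gating Triton-dependent features on compute
# capability. Per the comment above, upstream Triton dropped pre-A100
# support, so tl.dot-based features are assumed here to need >= sm_80
# (unless a patched Triton such as sasha0552/triton is installed).

def triton_dot_available(major: int, minor: int,
                         patched_triton: bool = False) -> bool:
    return patched_triton or (major, minor) >= (8, 0)

def prefix_caching_available(major: int, minor: int,
                             patched_triton: bool = False) -> bool:
    # Prefix caching relies on tl.dot, so it inherits the same gate.
    return triton_dot_available(major, minor, patched_triton)

print(prefix_caching_available(6, 1))                       # P40: False
print(prefix_caching_available(6, 1, patched_triton=True))  # True
print(prefix_caching_available(8, 0))                       # A100: True
```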
P.S. Whoever is reading this, you might want to check out my project, which has pre-built vllm and triton wheels for Pascal GPUs (and also patches & build scripts).
Does this mean I can't run vllm on a Tesla P4, even a small model?
@AslanEZ I believe the P4 has a compute capability of 6.1, which this PR adds. Have you tested it?
I have tested it by installing with pip. It didn't work.
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/xformers.py", line 323, in forward
[rank0]:     output[num_prefill_tokens:] = PagedAttention.forward_decode(
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
I intend to try your code now.
Oh, it works! Thank you!
Could we get an update on the status of this PR? I've been eagerly awaiting it, as I can't use vllm until it supports my hardware.
@dirkson it was answered here https://github.com/vllm-project/vllm/issues/6434#issuecomment-2231636764
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Not stale. Also, this PR only increases the wheel size by 10 MB, so please consider it.
Wanted to express interest in pascal support too - thank you all for all your work on these projects
Tested the Docker image provided, and this finally made it work with some of my older GPUs.
+1
Yes still relevant - ~~Where are these pre-built docker images that are mentioned?~~ They are here if people missed it (like me): https://github.com/sasha0552/pascal-pkgs-ci
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jasonacox.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
I really have no idea how to fix this. Any suggestions?
Don't worry about this; a committer can fix it directly.
Closing based on the stance taken in https://github.com/vllm-project/vllm/issues/6434#issuecomment-2231636764