
Add SM120 to the Dockerfile

Open mgoin opened this issue 6 months ago • 9 comments

Now that https://github.com/vllm-project/vllm/pull/19336 has landed, maybe we can add SM 12.0 without going over the 400MB wheel limit

EDIT: The wheel is 365MB!

mgoin avatar Jun 18 '25 07:06 mgoin

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Jun 18 '25 07:06 github-actions[bot]

What's the new wheel size? :-)

houseroad avatar Jun 18 '25 08:06 houseroad

The wheel is 365MB!

mgoin avatar Jun 18 '25 09:06 mgoin

The wheel is 365MB!

Sounds awesome! I'll try to confirm. Currently building the whole thing on my desktop, it'll take a while:

~/vllm$ git status
On branch neuralmagic-add-sm120-dockerfile
Your branch is up to date with 'neuralmagic/add-sm120-dockerfile'.
nothing to commit, working tree clean

~/vllm$ git log -1
commit f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1 (HEAD -> neuralmagic-add-sm120-dockerfile, neuralmagic/add-sm120-dockerfile)
Author: Michael Goin <[email protected]>
Date:   Thu Jun 19 01:21:08 2025 +0900

    Update Dockerfile

~/vllm$ DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=5 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --tag wurstdeploy/vllm:wheel-stage \
  --target build \
  --progress plain \
  -f docker/Dockerfile .

# edit:   --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0' is not needed anymore of course
# then extract the wheel from the build stage, check size, and build image via target vllm-openai
  • ~❌ confirm the new wheel size of 365MB edit: nope, the new wheel size is 832.61 MiB when building for the new default arch list (same as --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'), see this comment~ edit: ✅ confirmed, see https://github.com/vllm-project/vllm/pull/19794#issuecomment-2987394903

  • ✅ confirm SM 120 compatibility (for FlashInfer, too). edit: probably needs huydhn's rebuilt wheel for the new arch list. edit: yes, otherwise I get the error

    RuntimeError: TopKMaskLogits failed with error code no kernel image is available for execution on the device

    edit: tested on RTX 5090, it works now with the new FlashInfer wheel; a minimal request that exercises this sampling path is sketched below
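For reference, here is a minimal request that exercises the FlashInfer top-k/top-p sampling path where the error above surfaced (a sketch only; the model name, port, and sampling values are examples taken from later in this thread, and assume a server is already running):

curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen/Qwen3-0.6B",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 16, "top_k": 20, "top_p": 0.95}'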

cyril23 avatar Jun 18 '25 23:06 cyril23

The wheel is 365MB!

Do you mean for SM 120 (torch_cuda_arch_list='12.0') only? What have you tested exactly?

  • I am sorry but the wheel size for ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0' (your new default) is 832.61 MiB which is still too big. I've built it based on your branch, with default settings, see https://github.com/vllm-project/vllm/pull/19794#issuecomment-2986042680
  • Output
#23 DONE 23847.4s

#24 [build 7/8] COPY .buildkite/check-wheel-size.py check-wheel-size.py
#24 DONE 0.0s

#25 [build 8/8] RUN if [ "true" = "true" ]; then         python3 check-wheel-size.py dist;     else         echo "Skipping wheel size check.";     fi
#25 0.274 Not allowed: Wheel dist/vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl is larger (832.61 MB) than the limit (400 MB).
#25 0.274 vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so: 1882.29 MBs uncompressed.
#25 0.274 vllm/_C.abi3.so: 752.47 MBs uncompressed.
#25 0.274 vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so: 216.57 MBs uncompressed.
#25 0.274 vllm/_moe_C.abi3.so: 164.88 MBs uncompressed.
#25 0.274 vllm/_flashmla_C.abi3.so: 4.89 MBs uncompressed.
#25 0.274 vllm/third_party/pynvml.py: 0.22 MBs uncompressed.
#25 0.274 vllm/config.py: 0.20 MBs uncompressed.
#25 0.274 vllm-0.9.2.dev139+gf3bddb6d6.dist-info/RECORD: 0.14 MBs uncompressed.
#25 0.274 vllm/distributed/kv_transfer/disagg_prefill_workflow.jpg: 0.14 MBs uncompressed.
#25 0.274 vllm/v1/worker/gpu_model_runner.py: 0.10 MBs uncompressed.
#25 ERROR: process "/bin/sh -c if [ \"$RUN_WHEEL_CHECK\" = \"true\" ]; then         python3 check-wheel-size.py dist;     else         echo \"Skipping wheel size check.\";     fi" did not complete successfully: exit code: 1
------
 > [build 8/8] RUN if [ "true" = "true" ]; then         python3 check-wheel-size.py dist;     else         echo "Skipping wheel size check.";     fi:
0.274 vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so: 1882.29 MBs uncompressed.
0.274 vllm/_C.abi3.so: 752.47 MBs uncompressed.
0.274 vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so: 216.57 MBs uncompressed.
0.274 vllm/_moe_C.abi3.so: 164.88 MBs uncompressed.
0.274 vllm/_flashmla_C.abi3.so: 4.89 MBs uncompressed.
0.274 vllm/third_party/pynvml.py: 0.22 MBs uncompressed.
0.274 vllm/config.py: 0.20 MBs uncompressed.
0.274 vllm-0.9.2.dev139+gf3bddb6d6.dist-info/RECORD: 0.14 MBs uncompressed.
0.274 vllm/distributed/kv_transfer/disagg_prefill_workflow.jpg: 0.14 MBs uncompressed.
0.274 vllm/v1/worker/gpu_model_runner.py: 0.10 MBs uncompressed.
------
Dockerfile:155
--------------------
 154 |     ARG RUN_WHEEL_CHECK=true
 155 | >>> RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \
 156 | >>>         python3 check-wheel-size.py dist; \
 157 | >>>     else \
 158 | >>>         echo "Skipping wheel size check."; \
 159 | >>>     fi
 160 |     #################### EXTENSION Build IMAGE ####################
--------------------
ERROR: failed to solve: process "/bin/sh -c if [ \"$RUN_WHEEL_CHECK\" = \"true\" ]; then         python3 check-wheel-size.py dist;     else         echo \"Skipping wheel size check.\";     fi" did not complete successfully: exit code: 1
  • Running the command again, this time with the wheel-size check disabled, so that I can extract the wheel and finish the build image:
DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=5 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg RUN_WHEEL_CHECK=false \
  --tag wurstdeploy/vllm:wheel-stage \
  --target build \
  --progress plain \
  -f docker/Dockerfile .
sudo docker create --name temp-wheel-container wurstdeploy/vllm:wheel-stage
sudo docker cp temp-wheel-container:/workspace/dist ./extracted-wheels
sudo docker rm temp-wheel-container
ls -la extracted-wheels/
# output:
total 852604
drwxr-xr-x  2 root     root          4096 Jun 19 08:35 .
drwxr-xr-x 16 freeuser freeuser      4096 Jun 19 08:50 ..
-rw-r--r--  1 root     root     873053002 Jun 19 08:36 vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl
# That's 873 MB or 832.61 MiB
  • I've uploaded wurstdeploy/vllm:wheel-stage to Docker Hub so you can take a look at it yourself.
  • the old wheel size (before https://github.com/vllm-project/vllm/pull/19336 landed) was 922.18 MB when compiling with ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0+PTX 10.1 12.0 12.1', so at least the wheel size has shrunk moderately, see https://github.com/vllm-project/vllm/issues/13306#issuecomment-2940556134

Unfortunately we still can't update the Dockerfile defaults to include SM120 without touching anything else, because the new arch list would also apply to building the CUDA 12.8 wheel here, and PyPI's current limit of 400 MB is too low (even increasing it to 800 MB would not be enough). How could we solve this problem?

  1. Either we keep your changes to the main Dockerfile as in this PR, but build for specific architectures within the Build wheel - CUDA 12.8 step here:
     1.1 Either by adding --build-arg torch_cuda_arch_list='12.0' (I haven't confirmed your 365MB yet when building 12.0 only) to make an SM120-only build, incompatible with all older architectures such as SM 100 Blackwell and older.
     1.2 Or by adding the old default --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0' (with or without PTX, it does not matter): the CUDA 12.8 wheel would still be incompatible with SM 120 Blackwell but would work for SM 100 Blackwell and all older generations, i.e. just like the current wheel.
  2. Or we do not update the main Dockerfile, but explicitly add something like --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.6 8.9 9.0 10.0 12.0' --build-arg RUN_WHEEL_CHECK=false to the Docker Build release image step here, which was the idea of my PR https://github.com/vllm-project/vllm/pull/19747

I prefer solution 1.2. What do you guys think? @mgoin
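For concreteness, option 1.2 would roughly amount to pinning the arch list in the CUDA 12.8 wheel build step to the old default while the Dockerfile default stays at the new list (a sketch only; the actual Buildkite step arguments are not reproduced here):

DOCKER_BUILDKIT=1 docker build \
  --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0' \
  --build-arg CUDA_VERSION=12.8.1 \
  --target build \
  -f docker/Dockerfile .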

cyril23 avatar Jun 19 '25 07:06 cyril23

Hey @cyril23 thanks for the concern but the "build image" job in CI succeeds. This is the source of truth for wheel size and is now building for '7.0 7.5 8.0 8.9 9.0 10.0 12.0': https://buildkite.com/vllm/ci/builds/22282/summary/annotations?jid=019783d9-406d-409e-8a20-5313f098957a#019783d9-406d-409e-8a20-5313f098957a/6-4367

I think you aren't building the image the "right way" if you are getting such a large wheel size. Perhaps you are building with Debug information rather than a proper Release build like we use for CI and release?

mgoin avatar Jun 19 '25 08:06 mgoin

My wheels are bigger because I build with USE_SCCACHE=0 and thus without CMAKE_BUILD_TYPE=Release, which leaves debug symbols etc. in the binaries.

I think you aren't building the image the "right way" if you are getting such a large wheel size. Perhaps you are building with Debug information rather than a proper Release build like we use for CI and release?

I wish I had built it the wrong way, so we could just merge this PR. I built it as shown here, which gave me an 832.61 MiB wheel.

~/vllm$ DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=5 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --tag wurstdeploy/vllm:wheel-stage \
  --target build \
  --progress plain \
  -f docker/Dockerfile .

Now I've just tried building again for SM 120 only:

# on Azure Standard E96s v6 (96 vcpus, 768 GiB memory); actually used Max: 291289 MiB RAM
DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=384 \
  --build-arg nvcc_threads=4 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg torch_cuda_arch_list='12.0' \
  --tag wurstdeploy/vllm:wheel-stage-120only \
  --target build \
  --progress plain \
  -f docker/Dockerfile .

Result:

#24 [build 8/8] RUN if [ "true" = "true" ]; then         python3 check-wheel-size.py dist;     else         echo "Skipping wheel size check.";     fi
#24 0.251 Not allowed: Wheel dist/vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl is larger (558.31 MB) than the limit (400 MB).
#24 0.251 vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so: 1504.86 MBs uncompressed.
#24 0.251 vllm/_C.abi3.so: 297.77 MBs uncompressed.
#24 0.251 vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so: 216.57 MBs uncompressed.
#24 0.251 vllm/_moe_C.abi3.so: 95.23 MBs uncompressed.
#24 0.251 vllm/third_party/pynvml.py: 0.22 MBs uncompressed.
#24 0.251 vllm/config.py: 0.20 MBs uncompressed.
#24 0.251 vllm-0.9.2.dev139+gf3bddb6d6.dist-info/RECORD: 0.14 MBs uncompressed.
#24 0.251 vllm/distributed/kv_transfer/disagg_prefill_workflow.jpg: 0.14 MBs uncompressed.
#24 0.251 vllm/v1/worker/gpu_model_runner.py: 0.10 MBs uncompressed.
#24 0.251 vllm/worker/hpu_model_runner.py: 0.10 MBs uncompressed.
#24 ERROR: process "/bin/sh -c if [ \"$RUN_WHEEL_CHECK\" = \"true\" ]; then         python3 check-wheel-size.py dist;     else         echo \"Skipping wheel size check.\";     fi" did not complete successfully: exit code: 1

After extracting the wheels:

azureuser@building:~/vllm$ ls -la extracted-wheels/
total 571720
drwxr-xr-x  2 root      root           4096 Jun 19 08:09 .
drwxrwxr-x 16 azureuser azureuser      4096 Jun 19 08:14 ..
-rw-r--r--  1 root      root      585426919 Jun 19 08:10 vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl
# that's 558 MB or 558.31 MiB

I am not sure what https://buildkite.com/vllm/ci/builds/22282/summary/annotations?jid=019783d9-406d-409e-8a20-5313f098957a#019783d9-406d-409e-8a20-5313f098957a/6-4367 did differently; they build the "test" target.

Anyway, as long as it works on Buildkite I am happy! I would love to understand the differences though.

edit: this is what buildkite did:

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7

#!/bin/bash
if [[ -z $(docker manifest inspect public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1) ]]; then
  echo "Image not found, proceeding with build..."
else
  echo "Image found"
  exit 0
fi

docker build --file docker/Dockerfile \
  --build-arg max_jobs=16 \
  --build-arg buildkite_commit=f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1 \
  --build-arg USE_SCCACHE=1 \
  --tag public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1 \
  --target test --progress plain .

docker push public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1

edit: the differences:

  • I use a different number of max_jobs, which shouldn't affect the wheel size
  • I used USE_SCCACHE=0 instead of 1 as on Buildkite. Can this affect the wheel size? YES, thanks Gemini:

The difference in wheel size between your local build and the Buildkite build is most likely due to the USE_SCCACHE build argument and its effect on the build type.

Here's a breakdown of why this is happening:

The root cause: in docker/Dockerfile, the USE_SCCACHE argument controls which build path is taken. When USE_SCCACHE is set to 1 (as it is in the Buildkite CI), the build command also sets CMAKE_BUILD_TYPE=Release:

# docker/Dockerfile

...
RUN --mount=type=bind,source=.git,target=.git \
    if [ "$USE_SCCACHE" = "1" ]; then \
        echo "Installing sccache..." \
...
        && export CMAKE_BUILD_TYPE=Release \
        && sccache --show-stats \
        && python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \
...
    fi
...

However, when USE_SCCACHE is not 1 (you are setting it to 0), the other build path is taken, and CMAKE_BUILD_TYPE is not set:

# docker/Dockerfile

...
ENV CCACHE_DIR=/root/.cache/ccache
RUN --mount=type=cache,target=/root/.cache/ccache \
    --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=.git,target=.git  \
    if [ "$USE_SCCACHE" != "1" ]; then \
        # Clean any existing CMake artifacts
        rm -rf .deps && \
        mkdir -p .deps && \
        python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
    fi
...

When CMAKE_BUILD_TYPE is not explicitly set, CMake often defaults to a Debug build, which includes debugging symbols and is not optimized for size. This is why your locally built wheel is so much larger. The huge size of the .so files in your output is a strong indicator of this.

You didn't accidentally set any debug flags; you accidentally missed setting the release flag!

How to fix it: you have two options.

  1. Set USE_SCCACHE in your build command:

The easiest solution is to mimic the CI environment by setting --build-arg USE_SCCACHE=1 in your docker build command. This will ensure that CMAKE_BUILD_TYPE=Release is set.

DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=5 \
  --build-arg USE_SCCACHE=1 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --tag wurstdeploy/vllm:wheel-stage \
  --target build \
  --progress plain \
  -f docker/Dockerfile .
  2. Modify the Dockerfile:

If you prefer to build without sccache locally, you can modify the Dockerfile to set the CMAKE_BUILD_TYPE for both build paths. This would make local builds more consistent with CI builds, regardless of the USE_SCCACHE setting.

Here is a diff of the proposed change:

--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -141,6 +141,7 @@
     if [ "$USE_SCCACHE" != "1" ]; then \
         # Clean any existing CMake artifacts
         rm -rf .deps && \
+        export CMAKE_BUILD_TYPE=Release && \
         mkdir -p .deps && \
         python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
     fi

By making one of these changes, you should see your wheel size decrease significantly and fall within the acceptable range.

edit: I'll propose a new ARG CMAKE_BUILD_TYPE=Release build argument in a separate issue, to allow creating a Release-type build even without using sccache.
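For completeness, here is a quick way to check the debug-symbol explanation against the oversized wheel (a sketch; the wheel name and unpack directory are just examples matching the extracted-wheels/ listing above):

# A wheel is a zip archive, so unzip it and inspect the largest shared object.
unzip -o extracted-wheels/vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl -d /tmp/vllm-wheel
file /tmp/vllm-wheel/vllm/_C.abi3.so                            # "not stripped" hints at debug info
readelf -S /tmp/vllm-wheel/vllm/_C.abi3.so | grep -c '\.debug'  # non-zero => .debug_* sections present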

cyril23 avatar Jun 19 '25 08:06 cyril23

The wheel is 365MB!

Now I've verified that using CMAKE_BUILD_TYPE=Release with the default arches indeed results in a 382 MB file (365.10 MiB), i.e. exactly as in the Buildkite run https://buildkite.com/vllm/ci/builds/22282/summary/annotations?jid=019783d9-406d-409e-8a20-5313f098957a#019783d9-406d-409e-8a20-5313f098957a/6-4367

azureuser@building:~/vllm$ ls -la extracted-wheels/
total 373876
drwxr-xr-x  2 root      root           4096 Jun 19 09:20 .
drwxrwxr-x 16 azureuser azureuser      4096 Jun 19 09:22 ..
-rw-r--r--  1 root      root      382836018 Jun 19 09:21 vllm-0.9.2.dev139+gf3bddb6d6.d20250619-cp38-abi3-linux_x86_64.whl
azureuser@building:~/vllm$

By the way, I've further tested that using CMAKE_BUILD_TYPE=Release for SM 120 only (--build-arg torch_cuda_arch_list='12.0') results in a much smaller wheel of 167 MB (159.34 MiB).

azureuser@building:~/vllm/extracted-wheels$ ls -la
total 163172
drwxr-xr-x  2 root      root           4096 Jun 19 09:00 .
drwxrwxr-x 16 azureuser azureuser      4096 Jun 19 09:02 ..
-rw-r--r--  1 root      root      167077290 Jun 19 09:01 vllm-0.9.2.dev139+gf3bddb6d6.d20250619-cp38-abi3-linux_x86_64.whl
azureuser@building:~/vllm/extracted-wheels$

In order to test it without using SCCACHE I've modified my Dockerfile as follows (I'll make an issue about it):

diff --git a/docker/Dockerfile b/docker/Dockerfile
index 8d4375470..ae866edd0 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -112,6 +112,7 @@ ENV MAX_JOBS=${max_jobs}
 ARG nvcc_threads=8
 ENV NVCC_THREADS=$nvcc_threads

+ARG CMAKE_BUILD_TYPE=Release
 ARG USE_SCCACHE
 ARG SCCACHE_BUCKET_NAME=vllm-build-sccache
 ARG SCCACHE_REGION_NAME=us-west-2
@@ -129,7 +130,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
         && export SCCACHE_REGION=${SCCACHE_REGION_NAME} \
         && export SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS} \
         && export SCCACHE_IDLE_TIMEOUT=0 \
-        && export CMAKE_BUILD_TYPE=Release \
+        && export CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} \
         && sccache --show-stats \
         && python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \
         && sccache --show-stats; \
@@ -143,6 +144,7 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
         # Clean any existing CMake artifacts
         rm -rf .deps && \
         mkdir -p .deps && \
+        export CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} && \
         python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
     fi

So let's merge! 👍

cyril23 avatar Jun 19 '25 09:06 cyril23

With the new FlashInfer wheel, I've tried it out on an RTX 5090 (built just with torch_cuda_arch_list='12.0' and CMAKE_BUILD_TYPE=Release) and inference works without a problem.

edit: by the way, the wheel size is pretty much the same as with the old FlashInfer version (compared to https://github.com/vllm-project/vllm/pull/19794#issuecomment-2987394903)

~/vllm$ ls -la extracted-wheels/
total 163192
drwxr-xr-x  2 root     root          4096 Jun 20 10:10 .
drwxr-xr-x 16 freeuser freeuser      4096 Jun 20 10:16 ..
-rw-r--r--  1 root     root     167097574 Jun 20 10:10 vllm-0.9.2.dev182+g47c454049.d20250620-cp38-abi3-linux_x86_64.whl

cyril23 avatar Jun 20 '25 09:06 cyril23

1.2 Or by adding the old default --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0' (with or without PTX, it does not matter): the CUDA 12.8 wheel would still be incompatible with SM 120 Blackwell but would work for SM 100 Blackwell and all older generations, i.e. just like the current wheel.

@cyril23 could you please provide context as to why that would be the case regarding PTX?

If you compile CUDA kernels with PTX, PTX built for an earlier Compute Capability (CC) can be JIT-compiled by newer GPUs, so they could use that.

There would be some overhead (at least on the first run) as the PTX is compiled to cubin at runtime, and not targeting the newer CC of that GPU is less optimal (the perf impact varies), but it should still work.

The only time this doesn't really work out is when the PTX is built with a newer version of CUDA than the runtime uses.


The builder image uses nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 (with a CUDA_VERSION ARG of 12.8.1); that being as high as it is would prevent the PTX from being compatible if your runtime was using CUDA 12.8.0.

Beyond that, since you're also relying on PyTorch, which bundles its own CUDA libraries, each of those libraries also ships its own embedded PTX/cubin depending on the CUDA release used there. If they lack an sm_120 cubin, or any of their PTX was built with the CUDA compatibility issue mentioned above, then you'd have no valid GPU kernels to load.


You've not mentioned what version of CUDA you're using at runtime, but it's possible that the compatibility issue was related to these caveats I've described.

CUDA 12.8.0 can target sm_120, and the existing CC 10.0 PTX should have been compatible, so the only scenario that comes to mind is CUDA 12.8.1 being used in the builder image while your runtime might still have been on CUDA 12.8.0?

$ docker run --rm -it nvidia/cuda:12.8.0-devel-ubuntu24.04

$ nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90
compute_100
compute_101
compute_120

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

If there is some other compatibility caveat, I'd appreciate more details, as --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0' with PTX should have otherwise worked 🤔
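One way to check what a built extension actually embeds (a sketch; the install path is an assumption based on the dist-packages layout seen elsewhere in this thread, adjust as needed):

SO=/usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so   # assumed path inside the image
cuobjdump --list-elf "$SO" | grep -oE 'sm_[0-9]+a?' | sort | uniq -c   # embedded cubins per arch
cuobjdump --list-ptx "$SO" | grep -oE 'sm_[0-9]+a?' | sort | uniq -c   # embedded PTX (JIT fallback targets)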

polarathene avatar Jun 22 '25 06:06 polarathene

This is not advisable btw:

# Workaround for https://github.com/openai/triton/issues/2507 and
# https://github.com/pytorch/pytorch/issues/107960 -- hopefully
# this won't be needed for future versions of this docker image
# or future versions of triton.
RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/compat/

That updates /etc/ld.so.cache (equivalent of LD_LIBRARY_PATH) to include this location for libcuda.so.1, and also creates a libcuda.so symlink.

Instead you can replace compat/ with lib64/stubs which should have a libcuda.so file if needed for linking. This is only present in the devel image as it's only relevant to building. At runtime a proper libcuda.so should be provided.


If you try to use the image for runtime purposes with that, and the compat version of libcuda.so.1 is used instead of the one from your actual driver, this can introduce issues like the CUDA device not being detected.

These compat packages are not intended to be used with newer versions of CUDA; you can't use CUDA 12.9 on the host and swap in an earlier CUDA 12.8 compat package.
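A hedged sketch of that suggestion, mirroring the existing Dockerfile line but pointing at the stubs directory instead (the exact stubs path may vary per CUDA image; this is only relevant at link time in the -devel stage):

ldconfig /usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/lib64/stubs/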

polarathene avatar Jun 22 '25 06:06 polarathene

@cyril23 could you please provide context as to why that would be the case regarding PTX?

If you compile CUDA kernels with PTX, any earlier Compute Capability (CC) should be able to be compiled by newer GPUs, and they could use that.

@polarathene You're right that generally it should run with 10.0+PTX (or any older version + PTX). And this is actually the first time I ran it without kernel problems; maybe before I had the wrong CUDA version, or flashinfer was not compatible, or who knows what I did wrong. Anyhow, today I built it from this branch neuralmagic:add-sm120-dockerfile with --build-arg CUDA_VERSION=12.8.1 --build-arg torch_cuda_arch_list='10.0+PTX', and indeed it ran, but somehow produced gibberish.

Build: build-10ptx.log

# building without SCCACHE and non-release, therefore better deactivating wheel check:
DOCKER_BUILDKIT=1 sudo docker build \
 --build-arg max_jobs=6 \
 --build-arg nvcc_threads=1 \
 --build-arg USE_SCCACHE=0 \
 --build-arg GIT_REPO_CHECK=1 \
 --build-arg RUN_WHEEL_CHECK=false \
 --build-arg CUDA_VERSION=12.8.1 \
 --build-arg torch_cuda_arch_list='10.0+PTX' \
 --tag wurstdeploy/vllm:sm100ptxonly \
 --target vllm-openai \
 --progress plain \
 -f docker/Dockerfile .

Run:

~/vllm$ sudo docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface     -p 8000:8000  wurstdeploy/vllm:sm100ptxonly    --model Qwen/Qwen3-0.6B
INFO 06-22 02:53:54 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 02:53:56 [api_server.py:1287] vLLM API server version 0.9.2.dev182+g47c454049
INFO 06-22 02:53:56 [cli_args.py:309] non-default args: {'model': 'Qwen/Qwen3-0.6B'}
INFO 06-22 02:54:06 [config.py:831] This model supports multiple tasks: {'score', 'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 06-22 02:54:06 [config.py:1444] Using max model len 40960
INFO 06-22 02:54:06 [config.py:2197] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 06-22 02:54:09 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 02:54:10 [core.py:459] Waiting for init message from front-end.
INFO 06-22 02:54:10 [core.py:69] Initializing a V1 LLM engine (v0.9.2.dev182+g47c454049) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-0.6B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 06-22 02:54:12 [utils.py:2756] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f9b41cd5310>
INFO 06-22 02:54:12 [parallel_state.py:1072] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 06-22 02:54:12 [interface.py:383] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 06-22 02:54:12 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 06-22 02:54:12 [gpu_model_runner.py:1691] Starting to load model Qwen/Qwen3-0.6B...
INFO 06-22 02:54:12 [gpu_model_runner.py:1696] Loading model from scratch...
INFO 06-22 02:54:12 [cuda.py:270] Using Flash Attention backend on V1 engine.
INFO 06-22 02:54:16 [weight_utils.py:292] Using model weights format ['*.safetensors']
INFO 06-22 02:54:16 [weight_utils.py:345] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.29it/s]

INFO 06-22 02:54:17 [default_loader.py:272] Loading weights took 0.79 seconds
INFO 06-22 02:54:17 [gpu_model_runner.py:1720] Model loading took 1.1201 GiB and 5.040871 seconds
INFO 06-22 02:54:21 [backends.py:508] Using cache directory: /root/.cache/vllm/torch_compile_cache/ef58c0fce0/rank_0_0/backbone for vLLM's torch.compile
INFO 06-22 02:54:21 [backends.py:519] Dynamo bytecode transform time: 3.76 s
INFO 06-22 02:54:23 [backends.py:181] Cache the graph of shape None for later use
INFO 06-22 02:54:36 [backends.py:193] Compiling a graph for general shape takes 14.65 s
INFO 06-22 02:54:47 [monitor.py:34] torch.compile takes 18.40 s in total
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
INFO 06-22 02:54:47 [gpu_worker.py:232] Available KV cache memory: 26.91 GiB
INFO 06-22 02:54:47 [kv_cache_utils.py:716] GPU KV cache size: 251,920 tokens
INFO 06-22 02:54:47 [kv_cache_utils.py:720] Maximum concurrency for 40,960 tokens per request: 6.15x
WARNING 06-22 02:54:47 [utils.py:101] Unable to detect current VLLM config. Defaulting to NHD kv cache layout.
Capturing CUDA graphs: 100%|██████████| 67/67 [00:14<00:00,  4.71it/s]
INFO 06-22 02:55:02 [gpu_model_runner.py:2196] Graph capturing finished in 14 secs, took 0.84 GiB
INFO 06-22 02:55:02 [core.py:172] init engine (profile, create kv cache, warmup model) took 44.51 seconds
INFO 06-22 02:55:02 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 15745
WARNING 06-22 02:55:02 [config.py:1371] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 06-22 02:55:02 [serving_chat.py:118] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 06-22 02:55:03 [serving_completion.py:66] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 06-22 02:55:03 [api_server.py:1349] Starting vLLM API server 0 on http://0.0.0.0:8000
INFO 06-22 02:55:03 [launcher.py:29] Available routes are:
INFO 06-22 02:55:03 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
INFO 06-22 02:55:03 [launcher.py:37] Route: /docs, Methods: GET, HEAD
INFO 06-22 02:55:03 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 06-22 02:55:03 [launcher.py:37] Route: /redoc, Methods: GET, HEAD
INFO 06-22 02:55:03 [launcher.py:37] Route: /health, Methods: GET
INFO 06-22 02:55:03 [launcher.py:37] Route: /load, Methods: GET
INFO 06-22 02:55:03 [launcher.py:37] Route: /ping, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /ping, Methods: GET
INFO 06-22 02:55:03 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 06-22 02:55:03 [launcher.py:37] Route: /version, Methods: GET
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /pooling, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /classify, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /score, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /rerank, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /invocations, Methods: POST
INFO 06-22 02:55:03 [launcher.py:37] Route: /metrics, Methods: GET
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO 06-22 02:55:18 [chat_utils.py:420] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 06-22 02:55:18 [logger.py:43] Received request chatcmpl-1f0f012cce7f4176b9d627ec57efc7a0: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n/no_think What is the capital of France? Tell me 2 sentences about it<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=40924, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-22 02:55:18 [async_llm.py:270] Added request chatcmpl-1f0f012cce7f4176b9d627ec57efc7a0.
INFO 06-22 02:55:43 [loggers.py:118] Engine 000: Avg prompt throughput: 3.6 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 06-22 02:55:53 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     172.17.0.1:35466 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-22 02:56:03 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 06-22 02:56:13 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 06-22 02:57:33 [logger.py:43] Received request chatcmpl-f956c735835145b692f08178d679a17b: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n/no_think What is the capital of France? Tell me 2 sentences about it<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=40924, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-22 02:57:33 [async_llm.py:270] Added request chatcmpl-f956c735835145b692f08178d679a17b.
INFO 06-22 02:57:43 [loggers.py:118] Engine 000: Avg prompt throughput: 3.6 tokens/s, Avg generation throughput: 193.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 44.4%
INFO 06-22 02:57:53 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 194.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 44.4%
INFO 06-22 02:58:03 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 190.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.3%, Prefix cache hit rate: 44.4%
INFO:     172.17.0.1:54410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-22 02:58:13 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 44.4%
INFO 06-22 02:58:23 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 44.4%

Another try with cURL, same prompt, showed the same gibberish.

systeminfo.txt, including nvidia-smi output etc.

By the way similar gibberish when running vLLM with -e VLLM_USE_FLASHINFER_SAMPLER=0: VLLM_USE_FLASHINFER_SAMPLER=0.txt

I've uploaded my build wurstdeploy/vllm:sm100ptxonly to Dockerhub in case you want to test it.

cyril23 avatar Jun 22 '25 10:06 cyril23

indeed it ran, but somehow produced gibberish

I don't own a Blackwell GPU so I cannot test.

I have heard that CUDA 12.8 had some issues compiling properly for Blackwell archs. Since you're on CUDA 12.9 on the host, perhaps try bumping the CUDA version in the Dockerfile stages you're using, or alternatively build only for an earlier arch like Ada (sm_89, RTX 4xxx) with PTX for CC 8.9 (I'm personally more interested in this, to rule out the CUDA 12.8 / CC 10.0 concern).

If you still have issues with either of those, then try bringing the builder's CUDA down to an earlier version (I still see some projects on CUDA 12.2 / 12.4 for their image builds).

It would be helpful information for other projects trying to update their support for Blackwell; I've seen a few other project PRs where there is some reluctance to bump the builder-stage CUDA.

polarathene avatar Jun 22 '25 10:06 polarathene

@cyril23 when you run the container, can you show this output?

# This is a public image from the CI, but you could use your image `wurstdeploy/vllm:sm100ptxonly`:
# (Use --entrypoint to switch to a bash shell in the container instead)
$ docker run --rm -it --runtime nvidia --gpus all --entrypoint bash \
  public.ecr.aws/q9t5s3a7/vllm-release-repo:b6553be1bc75f046b00046a4ad7576364d03c835

# In the container run this command to see which `libcuda.so.1` is resolved:
# (this is the output without `--runtime nvidia --gpus all`, but it shouldn't have been cached)
$ ldconfig -p | grep -F 'libcuda.so'
        libcuda.so.1 (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/local/cuda-12.8/compat/libcuda.so

Does it resolve to that same /usr/local/cuda-12.8/compat like above?

If it does, run the ldconfig command by itself with nothing else after it, and then repeat the same ldconfig -p | grep from above; it should now show a different path to your proper libcuda.so.1, such as:

ldconfig -p | grep -F libcuda.so
        libcuda.so.1 (libc6,x86-64) => /usr/lib64/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/lib64/libcuda.so

If it's already using /usr/lib64 by default, never mind 🤔


I'm also not quite sure why the runtime image is over 11GB (21GB uncompressed)?

There are the CUDA libs from the image itself (/usr/local/cuda, 7GB), plus another copy in the Python packages: /usr/local/lib/python3.12/dist-packages/nvidia is 4GB, and dist-packages is 11GB in total.

I think there are some linking mistakes as a result of this...?

$ ls -lh /usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib
total 857M
-rw-r--r-- 1 root root    0 Jun 10 12:08 __init__.py
-rw-r--r-- 1 root root 111M Jun 10 12:08 libcublas.so.12
-rw-r--r-- 1 root root 745M Jun 10 12:08 libcublasLt.so.12
-rw-r--r-- 1 root root 737K Jun 10 12:08 libnvblas.so.12


# Notice how the library is resolving `libcublasLt.so.12` to the non-local one instead?
$ ldd /usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib/libcublas.so.12
        linux-vdso.so.1 (0x00007ffc0e7af000)
        libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x00007571c6800000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007571ffefb000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007571ffef6000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007571ffef1000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007571ffe0a000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007571ffde8000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007571c65d7000)
        /lib64/ld-linux-x86-64.so.2 (0x00007571fff0a000)


# Different number of cubins for `sm_120`, these two lib copies are not equivalent:
$ cuobjdump --list-elf /usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib/libcublasLt.so.12 | grep sm_120 | wc -l
1380

$ cuobjdump --list-elf /usr/local/cuda/lib64/libcublasLt.so.12 | grep sm_120 | wc -l
1432

I wouldn't be surprised if the above contributes to some of the issues encountered?
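A quick way to see which copy actually gets picked up at runtime (a sketch; assumes torch is importable inside the container and uses glibc's LD_DEBUG tracing):

# Force cuBLAS to load via a small matmul, then grep the loader trace for libcublasLt.
LD_DEBUG=libs python3 -c 'import torch; a=torch.randn(8,8,device="cuda"); print((a@a).sum())' 2>&1 \
  | grep -m2 'cublasLt'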

polarathene avatar Jun 22 '25 12:06 polarathene

@polarathene

In the container run this command to see which libcuda.so.1 is resolved:

~$ sudo docker run --rm -it --runtime nvidia --gpus all --entrypoint bash \
  public.ecr.aws/q9t5s3a7/vllm-release-repo:b6553be1bc75f046b00046a4ad7576364d03c835
root@666374f39152:/vllm-workspace# ldconfig -p | grep -F 'libcuda.so'
        libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
        libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so
root@666374f39152:/vllm-workspace# exit
exit
freeuser@computer:~$ sudo docker run --rm -it --runtime nvidia --gpus all --entrypoint bash   wurstdeploy/vllm:sm100ptxonly
root@0c651f6c9519:/vllm-workspace# ldconfig -p | grep -F 'libcuda.so'
        libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
        libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so
root@0c651f6c9519:/vllm-workspace# exit
exit

I have heard that CUDA 12.8 had some issues with compiling properly for blackwell archs, as you're on CUDA 12.9 on the host,

Here is a test of my wurstdeploy/vllm:sm100ptxonly on simplepod.ai (not affiliated with them, I only use it for testing) with an NVIDIA GeForce RTX 5060 Ti and an older CUDA version, 12.8:

root@rri_UWM3TmWsoh7cshEI:/vllm-workspace# nvidia-smi
Sun Jun 22 05:39:42 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.153.02             Driver Version: 570.153.02     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     On  |   00000000:01:00.0 Off |                  N/A |
| 35%   36C    P1             15W /  180W |   13678MiB /  16311MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             159      C   /usr/bin/python3                      13668MiB |
+-----------------------------------------------------------------------------------------+
root@rri_UWM3TmWsoh7cshEI:/vllm-workspace# 

Similar gibberish output as on the RTX 5090.

or alternatively only build for an earlier arch like Ada (sm_89, RTX 4xxx) with PTX for CC 8.9 (I'm personally more interested in this to rule out the CUDA 12.8 / CC 10.0 concern).

I'll do a build now with sm 89 + ptx, and test it tonight:

DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=6 \
  --build-arg nvcc_threads=1 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg RUN_WHEEL_CHECK=false \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg torch_cuda_arch_list='8.9+PTX' \
  --tag wurstdeploy/vllm:sm89ptxonly \
  --target vllm-openai  \
  --progress plain \
  -f docker/Dockerfile .

cyril23 avatar Jun 22 '25 12:06 cyril23

my 5090 is really looking forward to this being released :) keep up the good work

Johann-Foerster avatar Jun 22 '25 13:06 Johann-Foerster

I'll do a build now with sm 89 + ptx, and test it tonight:

@polarathene same gibberish with '8.9+PTX' for my RTX 5090, see logs build+run_8.9+PTX.log (inference starts at line 7080)

edit: I've pushed this 8.9+PTX build too: wurstdeploy/vllm:sm89ptxonly, which is built with --build-arg CUDA_VERSION=12.8.1 --build-arg torch_cuda_arch_list='8.9+PTX'.

edit: I've tested wurstdeploy/vllm:sm89ptxonly with an RTX 4060 Ti too, but that host has CUDA 12.7, therefore I get this:

nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.8, please update your driver to a newer version, or use an earlier cuda container: unknown. Please contact with support.

edit: I could try building with --build-arg CUDA_VERSION=12.7 --build-arg torch_cuda_arch_list='8.9+PTX', but I think I need a matching nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 base image, and the closest I can find here is 12.6.3, which should also be fine for Ada Lovelace. Therefore I'll do another build with --build-arg CUDA_VERSION=12.6.3 --build-arg torch_cuda_arch_list='8.9+PTX'

edit: result of --build-arg CUDA_VERSION=12.6.3 --build-arg torch_cuda_arch_list='8.9+PTX': builderror_8.9+PTX_CUDA_12.6.3.log; in short:

#35 10.51 FAILED: batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_pybind.cuda.o
#35 10.51 /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_pybind.cuda.o.d -DTORCH_EXTENSION_NAME=batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /usr/include/python3.12 -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /vllm-workspace/flashinfer/include -isystem /vllm-workspace/flashinfer/csrc -isystem /vllm-workspace/flashinfer/3rdparty/cutlass/include -isystem /vllm-workspace/flashinfer/3rdparty/cutlass/tools/util/include -isystem /vllm-workspace/flashinfer/3rdparty/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_120,code=sm_120 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_90a,code=sm_90a -O3 -std=c++17 --threads=4 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /vllm-workspace/flashinfer/build/aot/generated/batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_pybind.cu -o batch_prefill_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_dtype_idx_i32_head_dim_qk_64_head_dim_vo_64_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False/batch_prefill_jit_pybind.cuda.o
#35 10.51 nvcc fatal   : Unsupported gpu architecture 'compute_100a'
#35 16.82 [18/554] c++ -MMD -MF logging/logging.o.d -DTORCH_EXTENSION_NAME=logging -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -I/vllm-workspace/flashinfer/3rdparty/spdlog/include -I/vllm-workspace/flashinfer/include -isystem /usr/include/python3.12 -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /vllm-workspace/flashinfer/include -isystem /vllm-workspace/flashinfer/csrc -isystem /vllm-workspace/flashinfer/3rdparty/cutlass/include -isystem /vllm-workspace/flashinfer/3rdparty/cutlass/tools/util/include -isystem /vllm-workspace/flashinfer/3rdparty/spdlog/include -fPIC -O3 -std=c++17 -Wno-switch-bool -c /vllm-workspace/flashinfer/csrc/logging.cc -o logging/logging.o
#35 16.82 ninja: build stopped: subcommand failed.

cyril23 avatar Jun 22 '25 17:06 cyril23

@polarathene

I have heard that CUDA 12.8 had some issues with compiling properly for blackwell archs, as you're on CUDA 12.9 on the host, perhaps try version bumping the CUDA version in the Dockerfile stages you're using

I've tried building with --build-arg CUDA_VERSION=12.9.1 --build-arg torch_cuda_arch_list='8.9+PTX' and got a build error with this combination: builderror_8.9+PTX_CUDA_12.6.3.log; in short:

#27 21.90 CMake Error at /usr/local/lib/python3.12/dist-packages/torch/share/cmake/Caffe2/public/cuda.cmake:186 (set_property):
#27 21.90   The link interface of target "torch::nvtoolsext" contains:
#27 21.90
#27 21.90     CUDA::nvToolsExt
#27 21.90
#27 21.90   but the target was not found.

I've tried building with --build-arg CUDA_VERSION=12.9.1 --build-arg torch_cuda_arch_list='10.0+PTX' and got a similar build error: builderror_10.0+PTX_CUDA_12.9.1.txt; in short:

#31 21.92 CMake Error at /usr/local/lib/python3.12/dist-packages/torch/share/cmake/Caffe2/public/cuda.cmake:186 (set_property):
#31 21.92   The link interface of target "torch::nvtoolsext" contains:
#31 21.92
#31 21.92     CUDA::nvToolsExt
#31 21.92
#31 21.92   but the target was not found.

Not sure what else to test or how to modify the Dockerfile accordingly to make it work. But I think we should make a separate issue about that.

Anyway, my takeaway is that building with PTX is not worth it: to support a new GPU generation, too many parameters must align (CUDA toolkit and the host's GPU driver must match, plus PyTorch, probably some libraries and modules, and the CUDA base image), and even then it might produce gibberish output. So I think we should just omit the +PTX flag.

cyril23 avatar Jun 22 '25 19:06 cyril23

root@srv-ia-010:/var/tmp# curl -O https://raw.githubusercontent.com/vllm-project/vllm/2dd24ebe1538be19fd7b3da8d2bfeed45b0955c4/docker/Dockerfile
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16424  100 16424    0     0  46974      0 --:--:-- --:--:-- --:--:-- 46925
root@srv-ia-010:/var/tmp# docker build -t vllm:custom -f Dockerfile .

got this warning:

WARN: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 163)
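The warning is only about keyword casing in a FROM ... as <stage> line. If it bothers you, here is a hedged one-liner to normalize the casing in the downloaded Dockerfile before building (purely cosmetic; the build works either way):

sed -i -E 's/^(FROM[^#]*) as /\1 AS /' Dockerfile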

celsowm avatar Jun 24 '25 21:06 celsowm

It doesn't work with GPT-OSS 20B/120B: even though vLLM now supports these models, Flash Attention 3 does not.

jayadityashah avatar Aug 11 '25 12:08 jayadityashah