
Can't run on DGX Spark - flash-attn issues

Open • letsrock85 opened this issue 1 month ago • 3 comments

I’m trying to run nano-vLLM on a DGX Spark box. Installing nano-vLLM itself is fine, but its hard dependency on flash-attn is the blocker on this hardware/stack.

Environment: DGX Spark, Ubuntu 24.04.3 LTS, Python 3.12.3, NVIDIA GB10 GPU, driver 580.95.05, CUDA 13.0, compute capability 12.1, PyTorch 2.9.0+cu130, Triton 3.5.0.
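
For reference, here is the kind of quick check I run on the box to confirm the stack (an illustrative snippet, not part of nano-vLLM):

# Environment sanity check (illustrative only).
import torch
import triton

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("triton:", triton.__version__)
print("gpu:", torch.cuda.get_device_name(0),
      "| compute capability:", torch.cuda.get_device_capability(0))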

Chronology and symptoms: Installing nano-vLLM pulls in flash-attn. With build isolation, flash-attn’s setup can’t see torch (“No module named 'torch'”). After preinstalling torch (the CUDA 13.0 wheel) and the dev headers, a source build still fails or hangs, and we eventually hit PTX/arch issues. At runtime, Triton/ptxas dies with “Value 'sm_121a' is not defined for option 'gpu-name'”, and PyTorch warns that it officially supports only up to compute capability 12.0 while the GPU is 12.1.

I tried PIP_NO_BUILD_ISOLATION=1, building flash-attn from source, and then removing flash-attn from pyproject.toml so nano-vLLM would install, plus adding fallbacks to torch.nn.functional.scaled_dot_product_attention and replacing the Triton KV-cache kernels with torch loops (rough sketches of both are below). Those hacks got us past import and partway through prefill, but decode crashes on API/shape gaps: an unexpected block_table kwarg, tuple-vs-tensor mismatches, KV-cache shape/concat mismatches, and eventually Triton PTX errors whenever any kernel path slips in.
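
To make the fallbacks concrete, this is roughly the shape of what I patched in. Both are illustrative sketches only; the function names and the flash_attn_varlen_func-style signature are my assumptions, not nano-vLLM's actual code.

# Sketch of an SDPA stand-in for a flash_attn_varlen_func-style call.
# Assumes packed q/k/v of shape (total_tokens, num_heads, head_dim) and
# equal q/k head counts (no GQA handling here).
import torch
import torch.nn.functional as F

def sdpa_varlen_fallback(q, k, v, cu_seqlens_q, cu_seqlens_k,
                         max_seqlen_q, max_seqlen_k,
                         softmax_scale=None, causal=True):
    out = torch.empty_like(q)
    for i in range(cu_seqlens_q.numel() - 1):
        qs, qe = cu_seqlens_q[i].item(), cu_seqlens_q[i + 1].item()
        ks, ke = cu_seqlens_k[i].item(), cu_seqlens_k[i + 1].item()
        # SDPA expects (batch, heads, seq, head_dim)
        qi = q[qs:qe].transpose(0, 1).unsqueeze(0)
        ki = k[ks:ke].transpose(0, 1).unsqueeze(0)
        vi = v[ks:ke].transpose(0, 1).unsqueeze(0)
        oi = F.scaled_dot_product_attention(
            qi, ki, vi, is_causal=causal, scale=softmax_scale)
        out[qs:qe] = oi.squeeze(0).transpose(0, 1)
    return out

# Sketch of the torch-loop replacement for the KV-cache write, assuming the
# cache can be viewed as (num_slots, num_kv_heads, head_dim) and slot_mapping
# holds one flat slot index per token; the layout and names are assumptions.
def store_kvcache_fallback(key, value, k_cache, v_cache, slot_mapping):
    k_flat = k_cache.view(-1, *key.shape[1:])
    v_flat = v_cache.view(-1, *value.shape[1:])
    for i, slot in enumerate(slot_mapping.tolist()):
        k_flat[slot] = key[i]
        v_flat[slot] = value[i]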

Bottom line: on this DGX Spark (CUDA 13.0, cc 12.1), the current flash-attn/Triton toolchain isn’t compatible, and nano-vLLM’s reliance on them makes it brittle here. Either flash-attn needs official builds for cc 12.1 (and Triton/PTX support for sm_121), or nano-vLLM needs a robust, tested “no flash-attn / no Triton” path (API-compatible fallbacks for varlen attention and the KV cache). Until then, nano-vLLM won’t run reliably on this GPU.

letsrock85 • Nov 10 '25 22:11

Triton/ptxas dies: “Value 'sm_121a' is not defined for option 'gpu-name'”

@letsrock85 I tested on DGX and the code runs normally. For the ptxas problem, please refer to https://github.com/triton-lang/triton/issues/8539

gbdjxgp • Nov 11 '25 08:11

Triton/ptxas dies: “Value 'sm_121a' is not defined for option 'gpu-name'”

@letsrock85 I tested on DGX and the code runs normally. For the ptxas problem, please refer to triton-lang/triton#8539

Interesting. I know it’s a bit fiddly to ask, but could you give me a short rundown of how you installed and ran nano‑vllm? It would really help me out.

letsrock85 • Nov 11 '25 19:11

Just build flash-attn from source (https://github.com/Dao-AILab/flash-attention), using PyTorch 2.9.x+cu130.

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0

uv venv
git clone https://github.com/GeeeekExplorer/nano-vllm.git nano-vllm-dev/
cd nano-vllm-dev/
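# install the cu130 torch wheel first so the flash-attn source build can import torch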
uv pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
uv pip install -v -e .
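# drop the flash-attn that nano-vllm's install pulled in; it is rebuilt from source below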
uv pip uninstall flash-attn
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/
uv pip install setuptools psutil
# build flash-attn, may take 2~3 hours
uv pip install -v . --no-build-isolation
cd ..


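# TRITON_PTXAS_PATH points Triton at the system CUDA 13 ptxas, which knows sm_121a (see triton-lang/triton#8539)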
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas python example.py 
/home/xxx/codes/nano-vllm-dev/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:283: UserWarning: 
    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    
  warnings.warn(
`torch_dtype` is deprecated! Use `dtype` instead!
Generating: 100%|██████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.19s/it, Prefill=6tok/s, Decode=82tok/s]


Prompt: '<|im_start|>user\nintroduce yourself<|im_end|>\n<|im_start|>assistant\n'
Completion: '<think>\nOkay, the user wants me to introduce myself. I should start by confirming my name, maybe say something like "My name is [Your Name]." Then, I need to mention my role or what I do. It\'s important to keep it friendly and open-ended so the user feels comfortable. I should avoid making it too technical. Let me think of a good opening line. Maybe something like, "I\'m [Your Name], and I\'m here to help you with [relevant topics]."\n\nWait, I should also add a bit about my experience. Maybe mention that I\'m a [role] with [specific experience]. But I need to keep it concise. Also, make sure to offer assistance so the user knows they can ask questions. Let me check if I\'m on the right track. Yeah, that should work. Alright, time to put it all together.\n</think>\n\nHello! I\'m [Your Name], and I\'m here to help you with a wide range of questions and topics. What would you like to discuss today? 😊<|im_end|>'


Prompt: '<|im_start|>user\nlist all prime numbers within 100<|im_end|>\n<|im_start|>assistant\n'
Completion: "<think>\nOkay, so I need to list all the prime numbers between 100. Let me think about how to approach this. First, I remember that a prime number is a number greater than 1 that has no positive divisors other than 1 and itself. So, I need to check each number starting from 100 upwards and see if it's prime.\n\nStarting with 100. Let me check if 100 is prime. Well, 100 is even, so it's divisible by 2. Since 2 is a prime number, 100 is not prime. That's the first one.\n\nNext, 101. Hmm, 101 is a prime number. Let me verify. To check if 101 is prime, I can try dividing it by primes less than its square root. The square root of 101 is approximately 10.05, so I need to check primes up to 10. Let's see: 101 divided by 2 is 50.5, not an integer. Divided by 3? 101 divided by 3 is about 33.666,"

gbdjxgp • Nov 13 '25 03:11