Fix crash on prefill-after-preemption: conditionally serialize token_ids

Open Terry-Uv opened this issue 2 months ago • 1 comment

Hi there! This is my first PR on GitHub, so feedback is very welcome. Thanks in advance for the review!

This PR fixes a crash that occurs when a sequence is preempted during decode and later re-enters the prefill path. In that scenario worker ranks hit:

AttributeError: 'Sequence' object has no attribute 'token_ids'

The root cause is that our cross-process Sequence serialization intentionally avoided sending token_ids after decode began, but prefill-after-preemption does need them.
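
For context, here is a minimal sketch of the serialization pattern that produces the crash. This is not the actual nano-vllm code; the attribute names num_completion_tokens and last_token are illustrative assumptions.

class Sequence:
    def __init__(self, token_ids):
        self.token_ids = list(token_ids)
        self.num_prompt_tokens = len(token_ids)
        self.num_completion_tokens = 0
        self.last_token = token_ids[-1]

    def __getstate__(self):
        # To keep cross-process messages small, the full token_ids list is
        # dropped once decode has started; decode only needs the last token.
        state = self.__dict__.copy()
        if self.num_completion_tokens > 0:
            state.pop("token_ids")
        return state

    def __setstate__(self, state):
        # A sequence preempted during decode and later rescheduled for prefill
        # arrives on the worker without token_ids, so the prefill path raises
        # AttributeError: 'Sequence' object has no attribute 'token_ids'.
        self.__dict__.update(state)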

Problem Reproduction

I use the following script to make this failure mode easy to reproduce and to sanity-check multi-GPU runs. It drives a large batch of long prompts to increase the chance of preemption followed by re-prefill.

import time
import argparse
from random import randint, seed
from nanovllm import LLM, SamplingParams

MODEL_PATH = "model_path"

def build_prompts(num_seqs, max_input_len, min_input_len=100, vocab=10000):
    # Random prompt lengths and synthetic token IDs in [0, vocab); long,
    # varied prompts make preemption (and re-prefill) much more likely.
    lens = [randint(min_input_len, max_input_len) for _ in range(num_seqs)]
    return [[randint(0, vocab - 1) for _ in range(L)] for L in lens]

def parse_args():
    p = argparse.ArgumentParser()
    p.add_argument("--model", type=str, default=MODEL_PATH, help="HF model dir (local)")
    p.add_argument("--tp", type=int, default=1, help="tensor parallel size")
    p.add_argument("--num_seqs", type=int, default=256)
    p.add_argument("--max_input_len", type=int, default=4096)
    p.add_argument("--min_input_len", type=int, default=100)
    p.add_argument("--max_output_len", type=int, default=256)
    p.add_argument("--temperature", type=float, default=0.7)
    p.add_argument("--max_model_len", type=int, default=4096)
    p.add_argument("--vocab", type=int, default=10000,
                   help="toy vocab size for synthetic token IDs")
    p.add_argument("--seed", type=int, default=0)
    p.add_argument("--gpu_mem_util", type=float, default=0.9,
                   help="gpu_memory_utilization for KV cache planning")
    p.add_argument("--enforce_eager", action="store_true",
                   help="disable CUDA graph (debug/compat)")
    return p.parse_args()

def main():
    args = parse_args()
    seed(args.seed)

    t0 = time.time()
    llm = LLM(
        args.model,
        tensor_parallel_size=args.tp,
        enforce_eager=args.enforce_eager,
        max_model_len=args.max_model_len,
        gpu_memory_utilization=args.gpu_mem_util,
    )
    init_time = time.time() - t0
    print(f"[init] model={args.model} tp={args.tp} init_time={init_time:.3f}s")

    prompt_token_ids = build_prompts(
        num_seqs=args.num_seqs,
        max_input_len=args.max_input_len,
        min_input_len=args.min_input_len,
        vocab=args.vocab,
    )
    sampling_params = [
        SamplingParams(
            temperature=args.temperature,
            max_tokens=args.max_output_len,
            ignore_eos=True,
        ) for _ in range(args.num_seqs)
    ]

    t0 = time.time()
    llm.generate(prompt_token_ids, sampling_params, use_tqdm=False)
    t1 = time.time()

    total_decode_tokens = args.num_seqs * args.max_output_len
    total_time = t1 - t0
    tokps = total_decode_tokens / total_time if total_time > 0 else float("inf")
    print(f"[throughput] num_seqs={args.num_seqs} "
          f"decode_tokens={total_decode_tokens} time={total_time:.3f}s "
          f"throughput={tokps:.2f} tok/s")

if __name__ == "__main__":
    main()

Save the script as bench_tp.py and run:

python bench_tp.py \
  --num_seqs 256 \
  --tp 2 \
  --max_input_len 2048 \
  --max_output_len 256 \
  --temperature 0.7 \
  --max_model_len 4096

With these settings the run crashes with the AttributeError shown above.

Following the nano-vllm style, I kept the fix minimal and targeted; happy to iterate based on your feedback.
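
To make the fix direction concrete, here is a hedged sketch, not the actual patch: token_ids are dropped only when the sequence is guaranteed to stay on the decode path, so a preempted sequence that must re-run prefill keeps them. The was_preempted flag is hypothetical, and the other field names follow the sketch above.

class Sequence:
    # ... construction and other methods as in the sketch above ...

    def __getstate__(self):
        state = self.__dict__.copy()
        # Only a pure-decode sequence can safely drop token_ids; anything that
        # may re-enter prefill (e.g. after preemption) keeps the full list.
        decode_only = (self.num_completion_tokens > 0
                       and not getattr(self, "was_preempted", False))
        if decode_only:
            state.pop("token_ids", None)
        return state

The trade-off is message size: the more often the full token_ids list is attached, the larger the cross-process payload, which is presumably the overhead mentioned in the follow-up comment below.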

Terry-Uv avatar Sep 15 '25 00:09 Terry-Uv

Wait, I found that this brings too much overhead... I'll try optimizing it.

Terry-Uv avatar Oct 10 '25 00:10 Terry-Uv