nano-vllm
Fix crash on prefill-after-preemption: conditionally serialize token_ids
Hi there! This is my first PR on GitHub, so feedback is very welcome. Thanks in advance for your review!
This PR fixes a crash that occurs when a sequence is preempted during decode and later re-enters the prefill path. In that scenario worker ranks hit:
AttributeError: 'Sequence' object has no attribute 'token_ids'
The root cause is that the cross-process Sequence serialization intentionally stops sending token_ids once decode has begun, but a prefill after preemption still needs them.
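For context, the scheduler process pickles Sequence objects when dispatching work to the worker ranks, and the state hooks only include the full token list before any completion tokens exist. Roughly (simplified from my local checkout; field names may differ), the existing behavior looks like this, which is exactly what breaks once a preempted sequence has to be prefilled again:

# Simplified sketch of the current behavior, not the exact upstream code:
# once a sequence has produced completion tokens, only the last token is
# pickled, so a later prefill on the worker side finds no token_ids.
def __getstate__(self):
    return (self.num_tokens, self.num_prompt_tokens, self.num_cached_tokens,
            self.block_table,
            self.token_ids if self.num_completion_tokens == 0 else self.last_token)

def __setstate__(self, state):
    self.num_tokens, self.num_prompt_tokens, self.num_cached_tokens, self.block_table = state[:-1]
    if self.num_completion_tokens == 0:
        self.token_ids = state[-1]   # full prompt, only sent for the first prefill
    else:
        self.last_token = state[-1]  # decode path: self.token_ids is never set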
Problem Reproduction
I use the following script to make this failure mode easy to reproduce and to sanity-check multi-GPU runs. The script drives a large batch and long prompts to increase the chance of preemption and re-prefill.
import time
import argparse
from random import randint, seed

from nanovllm import LLM, SamplingParams

MODEL_PATH = "model_path"


def build_prompts(num_seqs, max_input_len, min_input_len=100, vocab=10000):
    # Synthetic prompts: random lengths and random token IDs from a toy vocab.
    lens = [randint(min_input_len, max_input_len) for _ in range(num_seqs)]
    return [[randint(0, vocab) for _ in range(L)] for L in lens]


def parse_args():
    p = argparse.ArgumentParser()
    p.add_argument("--model", type=str, default=MODEL_PATH, help="HF model dir (local)")
    p.add_argument("--tp", type=int, default=1, help="tensor parallel size")
    p.add_argument("--num_seqs", type=int, default=256)
    p.add_argument("--max_input_len", type=int, default=4096)
    p.add_argument("--min_input_len", type=int, default=100)
    p.add_argument("--max_output_len", type=int, default=256)
    p.add_argument("--temperature", type=float, default=0.7)
    p.add_argument("--max_model_len", type=int, default=4096)
    p.add_argument("--vocab", type=int, default=10000,
                   help="toy vocab size for synthetic token IDs")
    p.add_argument("--seed", type=int, default=0)
    p.add_argument("--gpu_mem_util", type=float, default=0.9,
                   help="gpu_memory_utilization for KV cache planning")
    p.add_argument("--enforce_eager", action="store_true",
                   help="disable CUDA graph (debug/compat)")
    return p.parse_args()


def main():
    args = parse_args()
    seed(args.seed)

    t0 = time.time()
    llm = LLM(
        args.model,
        tensor_parallel_size=args.tp,
        enforce_eager=args.enforce_eager,
        max_model_len=args.max_model_len,
        gpu_memory_utilization=args.gpu_mem_util,
    )
    init_time = time.time() - t0
    print(f"[init] model={args.model} tp={args.tp} init_time={init_time:.3f}s")

    prompt_token_ids = build_prompts(
        num_seqs=args.num_seqs,
        max_input_len=args.max_input_len,
        min_input_len=args.min_input_len,
        vocab=args.vocab,
    )
    sampling_params = [
        SamplingParams(
            temperature=args.temperature,
            max_tokens=args.max_output_len,
            ignore_eos=True,
        ) for _ in range(args.num_seqs)
    ]

    t0 = time.time()
    llm.generate(prompt_token_ids, sampling_params, use_tqdm=False)
    t1 = time.time()

    total_decode_tokens = args.num_seqs * args.max_output_len
    total_time = t1 - t0
    tokps = total_decode_tokens / total_time if total_time > 0 else float("inf")
    print(f"[throughput] num_seqs={args.num_seqs} "
          f"decode_tokens={total_decode_tokens} time={total_time:.3f}s "
          f"throughput={tokps:.2f} tok/s")


if __name__ == "__main__":
    main()
Then run it (saved as bench_tp.py) with:
python bench_tp.py \
    --num_seqs 256 \
    --tp 2 \
    --max_input_len 2048 \
    --max_output_len 256 \
    --temperature 0.7 \
    --max_model_len 4096
The run crashes partway through, and the worker ranks report the AttributeError shown above.
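For reviewers who have not hit this path before: with 256 sequences, long prompts, and ignore_eos, the KV cache runs out of free blocks during decode. My understanding of the scheduler (simplified; exact names may differ) is that it then frees a running sequence's blocks and pushes it back onto the waiting queue, so the sequence is later re-scheduled through the prefill path, and at that point the worker needs the full token_ids again:

# Simplified sketch of the preemption path as I understand it; the exact
# scheduler code may differ. The key point: a preempted sequence loses its
# KV blocks and later goes back through prefill with all of its tokens
# (prompt + completion tokens generated so far).
def preempt(self, seq):
    seq.status = SequenceStatus.WAITING
    self.block_manager.deallocate(seq)   # KV blocks are freed
    self.waiting.appendleft(seq)         # will be prefilled again later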
Following nano-vllm's style, I aimed for a minimal, targeted fix (a sketch of the idea is below); I'm happy to iterate based on your feedback.
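A minimal sketch of the idea, purely illustrative (the needs_prefill flag below is a hypothetical marker I use for the example, not necessarily how the diff expresses the condition): serialize the full token_ids whenever the receiving worker will run prefill for the sequence, and fall back to only the last sampled token for plain decode steps.

# Illustrative only: `needs_prefill` is a hypothetical flag the scheduler
# would set on sequences it is about to dispatch for prefill (fresh prompts
# and preempted sequences alike); the real diff may express this differently.
def __getstate__(self):
    send_full = self.num_completion_tokens == 0 or getattr(self, "needs_prefill", False)
    return (self.num_tokens, self.num_prompt_tokens, self.num_cached_tokens,
            self.block_table, send_full,
            self.token_ids if send_full else self.last_token)

def __setstate__(self, state):
    (self.num_tokens, self.num_prompt_tokens, self.num_cached_tokens,
     self.block_table, send_full, payload) = state
    if send_full:
        self.token_ids = payload   # prefill path reconstructs the full sequence
    else:
        self.last_token = payload  # decode path only needs the last token

Carrying the boolean in the pickled state keeps __setstate__ symmetric instead of having the worker re-derive the condition on its side.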
Update: it turns out this brings too much overhead; I'll try optimizing it.