Chanh Nguyen

Results 2 issues of Chanh Nguyen

This code inside `apply_penalties` uses advanced indexing on a tensor, which triggers `nonzero`, and `nonzero` currently requires a CPU sync in PyTorch. With `torch.cuda.set_sync_debug_mode("warn")`, PyTorch confirms this: ``` /home/coder/vllm/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:1067: UserWarning:...
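To illustrate the kind of pattern described here, the sketch below contrasts boolean-mask indexing with a `torch.where` formulation. It is not the code from `apply_penalties`; the function names `penalize_indexed` and `penalize_where`, the scalar `penalty`, and the `seen_mask` argument are all hypothetical. The assumption is that the mask-indexed read has a data-dependent output size (PyTorch goes through `nonzero` to compute it), while `torch.where` keeps every shape static.

```python
import torch


def penalize_indexed(logits: torch.Tensor, seen_mask: torch.Tensor, penalty: float) -> torch.Tensor:
    """Hypothetical helper: divide the masked logits using boolean-mask indexing.

    `out[seen_mask]` returns a tensor whose length depends on how many mask
    entries are True, so on a CUDA tensor the host has to wait for that count.
    """
    out = logits.clone()
    out[seen_mask] = out[seen_mask] / penalty
    return out


def penalize_where(logits: torch.Tensor, seen_mask: torch.Tensor, penalty: float) -> torch.Tensor:
    """Hypothetical helper: same result, but every intermediate has a fixed shape."""
    return torch.where(seen_mask, logits / penalty, logits)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    if device == "cuda":
        # Ask PyTorch to warn whenever an operation forces a host-device sync.
        torch.cuda.set_sync_debug_mode("warn")

    logits = torch.randn(4, 16, device=device)
    seen_mask = torch.rand(4, 16, device=device) > 0.5

    a = penalize_indexed(logits, seen_mask, 1.2)
    b = penalize_where(logits, seen_mask, 1.2)
    # The check itself copies a boolean to the host; it is only here for the demo.
    print(torch.allclose(a, b))
```

On a CUDA device with the sync debug mode set to `"warn"`, the indexed variant would be expected to emit a warning similar to the one quoted in the issue, while the `torch.where` variant should not. On CPU both simply return the same values.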

## Summary

Support capturing a single CUDA graph for the entire model's forward pass, instead of piecewise graphs. This requires creating persistent buffers to make attention graphable. Credit to @tlrmchlsmth...
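The idea of persistent buffers can be sketched with PyTorch's public CUDA graph API. This is a minimal illustration and not the implementation from the PR: the `GraphedForward` class is hypothetical, it wraps a generic module instead of an attention layer, and it falls back to eager execution when the input is not on a CUDA device. A captured graph replays fixed memory addresses, so inputs are copied into a preallocated buffer before each replay and outputs are read from a buffer that the graph writes to.

```python
import torch


class GraphedForward:
    """Hypothetical wrapper: capture one CUDA graph for a whole forward pass."""

    def __init__(self, model: torch.nn.Module, example_input: torch.Tensor):
        self.model = model
        # Persistent input buffer: the captured graph always reads from here.
        self.static_input = example_input.clone()
        self.graph = None

        if example_input.is_cuda:
            # Warm up on a side stream before capture, as the PyTorch docs advise.
            side = torch.cuda.Stream()
            side.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(side), torch.no_grad():
                for _ in range(3):
                    self.model(self.static_input)
            torch.cuda.current_stream().wait_stream(side)

            self.graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(self.graph), torch.no_grad():
                # Persistent output buffer: each replay overwrites it in place.
                self.static_output = self.model(self.static_input)
        else:
            with torch.no_grad():
                self.static_output = self.model(self.static_input)

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # Copy new data into the persistent buffer instead of passing a new tensor.
        self.static_input.copy_(x)
        if self.graph is not None:
            self.graph.replay()
        else:
            # Eager fallback so the sketch also runs without a GPU.
            with torch.no_grad():
                self.static_output = self.model(self.static_input)
        return self.static_output
```

A caller would construct `GraphedForward(model, example_input)` once and then invoke it with inputs of the same shape and dtype. Because the returned tensor is the persistent output buffer, it should be cloned if its value is needed after the next call.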

documentation
ci/build
v1