Chanh Nguyen
Results: 2 issues of Chanh Nguyen
This code inside `apply_penalties` performs advanced indexing on a tensor, which triggers `nonzero`, and `nonzero` currently requires a CPU sync in PyTorch. With `torch.cuda.set_sync_debug_mode("warn")` enabled, PyTorch confirms this: ``` /home/coder/vllm/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:1067: UserWarning:...
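The snippet above is truncated, so the actual `apply_penalties` code is not shown here. As a minimal sketch of the pattern it describes, assume the indexing uses a boolean mask: selecting with a mask produces a data-dependent number of elements, which PyTorch typically resolves through `nonzero`, and on CUDA that means waiting on the device. The function names `masked_scale_indexing` and `masked_scale_where` and the `penalty` parameter are hypothetical and chosen only for illustration. The second function shows one common way to avoid the sync by keeping shapes static with `torch.where`.

```python
import torch


def masked_scale_indexing(logits: torch.Tensor, mask: torch.Tensor, penalty: float) -> torch.Tensor:
    # Boolean-mask (advanced) indexing: the number of selected elements depends
    # on the mask's contents, so PyTorch typically goes through nonzero(),
    # which on CUDA forces the host to wait for the device.
    out = logits.clone()
    out[mask] = out[mask] * penalty
    return out


def masked_scale_where(logits: torch.Tensor, mask: torch.Tensor, penalty: float) -> torch.Tensor:
    # torch.where computes both branches elementwise with static shapes,
    # so no data-dependent size has to be read back to the host.
    return torch.where(mask, logits * penalty, logits)


if torch.cuda.is_available():
    # Only meaningful on a CUDA build; emits a warning on synchronizing calls.
    torch.cuda.set_sync_debug_mode("warn")
```

Both functions return the same values. The `torch.where` variant does more arithmetic, since it scales every element before selecting, but it keeps the host from blocking on the GPU.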
## Summary Support capturing a single CUDA graph for the entire model's forward pass, instead of piecewise graphs. This requires creating persistent buffers to make attention graphable. Credit to @tlrmchlsmth...
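To illustrate why persistent buffers matter for capturing a whole forward pass, here is a minimal sketch using PyTorch's public CUDA graph API (`torch.cuda.CUDAGraph` and `torch.cuda.graph`). It is a generic pattern, not the implementation from the pull request: the `GraphedForward` class is hypothetical, and it does not cover the attention-specific buffers the summary mentions. A captured graph replays kernels against fixed memory addresses, so new inputs have to be copied into the same buffer that was used during capture. On a machine without CUDA, the sketch falls back to eager execution.

```python
import torch


class GraphedForward:
    """Hypothetical wrapper that captures one CUDA graph for a full forward pass."""

    def __init__(self, model: torch.nn.Module, example_input: torch.Tensor):
        self.model = model
        self.use_graph = example_input.is_cuda
        # Persistent input buffer: the graph always reads from this address.
        self.static_input = example_input.clone()
        self.static_output = None
        self.graph = None
        if self.use_graph:
            # Warm up on a side stream before capture, as PyTorch recommends.
            side = torch.cuda.Stream()
            side.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(side), torch.no_grad():
                for _ in range(3):
                    self.model(self.static_input)
            torch.cuda.current_stream().wait_stream(side)
            # Capture the entire forward pass as a single graph.
            self.graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(self.graph), torch.no_grad():
                self.static_output = self.model(self.static_input)

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # Copy new data into the persistent buffer instead of rebinding it.
        self.static_input.copy_(x)
        if self.use_graph:
            self.graph.replay()
            # Clone so the result survives the next replay.
            return self.static_output.clone()
        with torch.no_grad():
            return self.model(self.static_input)


device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(4, 2).to(device)
runner = GraphedForward(model, torch.zeros(3, 4, device=device))
result = runner(torch.ones(3, 4, device=device))
```

The input shape is fixed at capture time in this sketch; handling several batch sizes would need one captured graph per shape or padding to a fixed size.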
documentation
ci/build
v1