Chanh Nguyen issues

Results: 2 issues of Chanh Nguyen

Speed up decode by remove synchronizing operation in sampler

This code inside `apply_penalties` performs advanced indexing on a tensor, which triggers `nonzero`; in PyTorch that currently requires a CPU sync. With `torch.cuda.set_sync_debug_mode("warn")`, PyTorch confirms this: ``` /home/coder/vllm/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:1067: UserWarning:...
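To illustrate the kind of change the excerpt describes, here is a minimal sketch (not the actual vLLM code). The function names, tensor shapes, and the scalar `penalty` are hypothetical. It contrasts boolean-mask indexing, whose result size depends on the mask contents, with `torch.where`, which keeps shapes static and so should not need the host to wait on the device.

```python
import torch


def penalize_with_mask_indexing(logits: torch.Tensor, mask: torch.Tensor, penalty: float) -> torch.Tensor:
    # Hypothetical helper. Reading `out[mask]` produces a tensor whose size
    # depends on how many mask entries are True, so on CUDA PyTorch has to
    # run `nonzero` and synchronize with the host to learn that size.
    out = logits.clone()
    out[mask] = out[mask] * penalty
    return out


def penalize_with_where(logits: torch.Tensor, mask: torch.Tensor, penalty: float) -> torch.Tensor:
    # Hypothetical helper. `torch.where` computes both branches elementwise
    # and selects per position, so every shape is known ahead of time.
    return torch.where(mask, logits * penalty, logits)


if torch.cuda.is_available():
    # Ask PyTorch to warn whenever an operation forces a host/device sync.
    torch.cuda.set_sync_debug_mode("warn")

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
logits = torch.randn(4, 16, device=device)
mask = logits > 0

indexed = penalize_with_mask_indexing(logits, mask, 0.5)
selected = penalize_with_where(logits, mask, 0.5)
```

Both helpers return the same values; the trade-off is that the `torch.where` version does some redundant arithmetic on positions the mask excludes in exchange for avoiding a data-dependent shape.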

[Core] Support full cuda graph in v1

## Summary Support capturing a single CUDA graph for the entire model's forward pass, instead of piecewise graphs. This requires creating persistent buffers to make attention graphable. Credit to @tlrmchlsmth...
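As a rough illustration of whole-forward-pass capture with persistent buffers, the sketch below uses PyTorch's public `torch.cuda.CUDAGraph` API on a toy module. It is an assumption-laden example, not the PR's implementation: `build_graphed_forward` and the `Linear` model are hypothetical, and the function simply falls back to eager execution when no GPU is present.

```python
import torch


def build_graphed_forward(model: torch.nn.Module, example_input: torch.Tensor):
    """Hypothetical helper: capture one CUDA graph for the whole forward pass."""
    if not torch.cuda.is_available():
        # No GPU: nothing to capture, so just call the model eagerly.
        return lambda x: model(x)

    # Persistent input buffer: the graph records this tensor's address,
    # so new inputs must be copied into it rather than passed directly.
    static_input = example_input.clone()

    # Warm up on a side stream before capture, as PyTorch recommends.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        # The output is also a persistent buffer, overwritten on each replay.
        static_output = model(static_input)

    def replay(x: torch.Tensor) -> torch.Tensor:
        static_input.copy_(x)
        graph.replay()
        return static_output.clone()

    return replay


device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
model = torch.nn.Linear(8, 4).to(device)
x = torch.randn(2, 8, device=device)

forward = build_graphed_forward(model, x)
with torch.no_grad():
    expected = model(x)
result = forward(x)
```

The key constraint this shows is that every tensor touched during capture must live at a fixed address across replays, which is presumably why the summary mentions persistent buffers for attention.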

documentation

ci/build