Woosuk Kwon

281 comments by Woosuk Kwon

Thanks for the PR! I will review it this weekend (maybe Tyler and Rob, too).

@youkaichao

> if we can figure out the conditions, we can try to enable it automatically, I think, without introducing a new user interface like level 4 optimization.

To my...

Hmm... for some reason, I see lower performance for Llama 3.2 1B with full CUDA graphs than with piecewise CUDA graphs.

@alexm-redhat It's

```
python benchmarks/benchmark_latency.py --model meta-llama/Llama-3.2-1B --batch-size 1 --input-len 4096 --output-len 50 --no-enable-prefix-caching --compilation-config '{"full_cuda_graph": true}'
```

I think it makes sense because the full graph capture essentially disables...
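For anyone who prefers to reproduce this offline rather than through the benchmark script, here is a minimal sketch, assuming a vLLM build where `LLM` accepts a dict `compilation_config` and exposes the `full_cuda_graph` flag (the dummy prompt below is just a stand-in for `--input-len 4096`):

```python
# Minimal repro sketch -- assumes a vLLM version where LLM accepts a dict
# `compilation_config` and where the `full_cuda_graph` flag exists.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-1B",
    enable_prefix_caching=False,  # mirrors --no-enable-prefix-caching
    compilation_config={"full_cuda_graph": True},  # set False for piecewise capture
)

prompt = "x " * 2048  # rough stand-in for a long --input-len prompt
outputs = llm.generate([prompt], SamplingParams(max_tokens=50))
print(outputs[0].outputs[0].text)
```

Running it twice, with the flag flipped, should show the same gap as the latency benchmark above.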

@mpjlu Thanks for the good insight!

Thanks for doing this. I'm super excited about this cleanup.

Thanks for the PR! Please ping me when the PR is ready for (final) review.

@LiuXiaoxuanPKU I will take a look, but what do you mean by "almost"? 😅 Just curious.

@LiuXiaoxuanPKU As a sanity check, can you please run a simple perf benchmark? I'm just wondering if we missed anything critical.
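Something like the latency script used earlier in this thread would be enough; adjust the model and lengths to whatever the PR touches:

```
python benchmarks/benchmark_latency.py --model meta-llama/Llama-3.2-1B --batch-size 1 --input-len 4096 --output-len 50
```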

@LiuXiaoxuanPKU Is the PR ready for merge?