Woosuk Kwon

281 comments by Woosuk Kwon

Thanks for the PR! I will review it this weekend (maybe Tyler and Rob, too).

@youkaichao

> if we can figure out the conditions, we can try to enable it automatically, I think, without introducing a new user interface like level 4 optimization.

To my...

Hmm... for some reason, I see lower performance for Llama 3.2 1B with full CUDA graphs than with piecewise CUDA graphs.

@alexm-redhat It's

```
python benchmarks/benchmark_latency.py --model meta-llama/Llama-3.2-1B --batch-size 1 --input-len 4096 --output-len 50 --no-enable-prefix-caching --compilation-config '{"full_cuda_graph": true}'
```

I think it makes sense because the full graph capture essentially disables...
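For anyone who prefers to reproduce this offline rather than through the benchmark script, here is a minimal sketch, assuming a vLLM build where `LLM` accepts a dict `compilation_config` and exposes the `full_cuda_graph` flag (the dummy prompt below is just a stand-in for `--input-len 4096`):

```python
# Minimal repro sketch -- assumes a vLLM version where LLM accepts a dict
# `compilation_config` and where the `full_cuda_graph` flag exists.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-1B",
    enable_prefix_caching=False,  # mirrors --no-enable-prefix-caching
    compilation_config={"full_cuda_graph": True},  # set False for piecewise capture
)

prompt = "x " * 2048  # rough stand-in for a long --input-len prompt
outputs = llm.generate([prompt], SamplingParams(max_tokens=50))
print(outputs[0].outputs[0].text)
```

Running it twice, with the flag flipped, should show the same gap as the latency benchmark above.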

@mpjlu Thanks for the good insight!

Thanks for doing this. I'm super excited about this cleanup.

Thanks for the PR! Please ping me when the PR is ready for (final) review.

@LiuXiaoxuanPKU I will take a look, but what do you mean by "almost"? 😅 Just curious.

@LiuXiaoxuanPKU As a sanity check, can you please run a simple perf benchmark? I'm just wondering if we missed anything critical.
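Something like the latency script used earlier in this thread would be enough; adjust the model and lengths to whatever the PR touches:

```
python benchmarks/benchmark_latency.py --model meta-llama/Llama-3.2-1B --batch-size 1 --input-len 4096 --output-len 50
```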

@LiuXiaoxuanPKU Is the PR ready for merge?