Cody Yu comments

Results: 161 comments by Cody Yu

[Core] Use flashinfer sampling kernel when available

@simon-mo what do we do if isort and yapf are conflicting?

[Core] Use flashinfer sampling kernel when available

@peng1999 please let me know when this is available for the final review and I'll try to get this in asap. Thanks

[Core] Use flashinfer sampling kernel when available

@peng1999 can you look into the CI failure?

[Core] Use flashinfer sampling kernel when available

That's understandable. We could disable FlashInfer sampling in this case. Meanwhile, we may want to note somewhere that users should disable it when they find discrepant (and unacceptable) outputs...

[Core] Use flashinfer sampling kernel when available

Hmm, I suppose this would be the case everywhere then... I'd suggest the following: 1. We disable FlashInfer sampling by default and use the env variable to enable it in...
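
A minimal sketch of the "off by default, opt in through an environment variable" pattern suggested above. The variable name `EXAMPLE_USE_FLASHINFER_SAMPLER` and the helper `flashinfer_sampling_enabled` are hypothetical and chosen only for illustration; they are not taken from the PR.

```python
import os


def flashinfer_sampling_enabled(env=None):
    """Return True only when the user has explicitly opted in.

    `env` defaults to os.environ; passing a plain dict makes the
    function easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    # HYPOTHETICAL variable name, used here for illustration only.
    value = env.get("EXAMPLE_USE_FLASHINFER_SAMPLER", "0")
    # Anything other than an explicit "1" or "true" keeps the feature off.
    return value.strip().lower() in ("1", "true")
```

With this shape, an unset variable keeps the default sampling path, and a user who wants the faster kernel sets the variable to `1` before launching.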

[Core] Use flashinfer sampling kernel when available

> @comaniac it looked like he shared top-logprobs already from the gptq test? If it isn't using logprobs, I agree we should change that

Yeah ideally we should leverage logprobs...

Fix verify tokens with the correct bonus token

> The error comes from sampling. We can not guarantee the output will be all matched even if the target model is the same as the draft model because sampling...

[MISC] Dump model runner inputs when crashing

> Can we add instructions in the [GitHub issues template](https://github.com/vllm-project/vllm/blob/main/.github/ISSUE_TEMPLATE/400-bug%20report.yml) so users can share their logs upon encountering such errors?

Good point. Will do

[MISC] Dump model runner inputs when crashing

@DarkLight1337 added to issue template. PTAL.

[MISC] Dump model runner inputs when crashing

It should be fine as we never load it automatically? But yeah, you may get a virus if someone posts a malicious pickle file to an issue...
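
For anyone who does want to open a pickle file attached to an issue, one possible mitigation is an allowlisting unpickler, following the `find_class` override pattern from the Python `pickle` documentation. This is only a sketch: `RestrictedUnpickler`, `restricted_loads`, and the `SAFE_BUILTINS` allowlist are illustrative names, and a real dump of model runner inputs would likely contain types (for example tensors) that this allowlist rejects.

```python
import builtins
import io
import pickle

# Illustrative allowlist: only plain built-in container and scalar types.
SAFE_BUILTINS = {
    "dict", "list", "tuple", "set", "frozenset",
    "int", "float", "str", "bool", "bytes",
}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global outside the allowlist."""

    def find_class(self, module, name):
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        # Everything else (os.system, eval, arbitrary classes) is rejected.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads, but goes through RestrictedUnpickler."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain data such as `{"a": [1, 2]}` round-trips through `restricted_loads`, while a payload that references a callable raises `pickle.UnpicklingError` instead of resolving it. An allowlist reduces the risk but does not make untrusted pickles safe in general.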