Thomas Parnell
@DarkLight1337 CI issues are fixed now.
Is there any script that I can use to reproduce this issue? I've been looking into #5607, which appears related, but after some digging, that bug seems to be related...
Yeah, we've fixed this issue on our fork (as you found [here](https://github.com/IBM/vllm/pull/35)). Let me create a PR to contribute the fix upstream.
@randxie Interesting. I actually tried to test [these changes](https://github.com/triton-lang/triton/pull/3544) that were merged into Triton main in [our fork](https://github.com/IBM/vllm/pull/34), but it didn't help. I don't really see much else that...
There was a PR merged into Triton yesterday that tries to address this issue: https://github.com/triton-lang/triton/pull/4295. This fix is not yet included in `triton==3.0.0` which was released on PyPI yesterday.
So I've been digging into this a bit more and here is a summary of my findings:
- Triton recently released v3.0.0, but it does **not** seem to include the...
Fix #6140 is ready from my pov, will try to get it approved and merged asap.
> I am fine having this in, can we log once if this happens so there's a hint of the performance degradation to users? I added a warning when we...
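The "log once" pattern mentioned above can be sketched with the standard library alone. This is a minimal illustration, not vLLM's actual helper; the `warn_once` name and the message text are assumptions made up for the example.

```python
import functools
import logging

logger = logging.getLogger("perf")

@functools.lru_cache(maxsize=None)
def warn_once(message: str) -> None:
    # lru_cache memoizes on the message string, so repeated calls
    # with the same message become no-ops after the first warning.
    logger.warning(message)

# Hypothetical degradation hint: emitted once, no matter how often
# the slow path is taken.
warn_once("Falling back to slower decode path; performance may degrade.")
warn_once("Falling back to slower decode path; performance may degrade.")
```

Caching on the message keeps the hint visible without flooding the logs on every request that hits the slow path.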
@njhill I saw you cleaned up this code recently. Did you happen to check the case with chunked prefill too? It looked like it was broken a couple of weeks...
Thanks @jeejeelee, but that issue relates to prefill performance. A quick look using the torch profiler indicates that the majority of time is spent in the decode kernel for both backends: using...