Thomas Parnell
@DarkLight1337 CI issues are fixed now.
Is there any script that I can use to reproduce this issue? I've been looking into #5607, which appears related, but after some digging, that bug seems to be related...
Yeah, we've fixed this issue on our fork (as you found [here](https://github.com/IBM/vllm/pull/35)). Let me create a PR to contribute the fix upstream.
@randxie Interesting. I actually tried to test [these changes](https://github.com/triton-lang/triton/pull/3544) that were merged into Triton main in [our fork](https://github.com/IBM/vllm/pull/34), but it didn't help. I don't really see much else that...
There was a PR merged into Triton yesterday that tries to address this issue: https://github.com/triton-lang/triton/pull/4295. This fix is not yet included in `triton==3.0.0` which was released on PyPI yesterday.
So I've been digging into this a bit more and here is a summary of my findings:
- Triton recently released v3.0.0, but it does **not** seem to include the...
Fix #6140 is ready from my pov, will try to get it approved and merged asap.
> I am fine having this in, can we log once if this happens so there's a hint of the performance degradation to users? I added a warning when we...
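The "log once" pattern mentioned above can be sketched with the standard library alone. This is a minimal illustration, not vLLM's actual helper; the `warn_once` name and the message text are assumptions made up for the example.

```python
import functools
import logging

logger = logging.getLogger("perf")

@functools.lru_cache(maxsize=None)
def warn_once(message: str) -> None:
    # lru_cache memoizes on the message string, so repeated calls
    # with the same message become no-ops after the first warning.
    logger.warning(message)

# Hypothetical degradation hint: emitted once, no matter how often
# the slow path is taken.
warn_once("Falling back to slower decode path; performance may degrade.")
warn_once("Falling back to slower decode path; performance may degrade.")
```

Caching on the message keeps the hint visible without flooding the logs on every request that hits the slow path.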
@njhill I saw you cleaned up this code recently. Did you happen to check the case with chunked prefill too? It looked like it was broken a couple of weeks...
Thanks @jeejeelee, but that issue relates to prefill performance. A quick look using the torch profiler indicates that the majority of time is spent in the decode kernel for both backends: using...