Philipp Moritz
Why don't we test whether there is a performance overhead (probably the compiler is already smart enough to optimize that away -- it should be, since the argument is constant in...
It is not about naming; it's more about having all these special cases and little conversion utilities :)
Could you do a bisection of the commits between 0.4.0 and 0.4.1 to pinpoint which commit caused the issue? There is a possibility that this is fixed...
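In case it helps, a rough sketch of how that bisection could be run (the release tag names are assumptions, and `repro.py` is a placeholder for whatever reproduction script you use):

```bash
git clone https://github.com/vllm-project/vllm.git && cd vllm
git bisect start
git bisect bad v0.4.1     # first version where the issue appears (tag name assumed)
git bisect good v0.4.0    # last known good version (tag name assumed)
# Rebuild and test each candidate commit automatically; repro.py is a placeholder
# script that should exit non-zero when the bug reproduces.
git bisect run sh -c "pip install -e . && python repro.py"
git bisect reset
```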
If I revert https://github.com/vllm-project/vllm/pull/2152, the memory leak goes away :)
There is no memory leak with `--enforce-eager` on the main branch. I believe this is some bad interaction between the torch collective communication ops and CUDA graphs :)
I can dig into this more later today and see if I can figure out where exactly the leak is happening :)
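To make the suspected interaction concrete, here is a minimal standalone sketch (not vLLM code) of the kind of repro I have in mind: capture `torch.distributed.all_reduce` inside a CUDA graph and watch allocated memory across replays. It assumes two GPUs and a `torchrun --nproc_per_node=2` launch, and whether it actually leaks will depend on the torch/NCCL versions.

```python
# Minimal sketch (not vLLM code): replay an all_reduce captured in a CUDA graph and
# watch allocated memory. Launch with e.g. `torchrun --nproc_per_node=2 repro.py`
# (script name is a placeholder). Depending on the torch/NCCL version you may need
# NCCL_ASYNC_ERROR_HANDLING=0 in the environment for graph capture to succeed.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

x = torch.ones(1024, 1024, device="cuda")

# Warm up the collective outside of capture.
for _ in range(3):
    dist.all_reduce(x)
torch.cuda.synchronize()

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    dist.all_reduce(x)

for step in range(20):
    graph.replay()
    torch.cuda.synchronize()
    if rank == 0:
        print(step, torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

dist.destroy_process_group()
```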
My current workaround is to use cupy as it was before https://github.com/vllm-project/vllm/pull/2152; that's working well. I haven't found the root cause of the bug with torch.distributed.all_reduce yet, unfortunately :(
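For reference, the cupy path boils down to the raw `cupy.cuda.nccl` interface; here is a single-process sketch of that API (this is just the cupy API, not the vLLM wrapper, and in real tensor-parallel use the unique id is of course shared across ranks):

```python
import cupy
from cupy.cuda import nccl

uid = nccl.get_unique_id()               # in multi-GPU use this is broadcast to all ranks
comm = nccl.NcclCommunicator(1, uid, 0)  # (n_devices, unique_id, rank)

x = cupy.ones((1024,), dtype=cupy.float32)
# In-place sum-allreduce; with a single rank this is effectively a copy.
comm.allReduce(
    x.data.ptr, x.data.ptr, x.size,
    nccl.NCCL_FLOAT32, nccl.NCCL_SUM,
    cupy.cuda.Stream.null.ptr,
)
cupy.cuda.Stream.null.synchronize()
print(x[:4])
```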
@WoosukKwon I believe this memory leak is removed by https://github.com/vllm-project/vllm/pull/2192, so maybe that's a way forward to fix this without needing cupy. What do you think?
I believe this is now fixed by https://github.com/vllm-project/vllm/pull/2192 if the custom allreduce kernel is used; please comment on the issue or open a new one if you still see a...
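For anyone who wants to double-check on their side, a rough sketch of exercising the tensor-parallel path with the custom allreduce kernel left enabled (it is on by default when supported, i.e. `disable_custom_all_reduce=False`; the model name here is just a small placeholder):

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size > 1 so the allreduce path is actually exercised; swap in
# whatever model you were seeing the leak with.
llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    disable_custom_all_reduce=False,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```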
Nice, thanks for making these changes -- this looks a bunch cleaner now! Optional suggestion that would be even cleaner: Rename `supports_checkpoint` to `override_quantization_method` with a different signature (see below) and...
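Roughly, the kind of classmethod hook this could look like (the class and parameter names here are only illustrative, not the exact signature from the suggestion above):

```python
from typing import Any, Dict, Optional

class QuantizationConfig:
    """Stand-in for the quantization base config; illustrative only."""

    @classmethod
    def override_quantization_method(
        cls, hf_quant_cfg: Dict[str, Any], user_quant: Optional[str]
    ) -> Optional[str]:
        # Return the quantization method name this config wants to take over for
        # the given checkpoint, or None to leave the current choice untouched.
        return None
```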