Philipp Moritz
Why don't we test whether there is a performance overhead (probably the compiler is already smart enough to optimize that away -- it should be, since the argument is constant in...
It is not about naming; it's more about having all these special cases and little conversion utilities :)
Could you do a bisection of the commits between 0.4.0 and 0.4.1 to pinpoint which commit caused the issue? There is a possibility that this is fixed...
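In case it helps, a rough sketch of how that bisection could be run (the release tag names are assumptions, and `repro.py` is a placeholder for whatever reproduction script you use):

```bash
git clone https://github.com/vllm-project/vllm.git && cd vllm
git bisect start
git bisect bad v0.4.1     # first version where the issue appears (tag name assumed)
git bisect good v0.4.0    # last known good version (tag name assumed)
# Rebuild and test each candidate commit automatically; repro.py is a placeholder
# script that should exit non-zero when the bug reproduces.
git bisect run sh -c "pip install -e . && python repro.py"
git bisect reset
```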
If I revert https://github.com/vllm-project/vllm/pull/2152, the memory leak goes away :)
There is no memory leak with `--enforce-eager` on the main branch. I believe this is some bad interaction between the torch collective communication ops and CUDA graphs :)
I can dig into this more later today and see if I can figure out where exactly the leak is happening :)
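To make the suspected interaction concrete, here is a minimal standalone sketch (not vLLM code) of the kind of repro I have in mind: capture `torch.distributed.all_reduce` inside a CUDA graph and watch allocated memory across replays. It assumes two GPUs and a `torchrun --nproc_per_node=2` launch, and whether it actually leaks will depend on the torch/NCCL versions.

```python
# Minimal sketch (not vLLM code): replay an all_reduce captured in a CUDA graph and
# watch allocated memory. Launch with e.g. `torchrun --nproc_per_node=2 repro.py`
# (script name is a placeholder). Depending on the torch/NCCL version you may need
# NCCL_ASYNC_ERROR_HANDLING=0 in the environment for graph capture to succeed.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

x = torch.ones(1024, 1024, device="cuda")

# Warm up the collective outside of capture.
for _ in range(3):
    dist.all_reduce(x)
torch.cuda.synchronize()

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    dist.all_reduce(x)

for step in range(20):
    graph.replay()
    torch.cuda.synchronize()
    if rank == 0:
        print(step, torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

dist.destroy_process_group()
```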
My current workaround is to use cupy as it was before https://github.com/vllm-project/vllm/pull/2152; that's working well. I haven't found the root cause of the bug with torch.distributed.all_reduce yet, unfortunately :(
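For reference, the cupy path boils down to the raw `cupy.cuda.nccl` interface; here is a single-process sketch of that API (this is just the cupy API, not the vLLM wrapper, and in real tensor-parallel use the unique id is of course shared across ranks):

```python
import cupy
from cupy.cuda import nccl

uid = nccl.get_unique_id()               # in multi-GPU use this is broadcast to all ranks
comm = nccl.NcclCommunicator(1, uid, 0)  # (n_devices, unique_id, rank)

x = cupy.ones((1024,), dtype=cupy.float32)
# In-place sum-allreduce; with a single rank this is effectively a copy.
comm.allReduce(
    x.data.ptr, x.data.ptr, x.size,
    nccl.NCCL_FLOAT32, nccl.NCCL_SUM,
    cupy.cuda.Stream.null.ptr,
)
cupy.cuda.Stream.null.synchronize()
print(x[:4])
```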
@WoosukKwon I believe this memory leak is removed by https://github.com/vllm-project/vllm/pull/2192, so maybe that's a way forward to fix this without needing cupy. What do you think?
I believe this is now fixed by https://github.com/vllm-project/vllm/pull/2192 if the custom allreduce kernel is used; please comment on the issue or open a new one if you still see a...
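For anyone who wants to double-check on their side, a rough sketch of exercising the tensor-parallel path with the custom allreduce kernel left enabled (it is on by default when supported, i.e. `disable_custom_all_reduce=False`; the model name here is just a small placeholder):

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size > 1 so the allreduce path is actually exercised; swap in
# whatever model you were seeing the leak with.
llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    disable_custom_all_reduce=False,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```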
Nice, thanks for making these changes -- this looks a bunch cleaner now! Optional suggestion that would be even cleaner: Rename `supports_checkpoint` to `override_quantization_method` with a different signature (see below) and...
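Roughly, the kind of classmethod hook this could look like (the class and parameter names here are only illustrative, not the exact signature from the suggestion above):

```python
from typing import Any, Dict, Optional

class QuantizationConfig:
    """Stand-in for the quantization base config; illustrative only."""

    @classmethod
    def override_quantization_method(
        cls, hf_quant_cfg: Dict[str, Any], user_quant: Optional[str]
    ) -> Optional[str]:
        # Return the quantization method name this config wants to take over for
        # the given checkpoint, or None to leave the current choice untouched.
        return None
```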