Dipika Sikka

Results: 27 comments by Dipika Sikka

@robertgshaw2-neuralmagic I think updating the `get_scheme` function is beyond the scope of this PR. I'd like to first land the use of compressed-tensors without any dependency conflicts. Refactoring `get_scheme` should be a...

> Just left one quick comment. I'm going to pull this PR in and try it with a compressed-tensors W8A16 model.

Confirmed this works with compressed-tensors W8A16.

Still need to test with a DeepSeek-V2 model.
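A minimal smoke test along those lines, using vLLM's offline `LLM` API with a placeholder model name (the actual checkpoints tested are not named here), might look like:

```python
# Quick sanity check that a compressed-tensors quantized checkpoint loads and
# generates. The model name is a placeholder, not the checkpoint from the PR.
from vllm import LLM, SamplingParams

llm = LLM(model="org/model-w8a16-compressed-tensors")  # hypothetical W8A16 model
params = SamplingParams(temperature=0.0, max_tokens=32)

for out in llm.generate(["The capital of France is"], params):
    print(out.outputs[0].text)
```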

@ElizaWszola seems like the kernel test failures start after `tests/kernels/test_moe.py` - could you take a look?
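One way to check whether those failures are order-dependent (state leaking out of the MoE tests) is to run the suspect file in isolation and then the rest of the kernel tests; a rough sketch, assuming a standard pytest setup from the repo root:

```python
# Run the MoE kernel tests alone, then the full kernels suite, to see whether
# the downstream failures only appear after test_moe.py has run.
import pytest

pytest.main(["-q", "tests/kernels/test_moe.py"])
pytest.main(["-q", "tests/kernels"])
```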

> This LGTM but have you verified that DeepSeek MoE is okay with this PR?

Yes: verified with DeepSeek, Mixtral, and Qwen.

@mgoin I can't resolve the conversations, but I've addressed all but one comment.

Latency benchmarking with two 80GB A100s:
```
Mixtral Fused MoE with AWQ:
Avg latency: 1.3650233147976298 seconds
10% percentile latency: 1.3638953405432404 seconds
25% percentile latency: 1.3643284866120666 seconds
50% percentile latency: ...
```
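For context, percentile figures like these come from timing repeated end-to-end generations; a rough sketch of that measurement, assuming vLLM's offline `LLM` API and a placeholder AWQ Mixtral checkpoint rather than the project's benchmark script:

```python
# Crude latency measurement in the spirit of the numbers above. The model name,
# prompt, and iteration count are assumptions for illustration only.
import time

import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(model="org/Mixtral-8x7B-AWQ", quantization="awq")  # hypothetical AWQ checkpoint
params = SamplingParams(temperature=0.0, max_tokens=128)
prompt = ["Hello, my name is"]

latencies = []
for _ in range(30):  # warmup iterations omitted for brevity
    start = time.perf_counter()
    llm.generate(prompt, params)
    latencies.append(time.perf_counter() - start)

print(f"Avg latency: {np.mean(latencies)} seconds")
for pct in (10, 25, 50):
    print(f"{pct}% percentile latency: {np.percentile(latencies, pct)} seconds")
```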