Casper

293 comments by Casper

> it's possible to run in W8A16 but you can not benefit from int8 tensor core in that case

We have a misunderstanding of what W8 and A8 mean here....
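To make the distinction concrete, here is a small PyTorch sketch (mine, not from the thread): W8A16 dequantizes the int8 weights back to floating point before the matmul, so the GEMM still runs in fp16/fp32 and int8 tensor cores are never engaged, whereas W8A8 quantizes the activations too and feeds int8 operands to the GEMM.

```python
# Minimal sketch of the W8/A8 naming: "W8" = int8 weights, "A8" = int8 activations.
# Only the W8A8 path hands int8 operands to the GEMM (and thus int8 tensor cores).
import torch

def quantize_per_tensor(t: torch.Tensor):
    # symmetric int8 quantization: map the max magnitude to 127
    scale = t.abs().max() / 127.0
    q = torch.clamp((t / scale).round(), -128, 127).to(torch.int8)
    return q, scale

x = torch.randn(4, 64)   # activations
w = torch.randn(64, 64)  # weights

w_q, w_scale = quantize_per_tensor(w)

# W8A16: dequantize the weights, run the matmul in floating point
y_w8a16 = x @ (w_q.float() * w_scale)

# W8A8: quantize activations too, accumulate in int32, rescale afterwards
x_q, x_scale = quantize_per_tensor(x)
y_w8a8 = (x_q.to(torch.int32) @ w_q.to(torch.int32)).float() * (x_scale * w_scale)

# both approximate x @ w; the difference is only quantization noise
print((y_w8a16 - y_w8a8).abs().max())
```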

I have implemented SmoothQuant quantization of input activations in [AutoAWQ/smoothquant](https://github.com/casper-hansen/AutoAWQ/pull/71). However, I am getting an error related to how the model is loaded. This is the first layer when I print...
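For context, the smoothing step that SmoothQuant applies before quantizing activations looks roughly like the following (a sketch of the paper's formulation with illustrative names, not the actual AutoAWQ code):

```python
# Rough sketch of SmoothQuant smoothing (per the paper, not the AutoAWQ code):
# divide activations and multiply weights by a per-input-channel factor s so
# activation outliers become easier to quantize to int8.
import torch

def smooth(x_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """x_absmax: per-channel max |activation| from calibration data, shape [in_features];
    weight: [out_features, in_features]."""
    w_absmax = weight.abs().amax(dim=0)  # per input channel
    s = (x_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)
    smoothed_weight = weight * s         # W' = W * diag(s)
    return s, smoothed_weight            # X' = X / s is applied at runtime

# usage: fold 1/s into the preceding LayerNorm, then quantize X' and W' to int8
x_absmax = torch.rand(64) * 10
w = torch.randn(128, 64)
s, w_smoothed = smooth(x_absmax, w)
```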

@AniZpZ Loading works for now with this PR: https://github.com/AniZpZ/vllm/pull/1. Could you please accept the PR? Then I will focus on optimizing the accuracy/perplexity of models. Also, what is the command...

Please look at the model loading code soon. Until then, I must use my fork to continue development. The goal is to create W8A8 models with the KV8 cache...

I have conducted more experiments that achieve the same results as in the paper. There is only one problem: per-channel weight quantization is not compatible with the CUDA kernels because...
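The specific kernel incompatibility is cut off above, but as a general illustration of the difference involved: per-tensor weight quantization leaves a single scalar to rescale the int32 GEMM output, while per-channel quantization requires a per-output-channel rescale in the epilogue. A rough PyTorch sketch (assumed shapes, not the actual CUDA kernels):

```python
# General illustration only: per-tensor quantization uses one weight scale, so
# the int32 GEMM output is rescaled by a single scalar; per-channel quantization
# needs a per-output-channel rescale in the epilogue, which a kernel hard-coded
# for a scalar scale cannot apply.
import torch

w = torch.randn(128, 64)  # [out, in]

# per-tensor: one scale for the whole matrix
scale_tensor = w.abs().max() / 127.0
w_q_tensor = torch.clamp((w / scale_tensor).round(), -128, 127).to(torch.int8)

# per-channel: one scale per output row
scale_channel = w.abs().amax(dim=1, keepdim=True) / 127.0  # [out, 1]
w_q_channel = torch.clamp((w / scale_channel).round(), -128, 127).to(torch.int8)

x_q = torch.randint(-128, 127, (4, 64), dtype=torch.int32)  # pretend-quantized activations
acc_t = x_q @ w_q_tensor.to(torch.int32).T                  # int32 accumulators, [4, out]
acc_c = x_q @ w_q_channel.to(torch.int32).T

y_per_tensor = acc_t.float() * scale_tensor                 # scalar epilogue
y_per_channel = acc_c.float() * scale_channel.T             # per-column epilogue, [1, out] broadcast
```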

For those looking for inspiration to claim the $1200 for Mixtral:

- [MegaBlocks (Triton)](https://github.com/stanford-futuredata/megablocks/blob/main/megablocks/backend/kernels.py): A good baseline for communication / various distributed operations.
- [vLLM Expert Parallelism PR (Triton)](https://github.com/scv119/vllm/tree/moe): An...

> For Mixtral additional kernels required are both sparse and grouped permute_and_compute as well as kernels for gating experts.

Here is my answer specific to Mixtral. Solutions that achieve a...
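For reference, the "gating experts" part amounts to a top-2 router as in Mixtral; a minimal PyTorch sketch (illustrative, not anyone's bounty submission):

```python
# Minimal sketch of Mixtral-style gating: a linear router scores all experts per
# token, the top-2 are selected, and their softmax weights later mix the expert outputs.
import torch

def top2_gate(hidden: torch.Tensor, router_weight: torch.Tensor):
    """hidden: [tokens, dim]; router_weight: [num_experts, dim]."""
    logits = hidden @ router_weight.T                   # [tokens, num_experts]
    weights, expert_ids = torch.topk(logits, k=2, dim=-1)
    weights = torch.softmax(weights, dim=-1)            # renormalize over the top-2
    return weights, expert_ids                          # both [tokens, 2]

hidden = torch.randn(16, 4096)
router = torch.randn(8, 4096)                           # 8 experts, as in Mixtral
gate_weights, expert_ids = top2_gate(hidden, router)
```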

Triton kernel for expert computation in MoE, compatible with float16 and bfloat16. Speedup of 2.3-5x depending on batch size. You would just need to make it compatible with axolotl...
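What such a kernel computes can be expressed as a plain PyTorch reference (a simplified sketch, not the Triton kernel itself, and using a single-projection MLP rather than Mixtral's gated SwiGLU):

```python
# Plain PyTorch reference for the fused expert computation: group tokens by
# assigned expert, run each expert's MLP on its group, and scatter the weighted
# results back into the output.
import torch

def moe_forward(hidden, expert_ids, gate_weights, w1, w2):
    """hidden: [tokens, dim]; expert_ids / gate_weights: [tokens, k];
    w1: [experts, dim, ffn]; w2: [experts, ffn, dim]."""
    out = torch.zeros_like(hidden)
    num_experts = w1.shape[0]
    for e in range(num_experts):
        token_idx, slot = torch.where(expert_ids == e)   # tokens routed to expert e
        if token_idx.numel() == 0:
            continue
        x = hidden[token_idx]                            # gather the group
        y = torch.nn.functional.silu(x @ w1[e]) @ w2[e]  # simplified expert MLP
        out.index_add_(0, token_idx, y * gate_weights[token_idx, slot, None])
    return out
```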

FYI, high throughput is hard when using quantized models in general, regardless of the framework. But if you can manage to run with a batch size (data parallelism) of less than...

> Does the AWQ implementation support higher than 4 bits per weight, for example 8 bits?

Not yet. It’s 4-bit only at the moment.