Casper

293 comments by Casper

> it's possible to run in W8A16 but you can not benefit from int8 tensor core in that case

We have a misunderstanding of what W8 and A8 mean here....
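To make the distinction concrete, here is a small PyTorch sketch (mine, not from the thread): W8A16 dequantizes the int8 weights back to floating point before the matmul, so the GEMM still runs in fp16/fp32 and int8 tensor cores are never engaged, whereas W8A8 quantizes the activations too and feeds int8 operands to the GEMM.

```python
# Minimal sketch of the W8/A8 naming: "W8" = int8 weights, "A8" = int8 activations.
# Only the W8A8 path hands int8 operands to the GEMM (and thus int8 tensor cores).
import torch

def quantize_per_tensor(t: torch.Tensor):
    # symmetric int8 quantization: map the max magnitude to 127
    scale = t.abs().max() / 127.0
    q = torch.clamp((t / scale).round(), -128, 127).to(torch.int8)
    return q, scale

x = torch.randn(4, 64)   # activations
w = torch.randn(64, 64)  # weights

w_q, w_scale = quantize_per_tensor(w)

# W8A16: dequantize the weights, run the matmul in floating point
y_w8a16 = x @ (w_q.float() * w_scale)

# W8A8: quantize activations too, accumulate in int32, rescale afterwards
x_q, x_scale = quantize_per_tensor(x)
y_w8a8 = (x_q.to(torch.int32) @ w_q.to(torch.int32)).float() * (x_scale * w_scale)

# both approximate x @ w; the difference is only quantization noise
print((y_w8a16 - y_w8a8).abs().max())
```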

I have implemented SmoothQuant quantization of input activations in [AutoAWQ/smoothquant](https://github.com/casper-hansen/AutoAWQ/pull/71). However, I am getting an error related to how the model is loaded. This is the first layer when I print...
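For context, the smoothing step that SmoothQuant applies before quantizing activations looks roughly like the following (a sketch of the paper's formulation with illustrative names, not the actual AutoAWQ code):

```python
# Rough sketch of SmoothQuant smoothing (per the paper, not the AutoAWQ code):
# divide activations and multiply weights by a per-input-channel factor s so
# activation outliers become easier to quantize to int8.
import torch

def smooth(x_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """x_absmax: per-channel max |activation| from calibration data, shape [in_features];
    weight: [out_features, in_features]."""
    w_absmax = weight.abs().amax(dim=0)  # per input channel
    s = (x_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)
    smoothed_weight = weight * s         # W' = W * diag(s)
    return s, smoothed_weight            # X' = X / s is applied at runtime

# usage: fold 1/s into the preceding LayerNorm, then quantize X' and W' to int8
x_absmax = torch.rand(64) * 10
w = torch.randn(128, 64)
s, w_smoothed = smooth(x_absmax, w)
```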

@AniZpZ Loading works for now with this PR: https://github.com/AniZpZ/vllm/pull/1. Could you please accept the PR? Then I will focus on optimizing the accuracy/perplexity of models. Also, what is the command...

Please look at the model loading code soon. Until then, I must use my fork to continue development. The goal is to create W8A8 models with the KV8 cache...

I have conducted more experiments that achieve the same results as in the paper. There is only one problem: per-channel weight quantization is not compatible with the CUDA kernels because...
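The specific kernel incompatibility is cut off above, but as a general illustration of the difference involved: per-tensor weight quantization leaves a single scalar to rescale the int32 GEMM output, while per-channel quantization requires a per-output-channel rescale in the epilogue. A rough PyTorch sketch (assumed shapes, not the actual CUDA kernels):

```python
# General illustration only: per-tensor quantization uses one weight scale, so
# the int32 GEMM output is rescaled by a single scalar; per-channel quantization
# needs a per-output-channel rescale in the epilogue, which a kernel hard-coded
# for a scalar scale cannot apply.
import torch

w = torch.randn(128, 64)  # [out, in]

# per-tensor: one scale for the whole matrix
scale_tensor = w.abs().max() / 127.0
w_q_tensor = torch.clamp((w / scale_tensor).round(), -128, 127).to(torch.int8)

# per-channel: one scale per output row
scale_channel = w.abs().amax(dim=1, keepdim=True) / 127.0  # [out, 1]
w_q_channel = torch.clamp((w / scale_channel).round(), -128, 127).to(torch.int8)

x_q = torch.randint(-128, 127, (4, 64), dtype=torch.int32)  # pretend-quantized activations
acc_t = x_q @ w_q_tensor.to(torch.int32).T                  # int32 accumulators, [4, out]
acc_c = x_q @ w_q_channel.to(torch.int32).T

y_per_tensor = acc_t.float() * scale_tensor                 # scalar epilogue
y_per_channel = acc_c.float() * scale_channel.T             # per-column epilogue, [1, out] broadcast
```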

For those looking for inspiration to claim the $1200 for Mixtral:

- [MegaBlocks (Triton)](https://github.com/stanford-futuredata/megablocks/blob/main/megablocks/backend/kernels.py): A good baseline for communication / various distributed operations.
- [vLLM Expert Parallelism PR (Triton)](https://github.com/scv119/vllm/tree/moe): An...

> For Mixtral additional kernels required are both sparse and grouped permute_and_compute as well as kernels for gating experts.

Here is my answer specific to Mixtral. Solutions that achieve a...
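For reference, the "gating experts" part amounts to a top-2 router as in Mixtral; a minimal PyTorch sketch (illustrative, not anyone's bounty submission):

```python
# Minimal sketch of Mixtral-style gating: a linear router scores all experts per
# token, the top-2 are selected, and their softmax weights later mix the expert outputs.
import torch

def top2_gate(hidden: torch.Tensor, router_weight: torch.Tensor):
    """hidden: [tokens, dim]; router_weight: [num_experts, dim]."""
    logits = hidden @ router_weight.T                   # [tokens, num_experts]
    weights, expert_ids = torch.topk(logits, k=2, dim=-1)
    weights = torch.softmax(weights, dim=-1)            # renormalize over the top-2
    return weights, expert_ids                          # both [tokens, 2]

hidden = torch.randn(16, 4096)
router = torch.randn(8, 4096)                           # 8 experts, as in Mixtral
gate_weights, expert_ids = top2_gate(hidden, router)
```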

Triton kernel for expert computation in MoE, compatible with float16 and bfloat16. Speedup of 2.3-5x depending on batch size. You would just need to make it compatible with axolotl...
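What such a kernel computes can be expressed as a plain PyTorch reference (a simplified sketch, not the Triton kernel itself, and using a single-projection MLP rather than Mixtral's gated SwiGLU):

```python
# Plain PyTorch reference for the fused expert computation: group tokens by
# assigned expert, run each expert's MLP on its group, and scatter the weighted
# results back into the output.
import torch

def moe_forward(hidden, expert_ids, gate_weights, w1, w2):
    """hidden: [tokens, dim]; expert_ids / gate_weights: [tokens, k];
    w1: [experts, dim, ffn]; w2: [experts, ffn, dim]."""
    out = torch.zeros_like(hidden)
    num_experts = w1.shape[0]
    for e in range(num_experts):
        token_idx, slot = torch.where(expert_ids == e)   # tokens routed to expert e
        if token_idx.numel() == 0:
            continue
        x = hidden[token_idx]                            # gather the group
        y = torch.nn.functional.silu(x @ w1[e]) @ w2[e]  # simplified expert MLP
        out.index_add_(0, token_idx, y * gate_weights[token_idx, slot, None])
    return out
```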

FYI, high throughput is hard when using quantized models in general, regardless of the framework. But if you can manage to run with a batch size (data parallelism) of less than...

> Does the AWQ implementation support higher than 4 bits per weight, for example 8 bits?

Not yet. It’s 4-bit only at the moment.