Casper


ExLlama is not the same as GPTQ. ExLlama is a library built specifically around the Llama model, with a ton of optimizations. It just chose to use GPTQ for quantization....

> I know that exllama has some optimizations on CUDA cores. In fact, I mainly want to know: if awq uses GPU optimization technology, will the performance of awq...

It seems TinyChat is currently very CPU-bound for all models other than LLaMa. On an A6000, 3090, or 4090 with an AMD EPYC 7-Series CPU, performance is largely the same due to low...

> TGI

That sounds great! :) The only thing to keep in mind is that TGI has recently switched its license, so be careful if you plan to use their code. Edit:...

XGen is a LLaMa-based model and is already supported. Posting an easy-to-use script here for you to get started. Note that this assumes you have:
1. built AWQ
2. `pip...
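Since the script itself is truncated here, below is a minimal sketch of what such a quantization script typically looks like using the AutoAWQ-style API; the model path, output path, and `quant_config` values are illustrative assumptions, not the exact script from the original comment:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Assumed paths and settings, for illustration only
model_path = "Salesforce/xgen-7b-8k-base"
quant_path = "xgen-7b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ quantization, then save the INT4 weights and tokenizer
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```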

Yes, every time you load the model, it replaces some layers with their custom layers that use the CUDA kernels. Only the quantized layers run in INT4 while the rest...
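To illustrate the mechanism (the names here are hypothetical, not the library's actual API): loading walks the module tree and swaps each `nn.Linear` for a quantized drop-in backed by the CUDA kernels, roughly like:

```python
import torch.nn as nn

def replace_with_quant_layers(module: nn.Module, quant_cls) -> None:
    # Recursively swap every nn.Linear for a quantized replacement.
    # `quant_cls.from_linear` is a hypothetical constructor that packs
    # the FP16 weights into the INT4 format the CUDA kernels expect.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, quant_cls.from_linear(child))
        else:
            replace_with_quant_layers(child, quant_cls)
```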

> Do you have a guess on why it happens? Why do you need to run the search on GPU?

It may be due to different precision on GPU/CPU...
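As a toy illustration of how precision alone can shift results (not the actual search code):

```python
import torch

x = torch.randn(1 << 16)
# The same reduction accumulated in different precisions gives slightly
# different answers; on a real model, these small numeric differences can
# nudge a scale search toward a different optimum on GPU vs. CPU.
print(x.double().sum().item())  # FP64 reference
print(x.float().sum().item())   # FP32, as typically used on CPU
print(x.half().sum().item())    # FP16, as typically used on GPU
```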

Marlin used a different method for measuring perplexity, so the two numbers can't be compared, unfortunately.
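For context, here is one common recipe (a sketch, not Marlin's exact code); tools differ in window length, stride/overlap, and evaluation dataset, and each of those choices shifts the absolute number:

```python
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor, ctx_len: int = 2048) -> float:
    # Evaluate in fixed, non-overlapping windows (one of several recipes).
    nlls, n_tokens = [], 0
    for i in range(0, input_ids.size(1), ctx_len):
        chunk = input_ids[:, i : i + ctx_len]
        if chunk.size(1) < 2:
            break  # need at least one prediction target
        out = model(chunk, labels=chunk)  # HF-style causal LM: loss is mean NLL
        n = chunk.size(1) - 1             # labels are shifted, so n-1 predictions
        nlls.append(out.loss * n)
        n_tokens += n
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()
```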

FYI, Devin (the closed-source one) is fully integrated with VS Code. That means it has access to all the extensions, linting, and debugging tools. I think this is something to...

This seems to assume that you have access to each node before your training starts. However, a lot of cloud systems like AzureML, SLURM, and SageMaker do not let you follow...
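For illustration, a minimal sketch of the pattern those schedulers expect instead: each process discovers its rank from environment variables set at launch, rather than you pre-connecting to the nodes (the variable names below are SLURM's; the master address is assumed to be exported by the launch script):

```python
import os
import torch.distributed as dist

def init_from_scheduler_env() -> None:
    # On SLURM-like schedulers you cannot reach the nodes before the job
    # starts; each rank instead reads its coordinates from environment
    # variables the scheduler sets at launch time.
    rank = int(os.environ["SLURM_PROCID"])        # global rank of this process
    world_size = int(os.environ["SLURM_NTASKS"])  # total processes in the job
    master_addr = os.environ["MASTER_ADDR"]       # assumed exported by the launch script
    master_port = os.environ.get("MASTER_PORT", "29500")
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_addr}:{master_port}",
        rank=rank,
        world_size=world_size,
    )
```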