Casper


ExLlama is not the same as GPTQ. ExLlama is a library built specifically around the Llama model, with a ton of optimizations. It just chose to use GPTQ for quantization....

> I know that exllama has some optimizations on CUDA cores. In fact, I mainly want to know: if awq uses GPU optimization technology, will the performance of awq...

It seems TinyChat is currently very CPU-bound for all models other than LLaMa. On an A6000, 3090, or 4090 with an AMD EPYC 7-Series CPU, performance is largely the same due to low...

> TGI

That sounds great! :) The only thing to keep in mind is that TGI has recently switched its license, so be careful if you plan to use their code. Edit:...

XGen is a LLaMa-based model and is already supported. Posting an easy-to-use script here for you to get started. Note that this assumes you have:
1. built AWQ
2. `pip...
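Since the script itself is truncated here, below is a minimal sketch of what such a quantization script typically looks like using the AutoAWQ-style API; the model path, output path, and `quant_config` values are illustrative assumptions, not the exact script from the original comment:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Assumed paths and settings, for illustration only
model_path = "Salesforce/xgen-7b-8k-base"
quant_path = "xgen-7b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ quantization, then save the INT4 weights and tokenizer
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```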

Yes, every time you load the model, it replaces some layers with their custom layers that use the CUDA kernels. Only the quantized layers run in INT4 while the rest...
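To illustrate the mechanism (the names here are hypothetical, not the library's actual API): loading walks the module tree and swaps each `nn.Linear` for a quantized drop-in backed by the CUDA kernels, roughly like:

```python
import torch.nn as nn

def replace_with_quant_layers(module: nn.Module, quant_cls) -> None:
    # Recursively swap every nn.Linear for a quantized replacement.
    # `quant_cls.from_linear` is a hypothetical constructor that packs
    # the FP16 weights into the INT4 format the CUDA kernels expect.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, quant_cls.from_linear(child))
        else:
            replace_with_quant_layers(child, quant_cls)
```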

> Do you have a guess on why it happens? Why do you need to run the search on GPU?

It may be due to different precision on GPU/CPU...
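As a toy illustration of how precision alone can shift results (not the actual search code):

```python
import torch

x = torch.randn(1 << 16)
# The same reduction accumulated in different precisions gives slightly
# different answers; on a real model, these small numeric differences can
# nudge a scale search toward a different optimum on GPU vs. CPU.
print(x.double().sum().item())  # FP64 reference
print(x.float().sum().item())   # FP32, as typically used on CPU
print(x.half().sum().item())    # FP16, as typically used on GPU
```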

Marlin used a different method for measuring perplexity, so the two numbers can't be compared, unfortunately.
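For context, here is one common recipe (a sketch, not Marlin's exact code); tools differ in window length, stride/overlap, and evaluation dataset, and each of those choices shifts the absolute number:

```python
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor, ctx_len: int = 2048) -> float:
    # Evaluate in fixed, non-overlapping windows (one of several recipes).
    nlls, n_tokens = [], 0
    for i in range(0, input_ids.size(1), ctx_len):
        chunk = input_ids[:, i : i + ctx_len]
        if chunk.size(1) < 2:
            break  # need at least one prediction target
        out = model(chunk, labels=chunk)  # HF-style causal LM: loss is mean NLL
        n = chunk.size(1) - 1             # labels are shifted, so n-1 predictions
        nlls.append(out.loss * n)
        n_tokens += n
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()
```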

FYI, Devin (the closed-source one) is fully integrated with VS Code. That means it has access to all the extensions, linting, and debugging tools. I think this is something to...

This seems to assume that you have access to each node before your training starts. However, a lot of cloud systems like AzureML, SLURM, and SageMaker do not let you follow...
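For illustration, a minimal sketch of the pattern those schedulers expect instead: each process discovers its rank from environment variables set at launch, rather than you pre-connecting to the nodes (the variable names below are SLURM's; the master address is assumed to be exported by the launch script):

```python
import os
import torch.distributed as dist

def init_from_scheduler_env() -> None:
    # On SLURM-like schedulers you cannot reach the nodes before the job
    # starts; each rank instead reads its coordinates from environment
    # variables the scheduler sets at launch time.
    rank = int(os.environ["SLURM_PROCID"])        # global rank of this process
    world_size = int(os.environ["SLURM_NTASKS"])  # total processes in the job
    master_addr = os.environ["MASTER_ADDR"]       # assumed exported by the launch script
    master_port = os.environ.get("MASTER_PORT", "29500")
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_addr}:{master_port}",
        rank=rank,
        world_size=world_size,
    )
```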