Mark Schmidt
> For a version that only uses the stdlib:
>
> ```
> #include
> #include
> #include
> #include
> #include
>
> // Perform Cholesky decomposition on a...
> ```
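The quoted snippet above is incomplete; for reference, a stdlib-only Cholesky routine in C generally looks something like the following sketch (function name, matrix size, and example values are illustrative, not from the original comment):

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Cholesky-Banachiewicz: factor a symmetric positive-definite n x n matrix A
// (row-major) into L * L^T, writing the lower-triangular factor into `out`.
// Returns 0 on success, -1 if the matrix is not positive definite.
static int cholesky(const double *a, double *out, int n) {
    memset(out, 0, (size_t)n * n * sizeof(double));
    for (int i = 0; i < n; i++) {
        for (int j = 0; j <= i; j++) {
            double sum = 0.0;
            for (int k = 0; k < j; k++)
                sum += out[i * n + k] * out[j * n + k];
            if (i == j) {
                double d = a[i * n + i] - sum;
                if (d <= 0.0) return -1;   // not positive definite
                out[i * n + j] = sqrt(d);
            } else {
                out[i * n + j] = (a[i * n + j] - sum) / out[j * n + j];
            }
        }
    }
    return 0;
}

int main(void) {
    // Small symmetric positive-definite example; expected L rows:
    // (2 0 0), (6 1 0), (-8 5 3).
    const double a[9] = {  4,  12, -16,
                          12,  37, -43,
                         -16, -43,  98 };
    double l[9];
    if (cholesky(a, l, 3) != 0) {
        fprintf(stderr, "matrix is not positive definite\n");
        return EXIT_FAILURE;
    }
    for (int i = 0; i < 3; i++)
        printf("%8.3f %8.3f %8.3f\n", l[i * 3], l[i * 3 + 1], l[i * 3 + 2]);
    return 0;
}
```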
You need to follow the Windows-specific GPTQ 4bit compilation instructions in this issue on GPTQ-for-LLaMa: https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/11#issuecomment-1462643016
I set up a very basic one-click free colab web demo of ChatGLM in case anyone is itching to try it: https://colab.research.google.com/github/MarkSchmidty/ChatGLM-6B-Int4-Web-Demo/blob/main/ChatGLM-6B_int4_Web_Demo.ipynb
> The 7B will run on a single GPU, but the other models _require_ multiple. The LLaMA sample code also really wants a lot of VRAM (16GB seems to be...
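Back-of-the-envelope on that VRAM figure (assuming the cut-off number refers to a minimum for the unquantized 7B): 7B parameters at fp16 are roughly 7e9 × 2 bytes ≈ 13 GiB of weights alone, before activations and the KV cache, so a 16GB card being about the floor is consistent with the arithmetic.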
LLaMA-int8 implementation: https://github.com/tloen/llama-int8
LLaMA CPU implementation: https://github.com/markasoftware/llama-cpu
LLaMA torrent download Magnet link: https://github.com/facebookresearch/llama/pull/73/files
This project appears to have 4-bit working for LLaMA: https://github.com/qwopqwop200/GPTQ-for-LLaMa
May be helpful here.
> The pre-quantized 4bit llama is working without flexgen but I think perf suffers a bunch. Wonder if flexgen with 8-bit mode is better/faster? Looks like it still doesn't support...
> > 1.2 tokens/s on a Samsung S22 Ultra running 4 threads.
>
> The S22 obviously has a...
> So if anyone like me was wondering, does having a million cores in a server CPU give you a 65B model? It's clear by now that llama.cpp speed mostly depends...
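If the point being made in the cut-off sentence is about memory bandwidth, here is a back-of-the-envelope sketch of why piling on cores stops helping; the bandwidth and model-size numbers below are illustrative assumptions, not measurements:

```c
#include <stdio.h>

// Single-stream token generation is roughly memory-bandwidth bound: each new
// token has to stream (approximately) all of the weights from RAM once, so
//   tokens/s <= memory_bandwidth / model_size
// no matter how many cores are doing the math.
int main(void) {
    double model_size_gb = 40.0;  // illustrative: ~65B weights at ~4-bit
    double bandwidth_gbs = 50.0;  // illustrative: dual-channel DDR4 desktop
    printf("rough ceiling: %.1f tokens/s\n", bandwidth_gbs / model_size_gb);
    return 0;
}
```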
> Just curious, why was llama.cpp invented when you can run the models on onnxruntime with CPU backend? Could someone make a comparison of relative performance at the same quantization...