Mark Schmidt

Results: 95 comments of Mark Schmidt

> For a version that only uses the stdlib:
>
> ```
> #include
> #include
> #include
> #include
> #include
>
> // Perform Cholesky decomposition on a...
> ```

You need to follow the Windows-specific GPTQ 4-bit compilation instructions in this issue on GPTQ-for-LLaMa: https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/11#issuecomment-1462643016

I set up a very basic, free one-click Colab web demo of ChatGLM in case anyone is itching to try it: [![Launch In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MarkSchmidty/ChatGLM-6B-Int4-Web-Demo/blob/main/ChatGLM-6B_int4_Web_Demo.ipynb)
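For anyone who would rather run it locally than in Colab, here is a minimal sketch of loading the same int4 checkpoint with Hugging Face `transformers` (it assumes the `THUDM/chatglm-6b-int4` weights from the Hub, a CUDA GPU, and that `transformers` plus the model's own dependencies are installed):

```python
# Minimal local equivalent of the Colab demo: load the int4 ChatGLM-6B
# checkpoint and ask it one question. trust_remote_code is needed because
# the checkpoint ships its own modeling/chat code.
from transformers import AutoModel, AutoTokenizer

model_id = "THUDM/chatglm-6b-int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).half().cuda()
model = model.eval()

# chat() is the helper exposed by ChatGLM's custom modeling code.
response, history = model.chat(tokenizer, "Hello, what can you do?", history=[])
print(response)
```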

> The 7B will run on a single GPU, but the other models _require_ multiple. The LLaMA sample code also really wants a lot of VRAM (16GB seems to be...

LLaMA-int8 implementation: https://github.com/tloen/llama-int8
LLaMA CPU implementation: https://github.com/markasoftware/llama-cpu
LLaMA torrent download (magnet link): https://github.com/facebookresearch/llama/pull/73/files

This project appears to have 4-bit quantization working for LLaMA: https://github.com/qwopqwop200/GPTQ-for-LLaMa. It may be helpful here.
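For anyone unfamiliar with what "4-bit weights" actually means, a toy sketch of group-wise round-to-nearest quantization is below. This is not the GPTQ algorithm itself (GPTQ additionally compensates quantization error using second-order statistics of the layer inputs); it only illustrates the storage format of 4-bit integers plus a per-group scale and zero point that shrinks the weights to roughly half a byte each.

```python
# Toy 4-bit group quantization (round-to-nearest), for illustration only.
# Real GPTQ picks the 4-bit values more cleverly to minimize layer output error.
import numpy as np

def quantize_4bit_groups(w: np.ndarray, group_size: int = 128):
    """Quantize a flat weight array to 4-bit ints with one scale/zero per group."""
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / 15.0, 1e-8)   # 4 bits -> 16 levels
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale.astype(np.float16), zero.astype(np.uint8)

def dequantize(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale.astype(np.float32)

w = np.random.randn(4096 * 128).astype(np.float32)
q, s, z = quantize_4bit_groups(w)
print("mean abs error:", np.abs(dequantize(q, s, z).ravel() - w).mean())
```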

> The pre-quantized 4bit llama is working without flexgen but I think perf suffers a bunch. Wonder if flexgen with 8-bit mode is better/faster? Looks like it still doesn't support...

> ![llama.cpp on Samsung S22 Ultra at 1.2 tokens per second](https://user-images.githubusercontent.com/5949853/224798872-d3a1e9d8-d0ce-4261-b1a8-247c2a154a9f.png)
>
> 1.2 tokens/s on a Samsung S22 Ultra running 4 threads.
>
> The S22 obviously has a...

> So if anyone like me was wondering, does having a million cores in a server CPU give you a 65B model? It's clear by now that llama.cpp speed mostly depends...
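Assuming the point here is the usual one, that generation is memory-bandwidth-bound (every generated token streams essentially the whole weight file through the CPU), a back-of-the-envelope sketch shows why core count alone doesn't help much. The sizes and bandwidth figures below are rough assumptions, not measurements:

```python
# Rough upper bound on generation speed when inference is memory-bandwidth-bound:
# each new token reads (approximately) every weight once, so
#   tokens/s <= memory bandwidth / model size.
# All figures below are illustrative assumptions, not benchmarks.

def max_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_gb

configs = [
    ("65B q4_0 (~39 GB), dual-channel DDR4 (~50 GB/s)", 39, 50),
    ("65B q4_0 (~39 GB), 8-channel server DDR5 (~300 GB/s)", 39, 300),
    ("7B q4_0 (~4 GB), phone LPDDR5 (~40 GB/s)", 4, 40),
]

for name, size_gb, bw in configs:
    print(f"{name}: <= {max_tokens_per_second(size_gb, bw):.1f} tokens/s")
```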

> Just curious, why was llama.cpp invented when you can run the models on onnxruntime with CPU backend? Could someone make a comparison of relative performance at the same quantization...
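If someone does want to run that comparison, here is a minimal timing sketch for the onnxruntime side, pinned to the CPU execution provider. The model path and the dummy input feed are placeholders that would have to match however the LLaMA graph was actually exported and quantized; llama.cpp's side of the comparison can come from its own printed timings.

```python
# Crude per-run timing on onnxruntime's CPU execution provider.
# "model.onnx" and the auto-generated dummy inputs are placeholders; they must
# be replaced to match the exported graph, so treat this as a template only.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Build dummy inputs from the graph metadata, substituting 1 for dynamic dims.
feed = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    dtype = np.int64 if "int" in inp.type else np.float32
    feed[inp.name] = np.zeros(shape, dtype=dtype)

sess.run(None, feed)                      # warm-up
n = 20
start = time.perf_counter()
for _ in range(n):
    sess.run(None, feed)
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.2f} runs/s on the CPU execution provider")
```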