llama.cpp
LLM inference in C/C++
In benchmark/benchmark-q4_0-matmult.c, set sizey=sizez=N, sizex=K:
```
For K=128, N=2,  the deviation is expected 1020.00,   got 1280.00
For K=128, N=32, the deviation is expected 262144.00, got 508160.03
For K=64,  N=32, the deviation is expected 131072.00, ...
```
I cannot use this code to fully utilize all CPUs. Based on PR #710: 1. Remove the finalizer 2. Use a technique similar to PR #850 3. Optimize the thread pool itself (see the sketch below)...
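A sketch of what "optimize the thread pool itself" could look like: a persistent pool whose workers park on a condition variable between tasks instead of being spawned and joined per call. Everything here is illustrative, not the ggml implementation:

```c++
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Workers are created once and reused, so per-call thread creation
// cost disappears and idle workers sleep instead of spinning.
class thread_pool {
public:
    explicit thread_pool(size_t n) {
        for (size_t i = 0; i < n; ++i) {
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx);
                        cv.wait(lock, [this] { return stop || !tasks.empty(); });
                        if (stop && tasks.empty()) {
                            return;
                        }
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task(); // run outside the lock
                }
            });
        }
    }

    void submit(std::function<void()> f) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.push(std::move(f));
        }
        cv.notify_one();
    }

    ~thread_pool() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;
        }
        cv.notify_all();
        for (auto & w : workers) {
            w.join();
        }
    }

private:
    std::vector<std::thread>          workers;
    std::queue<std::function<void()>> tasks;
    std::mutex                        mtx;
    std::condition_variable           cv;
    bool                              stop = false;
};
```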
`allclose` tests that all the floats in two tensors of identical size are within an epsilon error tolerance. See also: https://pytorch.org/docs/stable/generated/torch.allclose.html

```c++
bool allclose(ggml_tensor * a, ggml_tensor * b, f32...
```
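A minimal sketch of such a helper, assuming F32 tensors and the ggml accessors `ggml_nelements` and `ggml_get_f32_1d`; the error reporting is illustrative:

```c++
#include <cmath>
#include <cstdio>
#include "ggml.h"

// Sketch: true iff a and b have the same element count and every
// pair of elements differs by at most eps.
static bool allclose(ggml_tensor * a, ggml_tensor * b, float eps) {
    if (ggml_nelements(a) != ggml_nelements(b)) {
        return false;
    }
    const int n = (int) ggml_nelements(a);
    for (int i = 0; i < n; ++i) {
        const float va = ggml_get_f32_1d(a, i);
        const float vb = ggml_get_f32_1d(b, i);
        if (fabsf(va - vb) > eps) {
            fprintf(stderr, "allclose: mismatch at %d: %f vs %f\n", i, va, vb);
            return false;
        }
    }
    return true;
}
```

Note that PyTorch's `allclose` also folds in a relative tolerance (`|a - b| <= atol + rtol * |b|`), which is worth adopting when element magnitudes vary widely.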
The current `Q4_0` uses a single F32 floating-point scaling factor. An idea was proposed by @ikawrakow to change this to use 2x F16 factors instead of 1x F32: https://github.com/ggerganov/llama.cpp/commit/679e1cb6c01b16abe4f3ee3c849813b98970df93 Initial...
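For context, a sketch of the block layouts being compared; the field names and the role of the second F16 factor (an offset, as in `Q4_1`) are assumptions, not the committed layout:

```c++
#include <stdint.h>

typedef uint16_t ggml_fp16_t; // IEEE half-precision, as ggml stores it

// Current Q4_0: one F32 scale per block of 32 weights (4 + 16 = 20 bytes).
typedef struct {
    float   d;      // F32 scaling factor
    uint8_t qs[16]; // 32x 4-bit quantized values, two per byte
} block_q4_0;

// Proposed: two F16 factors in the same 4 bytes, e.g. a scale and an
// offset, giving an asymmetric quantization range at no storage cost.
typedef struct {
    ggml_fp16_t d;  // F16 scale
    ggml_fp16_t m;  // F16 second factor (assumed: an offset)
    uint8_t qs[16];
} block_q4_0_2xf16;
```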
It turns out that most LLM parameters are redundant; see https://aclanthology.org/2020.emnlp-main.398.pdf. The authors run the experiment with BERT and XLNet, and code for the pruning is provided. There's lots of room for improvement...
### Update

After seeing PR #835, I pushed some more changes that only affect the `Q4_0` results. I now get

```
rmse = 0.00185228
```

for the 7B model. Perplexity...
Addresses issue #920. Replaced static initialization of complex objects with initialization on first use. This prevents undefined behavior at program startup, for example, a crash in Release build, works...
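A sketch of the construct-on-first-use idiom presumably meant here; the object and names are illustrative:

```c++
#include <string>

// Risky: a non-trivial global may be read by another translation
// unit's static initializer before it is constructed (the "static
// initialization order fiasco"); the behavior is undefined.
// static std::string g_model_name = "llama";

// Safe: a function-local static is constructed on first call, once
// the program is already running, and is thread-safe since C++11.
static std::string & model_name() {
    static std::string name = "llama"; // constructed on first use
    return name;
}
```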
Calling `mmap.mmap` on Windows apparently resets the file offset of the raw file object (and makes the BufferedReader return a *negative* file offset). For safetensors, avoid using the file offset...
Hello! Help me figure this out:

```
F:\Models\digitous-Alpacino13b>convert.py --dump-single F:\Models\digitous-Alpacino13b\4bit.safetensors
Traceback (most recent call last):
  File "F:\Models\digitous-Alpacino13b\convert.py", line 1145, in <module>
    main()
  File "F:\Models\digitous-Alpacino13b\convert.py", line 1116, in main
    model_plus = lazy_load_file(args.model)
  File "F:\Models\digitous-Alpacino13b\convert.py", ...
```
Use `ggml_internal_get_quantize_fn` to loop through all quantization formats and run sanity checks on the implemented functions. They are run by ctest, but also accept a few command line parameters for...
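A sketch of the round-trip check, assuming the `quantize_fns_t` table returned by `ggml_internal_get_quantize_fn`; the field names and signatures shown are assumptions that may vary between ggml versions:

```c++
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include "ggml.h"

// For every implemented format: quantize a reference F32 row,
// dequantize it back, and report the worst round-trip error.
static void sanity_check_all(float max_err) {
    const int n = 256; // a multiple of every block size
    std::vector<float> src(n), dst(n);
    std::vector<uint8_t> buf(n * sizeof(float)); // generous scratch
    for (int i = 0; i < n; ++i) {
        src[i] = 0.1f * (i - n / 2);
    }
    for (int t = 0; t < GGML_TYPE_COUNT; ++t) {
        quantize_fns_t fns = ggml_internal_get_quantize_fn(t);
        if (!fns.quantize_row_q || !fns.dequantize_row_q) {
            continue; // not a quantized type / not implemented
        }
        fns.quantize_row_q(src.data(), buf.data(), n);
        fns.dequantize_row_q(buf.data(), dst.data(), n);
        double err = 0.0;
        for (int i = 0; i < n; ++i) {
            err = std::max(err, (double) std::fabs(src[i] - dst[i]));
        }
        printf("type %d: max round-trip error %g %s\n", t, err,
               err <= max_err ? "OK" : "FAILED");
    }
}
```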