llama.cpp
LLM inference in C/C++
In benchmark/benchmark-q4_0-matmult.c, set sizey=sizez=N, sizex=K:
```
For K=128, N=2,  the deviation is expected 1020.00,   got 1280.00
For K=128, N=32, the deviation is expected 262144.00, got 508160.03
For K=64,  N=32, the deviation is expected 131072.00, ...
```
I cannot use this code to fully utilize all CPUs. Based on PR #710: 1. Remove the finalizer 2. Use a technique similar to PR #850 3. Optimize the thread pool itself (see the sketch below)...
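A sketch of what "optimize the thread pool itself" could look like: a persistent pool whose workers park on a condition variable between tasks instead of being spawned and joined per call. Everything here is illustrative, not the ggml implementation:

```c++
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Workers are created once and reused, so per-call thread creation
// cost disappears and idle workers sleep instead of spinning.
class thread_pool {
public:
    explicit thread_pool(size_t n) {
        for (size_t i = 0; i < n; ++i) {
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx);
                        cv.wait(lock, [this] { return stop || !tasks.empty(); });
                        if (stop && tasks.empty()) {
                            return;
                        }
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task(); // run outside the lock
                }
            });
        }
    }

    void submit(std::function<void()> f) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.push(std::move(f));
        }
        cv.notify_one();
    }

    ~thread_pool() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;
        }
        cv.notify_all();
        for (auto & w : workers) {
            w.join();
        }
    }

private:
    std::vector<std::thread>          workers;
    std::queue<std::function<void()>> tasks;
    std::mutex                        mtx;
    std::condition_variable           cv;
    bool                              stop = false;
};
```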
`allclose` tests that all the floats in two tensors of identical size are within an epsilon error tolerance. See also: https://pytorch.org/docs/stable/generated/torch.allclose.html

```c++
bool allclose(ggml_tensor * a, ggml_tensor * b, f32...
```
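A minimal sketch of such a helper, assuming F32 tensors and the ggml accessors `ggml_nelements` and `ggml_get_f32_1d`; the error reporting is illustrative:

```c++
#include <cmath>
#include <cstdio>
#include "ggml.h"

// Sketch: true iff a and b have the same element count and every
// pair of elements differs by at most eps.
static bool allclose(ggml_tensor * a, ggml_tensor * b, float eps) {
    if (ggml_nelements(a) != ggml_nelements(b)) {
        return false;
    }
    const int n = (int) ggml_nelements(a);
    for (int i = 0; i < n; ++i) {
        const float va = ggml_get_f32_1d(a, i);
        const float vb = ggml_get_f32_1d(b, i);
        if (fabsf(va - vb) > eps) {
            fprintf(stderr, "allclose: mismatch at %d: %f vs %f\n", i, va, vb);
            return false;
        }
    }
    return true;
}
```

Note that PyTorch's `allclose` also folds in a relative tolerance (`|a - b| <= atol + rtol * |b|`), which is worth adopting when element magnitudes vary widely.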
The current `Q4_0` uses a single F32 floating-point scaling factor. An idea was proposed by @ikawrakow to change this to use 2x F16 factors instead of 1x F32: https://github.com/ggerganov/llama.cpp/commit/679e1cb6c01b16abe4f3ee3c849813b98970df93 Initial...
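For context, a sketch of the block layouts being compared; the field names and the role of the second F16 factor (an offset, as in `Q4_1`) are assumptions, not the committed layout:

```c++
#include <stdint.h>

typedef uint16_t ggml_fp16_t; // IEEE half-precision, as ggml stores it

// Current Q4_0: one F32 scale per block of 32 weights (4 + 16 = 20 bytes).
typedef struct {
    float   d;      // F32 scaling factor
    uint8_t qs[16]; // 32x 4-bit quantized values, two per byte
} block_q4_0;

// Proposed: two F16 factors in the same 4 bytes, e.g. a scale and an
// offset, giving an asymmetric quantization range at no storage cost.
typedef struct {
    ggml_fp16_t d;  // F16 scale
    ggml_fp16_t m;  // F16 second factor (assumed: an offset)
    uint8_t qs[16];
} block_q4_0_2xf16;
```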
It turns out that most LLM parameters are redundant; see https://aclanthology.org/2020.emnlp-main.398.pdf. The authors run the experiment with BERT and XLNet, and code for the pruning is provided. There's lots of room for improvement...
### Update

After seeing PR #835, I pushed some more changes that only affect the `Q4_0` results. I now get

```
rmse = 0.00185228
```

for the 7B model. Perplexity...
Addresses issue #920. Replaced static initialization of complex objects with initialization on first use. This prevents undefined behavior at program startup, for example, a crash in Release build, works...
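A sketch of the construct-on-first-use idiom presumably meant here; the object and names are illustrative:

```c++
#include <string>

// Risky: a non-trivial global may be read by another translation
// unit's static initializer before it is constructed (the "static
// initialization order fiasco"); the behavior is undefined.
// static std::string g_model_name = "llama";

// Safe: a function-local static is constructed on first call, once
// the program is already running, and is thread-safe since C++11.
static std::string & model_name() {
    static std::string name = "llama"; // constructed on first use
    return name;
}
```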
Calling `mmap.mmap` on Windows apparently resets the file offset of the raw file object (and makes the BufferedReader return a *negative* file offset). For safetensors, avoid using the file offset...
Hello! Help me figure this out:

```
F:\Models\digitous-Alpacino13b>convert.py --dump-single F:\Models\digitous-Alpacino13b\4bit.safetensors
Traceback (most recent call last):
  File "F:\Models\digitous-Alpacino13b\convert.py", line 1145, in <module>
    main()
  File "F:\Models\digitous-Alpacino13b\convert.py", line 1116, in main
    model_plus = lazy_load_file(args.model)
  File "F:\Models\digitous-Alpacino13b\convert.py", ...
```
Use `ggml_internal_get_quantize_fn` to loop through all quantization formats and run sanity checks on the implemented functions. They are run by ctest, but also accept a few command line parameters for...
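A sketch of the round-trip check, assuming the `quantize_fns_t` table returned by `ggml_internal_get_quantize_fn`; the field names and signatures shown are assumptions that may vary between ggml versions:

```c++
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include "ggml.h"

// For every implemented format: quantize a reference F32 row,
// dequantize it back, and report the worst round-trip error.
static void sanity_check_all(float max_err) {
    const int n = 256; // a multiple of every block size
    std::vector<float> src(n), dst(n);
    std::vector<uint8_t> buf(n * sizeof(float)); // generous scratch
    for (int i = 0; i < n; ++i) {
        src[i] = 0.1f * (i - n / 2);
    }
    for (int t = 0; t < GGML_TYPE_COUNT; ++t) {
        quantize_fns_t fns = ggml_internal_get_quantize_fn(t);
        if (!fns.quantize_row_q || !fns.dequantize_row_q) {
            continue; // not a quantized type / not implemented
        }
        fns.quantize_row_q(src.data(), buf.data(), n);
        fns.dequantize_row_q(buf.data(), dst.data(), n);
        double err = 0.0;
        for (int i = 0; i < n; ++i) {
            err = std::max(err, (double) std::fabs(src[i] - dst[i]));
        }
        printf("type %d: max round-trip error %g %s\n", t, err,
               err <= max_err ? "OK" : "FAILED");
    }
}
```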