jon-chuang
In `benchmark/benchmark-q4_0-matmult.c`, set `sizey = sizez = N`, `sizex = K`:

```
For K=128, N=2,  the deviation is expected 1020.00,   got 1280.00
For K=128, N=32, the deviation is expected 262144.00, got 508160.03
For K=64,  N=32, the deviation is expected 131072.00,...
```
`allclose` tests that all the floats in two tensors of identical size are within an epsilon error tolerance. See also: https://pytorch.org/docs/stable/generated/torch.allclose.html

```c++
bool allclose(ggml_tensor * a, ggml_tensor * b, f32...
```
Here are some outstanding issues for LoRA:

- [x] Base implementation (https://github.com/ggerganov/llama.cpp/pull/820)
- [ ] Improve LoRA application time with SIMD (AVX, AVX2) (https://github.com/ggerganov/llama.cpp/issues/956)
- [ ] Improve LoRA loading...
Fixes: https://github.com/ggerganov/llama.cpp/issues/932

Hyperthreading hurts performance here, probably because we are compute-bound (not memory-bound), so sibling logical cores contend for the same execution units. See also: https://github.com/ggerganov/llama.cpp/issues/34

Notes: I consulted GPT-4 in the making of this PR.
**Context:** [LoRA](https://arxiv.org/abs/2106.09685) requires computing $W' = W + \Delta W$, where $\Delta W = BA^T$ and $A, B$ are tall and skinny. See https://github.com/ggerganov/llama.cpp/pull/820 for the use case.

**Problem:** The estimated FLOPs...
Preliminary results show that `llama.cpp` is 1.5x-2x _slower_ than `llama-rs`. Both were verified to compile with the same arch flags and to use the same GNU toolchain. Summary (on `Vicuna...
For instance, should llama.cpp:

1. support an embedded vector-similarity knowledge base?
2. support other models for multimodality (similar to GPT-4)? (See e.g. [CLIP-based](https://github.com/facebookresearch/llama/issues/258).)

It doesn't have to be part of the...
https://github.com/ggerganov/llama.cpp/blob/8b679987cdce292ff36bd741f6715e4927e26f9b/llama.cpp#L1558 is currently single-threaded. Quantization is quite slow (Vicuna 7B: 65156.31 ms; Vicuna 13B: 129902.48 ms).
With https://github.com/ggerganov/llama.cpp/pull/820, the loaded model may differ from the base model. It makes sense to be able to interactively export the currently loaded model to a binfile, especially if one...
Once https://github.com/ggerganov/llama.cpp/pull/820 is merged, it would be nice to allow linearly interpolating one or more LoRAs. LoRAs should be loadable interactively, with interpolation weights also adjustable interactively.