Diego Devesa
We need a way to represent color spaces. It may be desirable to have color spaces defined independently of color formats, ideally allowing us to have color types defined as...
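Purely as a hypothetical illustration of that decoupling (none of these names come from the proposal), the color space could be one tag, the memory layout another, and a concrete color type just the pair:

```c
/* Hypothetical sketch: the color space (how values are interpreted) is kept
 * independent of the color format (how values are laid out in memory). */
typedef enum {
    COLOR_SPACE_SRGB,
    COLOR_SPACE_LINEAR_RGB,
    COLOR_SPACE_YCBCR_BT709,
} color_space;

typedef enum {
    COLOR_FORMAT_RGBA8,   /* 4 x uint8 */
    COLOR_FORMAT_RGBA32F, /* 4 x float */
} color_format;

/* A concrete color type is then the pair, so any space can be combined
 * with any layout without enumerating every combination. */
typedef struct {
    color_space  space;
    color_format format;
} color_type;
```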
In this case the llama.cpp tokenizer and the original LLaMA tokenizer produce different output:
```
main: prompt: 'This is 🦙.cpp'
main: number of tokens in prompt = 10
     1 -> ''
  4013...
```
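For reference, a minimal sketch of how a token dump like the one above can be produced on the llama.cpp side, assuming the early pre-GGUF C API (`llama_init_from_file`, `llama_tokenize`, `llama_token_to_str`); the model path and the 512-token buffer are placeholder assumptions:

```c
#include <stdio.h>
#include <stdbool.h>
#include "llama.h"

int main(void) {
    /* Placeholder model path; any compatible ggml model file works. */
    struct llama_context_params params = llama_context_default_params();
    struct llama_context * ctx = llama_init_from_file("models/7B/ggml-model-f16.bin", params);
    if (!ctx) return 1;

    const char * prompt = "This is 🦙.cpp";
    llama_token tokens[512];
    /* true = prepend BOS, matching the main example's behavior */
    const int n = llama_tokenize(ctx, prompt, tokens, 512, true);

    printf("number of tokens in prompt = %d\n", n);
    for (int i = 0; i < n; i++) {
        printf("%6d -> '%s'\n", tokens[i], llama_token_to_str(ctx, tokens[i]));
    }

    llama_free(ctx);
    return 0;
}
```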
Adds the --ignore-eos switch, which prevents generation of the end-of-text (EOS) token. This can be useful to avoid unexpected terminations in interactive mode and to force the model...
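Conceptually, a flag like this can be implemented as a logit ban: before sampling, the EOS logit is pushed to negative infinity so the token can never be selected. A minimal sketch with standalone names (not the actual llama.cpp code):

```c
#include <math.h>

/* Hypothetical helper: given the model's output logits, make the EOS token
 * unselectable by any sampling strategy (greedy, top-k, top-p, ...). */
static void ban_eos(float * logits, int n_vocab, int eos_token_id) {
    if (eos_token_id >= 0 && eos_token_id < n_vocab) {
        logits[eos_token_id] = -INFINITY; /* probability becomes 0 after softmax */
    }
}
```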
I didn't notice a big performance improvement; more testing is necessary.
This uses the GGML SIMD macros, so it should hopefully work on different architectures, but it has only been tested with AVX2. Don't expect any meaningful performance improvement; the function is not very...
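To illustrate the macro-based approach (with made-up macro names standing in for ggml's internal `GGML_F32_*` family), a single dot-product kernel body can target AVX2 or plain scalar code depending on how the macros expand:

```c
#include <stddef.h>

#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#define VEC            __m256
#define VEC_EPR        8                      /* elements per register */
#define VEC_ZERO()     _mm256_setzero_ps()
#define VEC_LOAD(p)    _mm256_loadu_ps(p)
#define VEC_FMA(a,x,y) _mm256_fmadd_ps((x), (y), (a))
static inline float vec_reduce(VEC v) {
    float tmp[8];
    _mm256_storeu_ps(tmp, v);
    return tmp[0]+tmp[1]+tmp[2]+tmp[3]+tmp[4]+tmp[5]+tmp[6]+tmp[7];
}
#else
/* Scalar fallback: the same macros degenerate to plain float ops. */
#define VEC            float
#define VEC_EPR        1
#define VEC_ZERO()     0.0f
#define VEC_LOAD(p)    (*(p))
#define VEC_FMA(a,x,y) ((a) + (x) * (y))
static inline float vec_reduce(VEC v) { return v; }
#endif

/* One kernel body, compiled to SIMD or scalar depending on the target. */
float vec_dot_f32(size_t n, const float * x, const float * y) {
    VEC acc = VEC_ZERO();
    size_t i = 0;
    for (; i + VEC_EPR <= n; i += VEC_EPR) {
        acc = VEC_FMA(acc, VEC_LOAD(x + i), VEC_LOAD(y + i));
    }
    float sum = vec_reduce(acc);
    for (; i < n; i++) sum += x[i] * y[i]; /* leftover elements */
    return sum;
}
```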
Largely based on the AVX2 implementation of quantize_row_q4_0.
```
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2...
```
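For context, here is a scalar reference of the Q4_0 quantization that the AVX2 code vectorizes. The block layout (one float scale plus 32 packed 4-bit values) follows ggml's Q4_0 format of that period as I understand it; this standalone version is only a sketch:

```c
#include <math.h>
#include <stdint.h>

#define QK 32  /* values per quantization block */

typedef struct {
    float   d;          /* scale */
    uint8_t qs[QK / 2]; /* 32 nibbles, two 4-bit values per byte */
} block_q4_0;

/* Scalar sketch of Q4_0 quantization: per block, scale by the absolute
 * maximum, round to [-7, 7], bias by 8 and pack two values per byte. */
static void quantize_row_q4_0_ref(const float * x, block_q4_0 * y, int k) {
    const int nb = k / QK; /* k is assumed to be a multiple of QK */
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f;
        for (int l = 0; l < QK; l++) {
            const float v = fabsf(x[i*QK + l]);
            if (v > amax) amax = v;
        }
        const float d  = amax / 7.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int l = 0; l < QK; l += 2) {
            const uint8_t v0 = (uint8_t)(roundf(x[i*QK + l + 0] * id) + 8.0f);
            const uint8_t v1 = (uint8_t)(roundf(x[i*QK + l + 1] * id) + 8.0f);
            y[i].qs[l/2] = v0 | (v1 << 4);
        }
    }
}
```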
This change allows applying LoRA adapters on the fly without having to duplicate the model files.

Instructions:
- Obtain the HF PEFT LoRA files `adapter_config.json` and `adapter_model.bin` of a LoRA...
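For intuition, applying a LoRA adapter amounts to adding a low-rank update to each affected weight matrix, W' = W + (alpha/r)·B·A, as in HF PEFT. A naive standalone sketch of that update (not the actual loader code; names and the row-major layout are assumptions):

```c
/* Naive sketch of a LoRA weight update: W (n_out x n_in) is patched in
 * place with the low-rank product of B (n_out x r) and A (r x n_in),
 * scaled by alpha / r. Row-major layout is assumed. */
static void apply_lora(float * W, const float * A, const float * B,
                       int n_out, int n_in, int r, float alpha) {
    const float scale = alpha / (float) r;
    for (int i = 0; i < n_out; i++) {
        for (int j = 0; j < n_in; j++) {
            float acc = 0.0f;
            for (int k = 0; k < r; k++) {
                acc += B[i*r + k] * A[k*n_in + j];
            }
            W[i*n_in + j] += scale * acc;
        }
    }
}
```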
Adds support for NVIDIA cuBLAS for batched operations. On my system this is significantly faster than OpenBLAS. Build with `LLAMA_CUBLAS`:
```
make clean && LLAMA_CUBLAS=1 make
```
Perplexity seconds per...
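The speedup comes from routing the large matrix multiplications through cuBLAS. As a rough illustration of the kind of GEMM being offloaded (a sketch, not llama.cpp's actual dispatch code; sizes are arbitrary and the matrices are simply treated as column-major, which is what cuBLAS expects):

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Minimal sketch: C = A * B on the GPU via cublasSgemm. */
int main(void) {
    const int m = 4, n = 4, k = 4;
    const float alpha = 1.0f, beta = 0.0f;

    float ha[16], hb[16], hc[16];
    for (int i = 0; i < 16; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, sizeof(ha));
    cudaMalloc(&db, sizeof(hb));
    cudaMalloc(&dc, sizeof(hc));
    cudaMemcpy(da, ha, sizeof(ha), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, sizeof(hb), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, da, m, db, k, &beta, dc, m);
    cublasDestroy(handle);

    cudaMemcpy(hc, dc, sizeof(hc), cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]); /* expect 8.0 for these inputs */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```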
Reduces overall LoRA loading times significantly when using a different base model with `--lora-base`, from 32s to 24s in my test case. It also seems to improve the general performance...
For me this makes cuBLAS about twice as fast with quantized models.

Perplexity seconds per pass

| Model | PR | Master |
| --- | --- | --- |
...