Diego Devesa
We need a way to represent color spaces. It may be desirable to have color spaces defined independently of color formats, ideally allowing us to have color types defined as...
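Purely as a hypothetical illustration of that decoupling (none of these names come from the proposal), the color space could be one tag, the memory layout another, and a concrete color type just the pair:

```c
/* Hypothetical sketch: the color space (how values are interpreted) is kept
 * independent of the color format (how values are laid out in memory). */
typedef enum {
    COLOR_SPACE_SRGB,
    COLOR_SPACE_LINEAR_RGB,
    COLOR_SPACE_YCBCR_BT709,
} color_space;

typedef enum {
    COLOR_FORMAT_RGBA8,   /* 4 x uint8 */
    COLOR_FORMAT_RGBA32F, /* 4 x float */
} color_format;

/* A concrete color type is then the pair, so any space can be combined
 * with any layout without enumerating every combination. */
typedef struct {
    color_space  space;
    color_format format;
} color_type;
```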
In this case the llama.cpp tokenizer and the original LLaMA tokenizer produce different output:
```
main: prompt: 'This is 🦙.cpp'
main: number of tokens in prompt = 10
     1 -> ''
  4013...
```
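For reference, a minimal sketch of how a token dump like the one above can be produced on the llama.cpp side, assuming the early pre-GGUF C API (`llama_init_from_file`, `llama_tokenize`, `llama_token_to_str`); the model path and the 512-token buffer are placeholder assumptions:

```c
#include <stdio.h>
#include <stdbool.h>
#include "llama.h"

int main(void) {
    /* Placeholder model path; any compatible ggml model file works. */
    struct llama_context_params params = llama_context_default_params();
    struct llama_context * ctx = llama_init_from_file("models/7B/ggml-model-f16.bin", params);
    if (!ctx) return 1;

    const char * prompt = "This is 🦙.cpp";
    llama_token tokens[512];
    /* true = prepend BOS, matching the main example's behavior */
    const int n = llama_tokenize(ctx, prompt, tokens, 512, true);

    printf("number of tokens in prompt = %d\n", n);
    for (int i = 0; i < n; i++) {
        printf("%6d -> '%s'\n", tokens[i], llama_token_to_str(ctx, tokens[i]));
    }

    llama_free(ctx);
    return 0;
}
```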
Adds the --ignore-eos switch, which prevents generation of the end-of-text (EOS) token. This can be useful to avoid unexpected terminations in interactive mode and to force the model...
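Conceptually, a flag like this can be implemented as a logit ban: before sampling, the EOS logit is pushed to negative infinity so the token can never be selected. A minimal sketch with standalone names (not the actual llama.cpp code):

```c
#include <math.h>

/* Hypothetical helper: given the model's output logits, make the EOS token
 * unselectable by any sampling strategy (greedy, top-k, top-p, ...). */
static void ban_eos(float * logits, int n_vocab, int eos_token_id) {
    if (eos_token_id >= 0 && eos_token_id < n_vocab) {
        logits[eos_token_id] = -INFINITY; /* probability becomes 0 after softmax */
    }
}
```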
I didn't notice a big performance improvement; more testing is necessary.
This uses the GGML SIMD macros, so it should hopefully work on different architectures, but it has only been tested with AVX2. Don't expect any meaningful performance improvement; the function is not very...
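To illustrate the macro-based approach (with made-up macro names standing in for ggml's internal `GGML_F32_*` family), a single dot-product kernel body can target AVX2 or plain scalar code depending on how the macros expand:

```c
#include <stddef.h>

#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#define VEC            __m256
#define VEC_EPR        8                      /* elements per register */
#define VEC_ZERO()     _mm256_setzero_ps()
#define VEC_LOAD(p)    _mm256_loadu_ps(p)
#define VEC_FMA(a,x,y) _mm256_fmadd_ps((x), (y), (a))
static inline float vec_reduce(VEC v) {
    float tmp[8];
    _mm256_storeu_ps(tmp, v);
    return tmp[0]+tmp[1]+tmp[2]+tmp[3]+tmp[4]+tmp[5]+tmp[6]+tmp[7];
}
#else
/* Scalar fallback: the same macros degenerate to plain float ops. */
#define VEC            float
#define VEC_EPR        1
#define VEC_ZERO()     0.0f
#define VEC_LOAD(p)    (*(p))
#define VEC_FMA(a,x,y) ((a) + (x) * (y))
static inline float vec_reduce(VEC v) { return v; }
#endif

/* One kernel body, compiled to SIMD or scalar depending on the target. */
float vec_dot_f32(size_t n, const float * x, const float * y) {
    VEC acc = VEC_ZERO();
    size_t i = 0;
    for (; i + VEC_EPR <= n; i += VEC_EPR) {
        acc = VEC_FMA(acc, VEC_LOAD(x + i), VEC_LOAD(y + i));
    }
    float sum = vec_reduce(acc);
    for (; i < n; i++) sum += x[i] * y[i]; /* leftover elements */
    return sum;
}
```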
Largely based on the AVX2 implementation of quantize_row_q4_0.
```
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2...
```
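For context, here is a scalar reference of the Q4_0 quantization that the AVX2 code vectorizes. The block layout (one float scale plus 32 packed 4-bit values) follows ggml's Q4_0 format of that period as I understand it; this standalone version is only a sketch:

```c
#include <math.h>
#include <stdint.h>

#define QK 32  /* values per quantization block */

typedef struct {
    float   d;          /* scale */
    uint8_t qs[QK / 2]; /* 32 nibbles, two 4-bit values per byte */
} block_q4_0;

/* Scalar sketch of Q4_0 quantization: per block, scale by the absolute
 * maximum, round to [-7, 7], bias by 8 and pack two values per byte. */
static void quantize_row_q4_0_ref(const float * x, block_q4_0 * y, int k) {
    const int nb = k / QK; /* k is assumed to be a multiple of QK */
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f;
        for (int l = 0; l < QK; l++) {
            const float v = fabsf(x[i*QK + l]);
            if (v > amax) amax = v;
        }
        const float d  = amax / 7.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int l = 0; l < QK; l += 2) {
            const uint8_t v0 = (uint8_t)(roundf(x[i*QK + l + 0] * id) + 8.0f);
            const uint8_t v1 = (uint8_t)(roundf(x[i*QK + l + 1] * id) + 8.0f);
            y[i].qs[l/2] = v0 | (v1 << 4);
        }
    }
}
```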
This change allows applying LoRA adapters on the fly without having to duplicate the model files.

Instructions:
- Obtain the HF PEFT LoRA files `adapter_config.json` and `adapter_model.bin` of a LoRA...
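For intuition, applying a LoRA adapter amounts to adding a low-rank update to each affected weight matrix, W' = W + (alpha/r)·B·A, as in HF PEFT. A naive standalone sketch of that update (not the actual loader code; names and the row-major layout are assumptions):

```c
/* Naive sketch of a LoRA weight update: W (n_out x n_in) is patched in
 * place with the low-rank product of B (n_out x r) and A (r x n_in),
 * scaled by alpha / r. Row-major layout is assumed. */
static void apply_lora(float * W, const float * A, const float * B,
                       int n_out, int n_in, int r, float alpha) {
    const float scale = alpha / (float) r;
    for (int i = 0; i < n_out; i++) {
        for (int j = 0; j < n_in; j++) {
            float acc = 0.0f;
            for (int k = 0; k < r; k++) {
                acc += B[i*r + k] * A[k*n_in + j];
            }
            W[i*n_in + j] += scale * acc;
        }
    }
}
```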
Adds support for NVIDIA cuBLAS for batched operations. On my system this is significantly faster than OpenBLAS. Build with `LLAMA_CUBLAS`:
```
make clean && LLAMA_CUBLAS=1 make
```
Perplexity seconds per...
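The speedup comes from routing the large matrix multiplications through cuBLAS. As a rough illustration of the kind of GEMM being offloaded (a sketch, not llama.cpp's actual dispatch code; sizes are arbitrary and the matrices are simply treated as column-major, which is what cuBLAS expects):

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Minimal sketch: C = A * B on the GPU via cublasSgemm. */
int main(void) {
    const int m = 4, n = 4, k = 4;
    const float alpha = 1.0f, beta = 0.0f;

    float ha[16], hb[16], hc[16];
    for (int i = 0; i < 16; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, sizeof(ha));
    cudaMalloc(&db, sizeof(hb));
    cudaMalloc(&dc, sizeof(hc));
    cudaMemcpy(da, ha, sizeof(ha), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, sizeof(hb), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, da, m, db, k, &beta, dc, m);
    cublasDestroy(handle);

    cudaMemcpy(hc, dc, sizeof(hc), cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]); /* expect 8.0 for these inputs */

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```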
Reduces overall LoRA loading times significantly when using a different base model with `--lora-base`, from 32s to 24s in my test case. It also seems to improve the general performance...
For me this makes cuBLAS about twice as fast with quantized models.

Perplexity seconds per pass

| Model | PR | Master |
| --- | --- | --- |
...