Diego Devesa
It doesn't compile as-is because you have made the constant a local in `llama_model_load()`, which means it is not visible in `main()` where it is used. For...
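For context, a minimal sketch of the scoping issue (the constant's actual name is not shown above, so a hypothetical one is used): a constant defined inside one function is invisible to another, so it has to be hoisted to file scope.

```C++
// hypothetical constant, standing in for the one added in the change
static const int n_ctx_default = 512;

bool llama_model_load(/* ... */) {
    // visible here ...
    return n_ctx_default > 0;
}

int main() {
    // ... and visible here too, because it lives at file scope
    return n_ctx_default;
}
```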
Here is a suggestion: notice that the token is generated in main.cpp:~1003, in this line:
```C++
id = llama_sample_top_p_top_k(vocab, logits.data() + (logits.size() - n_vocab), last_n_tokens, repeat_penalty, top_k, top_p, temp, rng);
```
...
> Correct me if I'm wrong, but wouldn't this be getting very close to what the --ignore-eos argument does?

Not entirely: --ignore-eos prevents eos from being sampled at all in...
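To illustrate the distinction, one way to get the --ignore-eos behaviour is to mask the eos logit before the sampling call shown above, so it can never be picked. This is a sketch, not the exact implementation; `EOS_TOKEN_ID` is an assumption (eos is token 2 in the LLaMA vocab).

```C++
// mask the eos logit *before* sampling: with -INFINITY it can never win
const int EOS_TOKEN_ID = 2; // assumed eos token id
if (params.ignore_eos) {
    logits[logits.size() - n_vocab + EOS_TOKEN_ID] = -INFINITY;
}
id = llama_sample_top_p_top_k(vocab, logits.data() + (logits.size() - n_vocab),
                              last_n_tokens, repeat_penalty, top_k, top_p, temp, rng);
```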
Ah, I think you also need to add `is_interacting = true;` to force it to return control to the user.
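Concretely, something along these lines right after the token is sampled (a sketch; `EOS_TOKEN_ID` is again an assumed name for the eos token id):

```C++
// if the model emitted eos, flag interactive mode so the generation loop
// returns control to the user instead of terminating
if (id == EOS_TOKEN_ID) {
    is_interacting = true;
}
```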
The other approach allows it to generate one more token before returning control to the user (since the logic happens at the end of the loop). In some cases it...
I have suggested a change at https://github.com/rabidcopy/llama.cpp/pull/2 (please check it out @rabidcopy) that would return control to the user by injecting the anti-prompt instead, which should solve that problem. I...
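Roughly, the idea is this (a sketch, not the exact diff from that PR; `antiprompt_inp` stands for the tokenized anti-prompt):

```C++
// instead of letting eos through, inject the anti-prompt tokens as if the
// model had produced them; the existing anti-prompt detection then hands
// control back to the user at the right place
if (id == EOS_TOKEN_ID) {
    embd.insert(embd.end(), antiprompt_inp.begin(), antiprompt_inp.end());
} else {
    embd.push_back(id);
}
```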
A quick performance test shows significant improvement in the function itself (with k=4096):
```
Running ./test-dq
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB...
```
The first chunks of the perplexity computation show the same values. I didn't run the full test, but I have no reason to believe that it would produce different values....
It's a standalone test using the Google Benchmark library. Here is the code: https://gist.github.com/slaren/ba732ed08abd0ba148129eab3335dfb7 To benchmark both paths, I split the avx and scalar implementations into `dequantize_row_q4_0_avx2` and `dequantize_row_q4_0` beforehand.
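For anyone who wants to reproduce it, the harness boils down to something like this (a sketch: the function signatures are assumptions based on the ggml.c of the time, and the real test lives in the gist above):

```C++
#include <benchmark/benchmark.h>
#include <cstdint>
#include <vector>

// assumed signatures, matching dequantize_row_q4_0 in ggml.c at the time
extern "C" void dequantize_row_q4_0(const void * x, float * y, int k);
extern "C" void dequantize_row_q4_0_avx2(const void * x, float * y, int k);

static void BM_dequantize_scalar(benchmark::State & state) {
    const int k = 4096;
    std::vector<uint8_t> x(k); // over-allocated; a q4_0 row needs less than k bytes
    std::vector<float>   y(k);
    for (auto _ : state) {
        dequantize_row_q4_0(x.data(), y.data(), k);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_scalar);

static void BM_dequantize_avx2(benchmark::State & state) {
    const int k = 4096;
    std::vector<uint8_t> x(k);
    std::vector<float>   y(k);
    for (auto _ : state) {
        dequantize_row_q4_0_avx2(x.data(), y.data(), k);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_avx2);

BENCHMARK_MAIN();
```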
@ggerganov that's not what I am seeing; here is a stack trace, for example:
```
#2  0x00005555555660d4 in dequantize_row_q4_0 (x=0x7ffedc43a0d0, y=0x7ffe585b81a0, k=k@entry=4096) at ggml.c:767
#3  0x000055555556b1e7 in ggml_compute_forward_get_rows_q4_0 (params=<optimized out>, params=<optimized out>,...
```