Diego Devesa
It doesn't compile as-is because you have made the constant a local in `llama_model_load()`, which means it is not visible in `main()` where it is used. For...
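For context, a minimal sketch of the scoping issue (the constant's actual name is not shown above, so a hypothetical one is used): a constant defined inside one function is invisible to another, so it has to be hoisted to file scope.

```C++
// hypothetical constant, standing in for the one added in the change
static const int n_ctx_default = 512;

bool llama_model_load(/* ... */) {
    // visible here ...
    return n_ctx_default > 0;
}

int main() {
    // ... and visible here too, because it lives at file scope
    return n_ctx_default;
}
```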
Here is a suggestion: notice that the token is generated in main.cpp:~1003, in this line:
```C++
id = llama_sample_top_p_top_k(vocab, logits.data() + (logits.size() - n_vocab), last_n_tokens, repeat_penalty, top_k, top_p, temp, rng);
```
...
> Correct me if I'm wrong, but wouldn't this be getting very close to what the --ignore-eos argument does?

Not entirely: --ignore-eos prevents eos from being sampled at all in...
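To illustrate the distinction, one way to get the --ignore-eos behaviour is to mask the eos logit before the sampling call shown above, so it can never be picked. This is a sketch, not the exact implementation; `EOS_TOKEN_ID` is an assumption (eos is token 2 in the LLaMA vocab).

```C++
// mask the eos logit *before* sampling: with -INFINITY it can never win
const int EOS_TOKEN_ID = 2; // assumed eos token id
if (params.ignore_eos) {
    logits[logits.size() - n_vocab + EOS_TOKEN_ID] = -INFINITY;
}
id = llama_sample_top_p_top_k(vocab, logits.data() + (logits.size() - n_vocab),
                              last_n_tokens, repeat_penalty, top_k, top_p, temp, rng);
```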
Ah, I think you also need to add `is_interacting = true;` to force it to return control to the user.
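Concretely, something along these lines right after the token is sampled (a sketch; `EOS_TOKEN_ID` is again an assumed name for the eos token id):

```C++
// if the model emitted eos, flag interactive mode so the generation loop
// returns control to the user instead of terminating
if (id == EOS_TOKEN_ID) {
    is_interacting = true;
}
```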
The other approach allows it to generate one more token before returning control to the user (since the logic happens at the end of the loop). In some cases it...
I have suggested a change at https://github.com/rabidcopy/llama.cpp/pull/2 (please check it out @rabidcopy) that would return control to the user by injecting the anti-prompt instead, which should solve that problem. I...
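Roughly, the idea is this (a sketch, not the exact diff from that PR; `antiprompt_inp` stands for the tokenized anti-prompt):

```C++
// instead of letting eos through, inject the anti-prompt tokens as if the
// model had produced them; the existing anti-prompt detection then hands
// control back to the user at the right place
if (id == EOS_TOKEN_ID) {
    embd.insert(embd.end(), antiprompt_inp.begin(), antiprompt_inp.end());
} else {
    embd.push_back(id);
}
```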
A quick performance test shows significant improvement in the function itself (with k=4096):
```
Running ./test-dq
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB...
```
The first chunks of the perplexity computation show the same values. I didn't run the full test, but I have no reason to believe that it would produce different values....
It's a standalone test using the Google Benchmark library. Here is the code: https://gist.github.com/slaren/ba732ed08abd0ba148129eab3335dfb7 To benchmark both paths, I split the avx and scalar implementations into `dequantize_row_q4_0_avx2` and `dequantize_row_q4_0` beforehand.
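For anyone who wants to reproduce it, the harness boils down to something like this (a sketch: the function signatures are assumptions based on the ggml.c of the time, and the real test lives in the gist above):

```C++
#include <benchmark/benchmark.h>
#include <cstdint>
#include <vector>

// assumed signatures, matching dequantize_row_q4_0 in ggml.c at the time
extern "C" void dequantize_row_q4_0(const void * x, float * y, int k);
extern "C" void dequantize_row_q4_0_avx2(const void * x, float * y, int k);

static void BM_dequantize_scalar(benchmark::State & state) {
    const int k = 4096;
    std::vector<uint8_t> x(k); // over-allocated; a q4_0 row needs less than k bytes
    std::vector<float>   y(k);
    for (auto _ : state) {
        dequantize_row_q4_0(x.data(), y.data(), k);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_scalar);

static void BM_dequantize_avx2(benchmark::State & state) {
    const int k = 4096;
    std::vector<uint8_t> x(k);
    std::vector<float>   y(k);
    for (auto _ : state) {
        dequantize_row_q4_0_avx2(x.data(), y.data(), k);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_avx2);

BENCHMARK_MAIN();
```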
@ggerganov that's not what I am seeing; here is a stack trace, for example:
```
#2  0x00005555555660d4 in dequantize_row_q4_0 (x=0x7ffedc43a0d0, y=0x7ffe585b81a0, k=k@entry=4096) at ggml.c:767
#3  0x000055555556b1e7 in ggml_compute_forward_get_rows_q4_0 (params=<optimized out>, params=<optimized out>,...
```