Diego Devesa
Currently it has some limitations that I suspect would prevent using it with backends that only implement a small subset of operations. However, it was designed from the start...
Can you share specific instructions to reproduce the `nan` issue? Ideally with the smallest model that you are aware of that has the issue. Or run a `git bisect` to find...
I was able to reproduce the issue with `wikitext-2-raw/wiki.test.raw` at chunk 9, with a command such as: ``` ./imatrix -ofreq 20 -t 4 -ngl 20 -mg 0 -m models/BagelMix-8x7B.Q8_0.gguf -o imatrix-bagelmix.dat -f wikitext-2-raw/wiki.test.raw...
This does not happen outside of interactive mode, so I don't think this is right. This would mean 1.5 GB more memory usage for a 2048 context size, so it...
Increasing the batch size also makes llama.cpp run out of memory, so any solution that only considers the context size and not the batch size is likely wrong.
Thinking more about this, does it really matter what the initial value of `buf_size` is, as long as it is big enough for the first dummy call to `llama_eval`? This...
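To make the idea concrete, here is a minimal, self-contained sketch of the measure-then-grow pattern being described. The `bytes_used()` helper, the per-token cost, and the batch size are hypothetical stand-ins, not the actual `llama_eval`/ggml code:

```c
// Sketch only: the initial buf_size just has to cover a 1-token dummy eval;
// after measuring the per-token cost, the buffer is grown for real batches.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint8_t * buf      = NULL;
static size_t    buf_size = 0;

// Hypothetical placeholder for "how much memory did an eval over n_tokens use"
// (roughly what ggml_used_mem() reports after a real eval).
static size_t bytes_used(int n_tokens) {
    return (size_t) n_tokens * 4u * 1024 * 1024; // assumption: ~4 MiB per token
}

int main(void) {
    // 1. small initial allocation, only needs to be enough for the dummy call
    buf_size = bytes_used(1);
    buf      = malloc(buf_size);

    // 2. dummy eval with a single token to measure the per-token cost
    size_t mem_per_token = bytes_used(1);

    // 3. grow the buffer for the real batch size before doing actual evals
    int    n_batch = 512;
    size_t needed  = mem_per_token * n_batch;
    if (needed > buf_size) {
        buf      = realloc(buf, needed);
        buf_size = needed;
    }

    printf("buf_size after warm-up: %zu bytes\n", buf_size);
    free(buf);
    return 0;
}
```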
I think this is a good change, but I am concerned that we won't be able to change QK without breaking backwards compatibility, and if we ever want to support...
Right, I was thinking of @qwopqwop200's implementation, which showed some benefits from using a group size of 128 (if I understood that correctly). But as you say, that is a...
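For context on why changing QK is a backwards-compatibility problem: the quantization block layout, and therefore the on-disk layout of every quantized tensor, is defined in terms of QK. The sketch below shows the rough shape of a 4-bit block; the exact fields are illustrative rather than the actual ggml definitions:

```c
#include <stdint.h>
#include <stdio.h>

#define QK 32                       // elements per quantization block

// Rough shape of a 4-bit quantization block; fields are illustrative.
typedef struct {
    float   d;                      // per-block scale
    uint8_t qs[QK / 2];             // QK 4-bit quants, packed two per byte
} block_q4_0;

int main(void) {
    // The stored size of every quantized tensor is a multiple of this block
    // size, so changing QK (e.g. to a group size of 128) changes the file
    // layout and older readers can no longer parse existing models.
    printf("bytes per block of %d weights: %zu\n", QK, sizeof(block_q4_0));
    return 0;
}
```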
This originally comes from https://github.com/ggerganov/llama.cpp/commit/8f8c28e89cb9531211783da697d6e7c445e2af1d. My guess is that it was done this way due to the directory structure of the original llama-2 distribution files.
https://rocm.docs.amd.com/projects/radeon/en/latest/docs/limitations.html