Matvey Soloviev
Requesting review; it seems that it wouldn't actually let me review and merge myself (...even though I possibly could have merged locally and pushed straight to master?)
Thank you! @FNsi, is this the Q4_1 quantization you are testing? The filename says models/30B/ggml-model-q4_0.bin. How is the performance at 30B?
I was independently trying to do something similar on the Q4_1 code [here](https://github.com/ggerganov/llama.cpp/tree/q4_1_more_accel). I managed to squeeze out somewhere around 5% more performance by rearranging the SIMD math and avoiding...
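For context, the Q4_1 dot product can be rearranged so the inner loop is pure integer work, with the per-block scale and offset applied once per block. The following is a scalar sketch of that algebraic identity only, not the code in the linked branch; the `block_q4_1` layout and nibble packing shown here are assumptions:

```c
// Illustrative scalar sketch only -- not the code from the q4_1_more_accel branch.
// Q4_1 stores, per block of QK values, a scale d and an offset m, so that
// each dequantized value is d*q + m with q in [0, 15].
#include <stdint.h>
#include <stddef.h>

#define QK 32

typedef struct {
    float   d;             // scale
    float   m;             // offset (block minimum)
    uint8_t qs[QK / 2];    // 4-bit quants, two per byte (layout assumed)
} block_q4_1;

// Dot product of n values, rearranged so the inner loop is integer-only:
// sum((d_x*qx + m_x)*(d_y*qy + m_y))
//   = d_x*d_y*sum(qx*qy) + d_x*m_y*sum(qx) + m_x*d_y*sum(qy) + QK*m_x*m_y
static float dot_q4_1(const block_q4_1 *x, const block_q4_1 *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n / QK; ++i) {
        int sum_xy = 0, sum_x = 0, sum_y = 0;
        for (int j = 0; j < QK / 2; ++j) {
            const int x0 = x[i].qs[j] & 0x0F, x1 = x[i].qs[j] >> 4;
            const int y0 = y[i].qs[j] & 0x0F, y1 = y[i].qs[j] >> 4;
            sum_xy += x0 * y0 + x1 * y1;
            sum_x  += x0 + x1;
            sum_y  += y0 + y1;
        }
        acc += x[i].d * y[i].d * sum_xy
             + x[i].d * y[i].m * sum_x
             + x[i].m * y[i].d * sum_y
             + QK * x[i].m * y[i].m;
    }
    return acc;
}
```

The same identity carries over to a SIMD path: the three integer sums can be accumulated with vector instructions and the four floating-point terms folded in once per block.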
@ggerganov Thanks for paging me in! As far as I can see, there would be no conflict with anything I'm doing. I think this is a good change in terms...
I think it's good to not force interactive mode immediately (in fact that was how it worked when I first made the patch, but the logic seems to have changed...
@tjohnman Thanks! Wasn't meaning to imply you had anything to do with the removal - development has been moving quickly and chaotically, and it probably just fell by the wayside in...
I'm playing around with local search for the q4_1 parameters now, with something like the following approximately in place of the inner loop of `quantize_row_q4_1`: `round_block(pp, x + i*QK, ...`
I tried to run the previously mentioned Q4_1 quantization method with some number of local relaxation steps to reduce the squared error (down to 83% of the naive computation's error...
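As a rough illustration of the kind of local search described in the two comments above (the `round_block` helper is truncated in the excerpt, so all names here are hypothetical), one block of QK values can start from the naive min/delta choice and then take a few relaxation steps that nudge the parameters, keeping whatever lowers the squared reconstruction error:

```c
// Hypothetical sketch of a per-block local search; not the actual patch.
#include <math.h>
#include <float.h>

#define QK 32

// Squared reconstruction error of one block for a given (d, m),
// quantizing each value to the nearest q in [0, 15].
static float block_error(const float *x, float d, float m) {
    float err = 0.0f;
    for (int l = 0; l < QK; ++l) {
        int q = (int)roundf((x[l] - m) / d);
        if (q < 0)  q = 0;
        if (q > 15) q = 15;
        const float r = d * q + m - x[l];
        err += r * r;
    }
    return err;
}

// Start from the naive min/delta choice, then take a few relaxation steps:
// nudge d and m by a shrinking step and keep any change that lowers the error.
static void refine_block_params(const float *x, float *d_out, float *m_out, int steps) {
    float min = FLT_MAX, max = -FLT_MAX;
    for (int l = 0; l < QK; ++l) {
        if (x[l] < min) min = x[l];
        if (x[l] > max) max = x[l];
    }
    float d = (max - min) / 15.0f;
    float m = min;
    if (d == 0.0f) { *d_out = 1.0f; *m_out = m; return; }

    float best = block_error(x, d, m);
    float step = d * 0.5f;
    for (int it = 0; it < steps; ++it, step *= 0.5f) {
        const float cand_d[2] = { d + step, d - step };
        const float cand_m[2] = { m + step, m - step };
        for (int k = 0; k < 2; ++k) {
            float e = block_error(x, cand_d[k], m);
            if (e < best) { best = e; d = cand_d[k]; }
            e = block_error(x, d, cand_m[k]);
            if (e < best) { best = e; m = cand_m[k]; }
        }
    }
    *d_out = d;
    *m_out = m;
}
```

The number of relaxation steps is the knob: it trades extra quantization time against how much of the naive method's error can be recovered.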
@anzz1 The reverse prompt can span multiple tokens. However, there is no way for it to interrupt generation mid-token. (That's why I opted to use token vectors rather than strings...
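A minimal sketch of what matching a reverse prompt as a token vector looks like (illustrative only, not the actual llama.cpp implementation; the type alias is made up): at each token boundary, the tail of the generated token ids is compared against the reverse-prompt ids.

```c
// Illustrative sketch (not the actual llama.cpp code): with the reverse prompt
// kept as a vector of token ids, generation can only be interrupted at token
// boundaries -- we check whether the tail of the output matches the vector.
#include <stddef.h>
#include <stdbool.h>

typedef int llama_token_id; // hypothetical alias for a token id

static bool ends_with_reverse_prompt(const llama_token_id *output, size_t n_output,
                                     const llama_token_id *rev_prompt, size_t n_rev) {
    if (n_rev == 0 || n_rev > n_output) {
        return false;
    }
    for (size_t i = 0; i < n_rev; ++i) {
        if (output[n_output - n_rev + i] != rev_prompt[i]) {
            return false;
        }
    }
    return true;
}
```

Because the check runs only after a full token has been sampled, a reverse prompt can never stop generation earlier than the next token boundary.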
What directory are you running it from, and which directory did you move the json file (which one?) into? Have you run `make install`? Notekit should be capable of...