7 comments by Gustavo Rocha Dias

I've made the same observation as you. It's conceivable that the behavior we're noticing is related to how the shared memory interacts between the CPU and GPU. On my personal...

The easiest way is to use KoboldCpp instead of LlamaCpp. It exposes a minimal KoboldAI API that allows connecting with SillyTavern, and it's a fork of LlamaCpp that is constantly updated....
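To illustrate the connection described above, here is a minimal sketch of calling KoboldCpp's KoboldAI-compatible generate endpoint directly. The URL, port, and payload fields follow KoboldCpp's defaults at the time of writing, but the exact supported parameters depend on your KoboldCpp version, so treat this as an assumption to verify against your build.

```python
import json
from urllib import request

# Default KoboldCpp address; adjust host/port to match your launch flags.
KOBOLDCPP_URL = "http://localhost:5001/api/v1/generate"

def build_payload(prompt: str, max_length: int = 80) -> dict:
    # Field names follow the KoboldAI generate API that KoboldCpp emulates;
    # check your version's docs for the full parameter list.
    return {
        "prompt": prompt,
        "max_length": max_length,
        "temperature": 0.7,
    }

def generate(prompt: str) -> str:
    # POST the JSON payload and pull the generated text out of the response.
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = request.Request(
        KOBOLDCPP_URL,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["results"][0]["text"]
```

SillyTavern does the same kind of request under the hood once you point its KoboldAI connection at the KoboldCpp address.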

Had the same error; it seems the CPU RAM is not enough to load the model before sending it to the GPU.

More on this: recent koboldcpp build, Snapdragon 8 Gen 1, Termux. Every GGUF quant comes out garbled, K-quant or not, offloaded layers or not. GGML models work okay....

> > More on this: recent koboldcpp build, Snapdragon 8 Gen 1, Termux.
> >
> > Every GGUF quant comes out garbled, K-quant or not, offloaded layers or not....

Pointing to another thread discussing this topic: #7016

It's probably RAG for references. SillyTavern already uses RAG for memories via its ChromaDB extension, but this seems to be a different use of it.