Akarshan Biswas
Hopefully, my prompt-progress implementation will solve it, especially with a low batch size for running on weak CPUs. If it doesn't, this issue can be revisited. I am pushing...
I think this issue has been resolved, as we now have configurable timeout settings in llama.cpp. Please let us know if it still doesn't work. Closing it for now.
Needs UI design; the backend side is done.
> I believe the SYCL Q4_0 reorder optimizations resulted in this as setting GGML_SYCL_DISABLE_OPT=1 allowed things to run normally again cc @Rbiessy @NeoZhangJianyu @Alcpz ^
Just to confirm, gemma2's window size is hard-coded, right?
9B-IT is working great and now I can increase the ctx size. :)
Just to mention here: when I was converting the HF gemma2 to bf16 GGUF, I noticed that the norm tensors were converted to fp16 instead of being copied directly from...
This sounds like an issue when calling the set_tensor function inside ggml-vulkan.cpp. However, the message is very cryptic and doesn't provide much information. My suggestion is to run with VK_LOG_DEBUG=1 set...
For non-streaming responses, the reasoning is in choices[0].message.reasoning_content in the DeepSeek format. For streaming responses, it's in choices[0].delta.reasoning_content. This structure depends on the model's chat template; not all reasoning models do this...
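As a rough illustration (not from the original thread), here is a minimal sketch of reading that field from an OpenAI-compatible endpoint. The server URL, model name, and the assumption that the backend emits DeepSeek-style reasoning_content are mine:

```ts
// Minimal sketch (Node 18+). Assumptions: an OpenAI-compatible server at
// localhost:8080 and a model that emits DeepSeek-style reasoning_content.
const URL = "http://localhost:8080/v1/chat/completions";
const body = (stream: boolean) =>
  JSON.stringify({
    model: "deepseek-r1", // hypothetical model name
    messages: [{ role: "user", content: "Why is the sky blue?" }],
    stream,
  });

async function nonStreaming(): Promise<void> {
  const res = await fetch(URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: body(false),
  });
  const data = await res.json();
  // Non-streaming: reasoning sits next to the regular content on `message`.
  console.log(data.choices?.[0]?.message?.reasoning_content);
  console.log(data.choices?.[0]?.message?.content);
}

async function streaming(): Promise<void> {
  const res = await fetch(URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: body(true),
  });
  // Streaming: each SSE chunk carries a delta; reasoning arrives as
  // choices[0].delta.reasoning_content before the answer tokens.
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep any partial line for the next chunk
    for (const line of lines) {
      if (!line.startsWith("data: ") || line.includes("[DONE]")) continue;
      const delta = JSON.parse(line.slice("data: ".length)).choices?.[0]?.delta ?? {};
      if (delta.reasoning_content) process.stdout.write(delta.reasoning_content);
      if (delta.content) process.stdout.write(delta.content);
    }
  }
}

nonStreaming().then(streaming).catch(console.error);
```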
Currently, import() in the new llamacpp extension does something similar: the model file stays in its original location. We could introduce a recursive 'model folder' import() to achieve something similar to what has...
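As a rough sketch of what a recursive 'model folder' import could look like (the names and types here are hypothetical, not the extension's actual API), the idea is to walk the folder, find every .gguf file, and register it by its existing path instead of copying it:

```ts
import { readdir } from "node:fs/promises";
import { join, extname } from "node:path";

// Hypothetical shape of an imported model entry; the real extension's type will differ.
interface ImportedModel {
  id: string;
  path: string; // original location on disk, never copied
}

// Recursively walk `dir` and register every .gguf file in place.
async function importModelFolder(dir: string): Promise<ImportedModel[]> {
  const found: ImportedModel[] = [];
  const entries = await readdir(dir, { withFileTypes: true });
  for (const entry of entries) {
    const fullPath = join(dir, entry.name);
    if (entry.isDirectory()) {
      // Recurse so nested model directories are picked up too.
      found.push(...(await importModelFolder(fullPath)));
    } else if (extname(entry.name).toLowerCase() === ".gguf") {
      found.push({ id: entry.name, path: fullPath });
    }
  }
  return found;
}

// Example usage: importModelFolder("/path/to/models").then(console.log);
```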