Jaap Buurman

Results: 101 comments by Jaap Buurman

I mean I am more than happy to close it, as it doesn't really impact me since I will run with flash attention if possible. But isn't this still a...

It's also happening for me on a 7900XTX running on ROCm. I have also tried -ngl 0 (i.e., CPU only) and FA enabled/disabled, but all with the same result. Interestingly, the...

Example: ![Image](https://github.com/user-attachments/assets/5769d3a2-f1d6-487e-b84d-ae2b05704d66)

Experiencing the same with the 32b model

I was able to solve the issue by increasing the `num_ctx` parameter. Apparently when the context size is exceeded, the model starts spitting out stuff that looks like training data,...
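
For reference, a minimal sketch of how one might raise `num_ctx` through Ollama's standard HTTP API; the model name, prompt, and context value below are placeholders, not taken from the thread:

```python
import requests

# Sketch: bump num_ctx so long prompts don't overflow the default
# context window, which is what triggers the garbage output described
# above. Model name and num_ctx value here are illustrative only.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:32b",        # placeholder model
        "prompt": "Hello, how are you?",
        "stream": False,
        "options": {"num_ctx": 8192},  # raise the context size
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```

In an interactive `ollama run` session, the same parameter can be set with `/set parameter num_ctx 8192`.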

Ollama pre-release 0.4.0 is available here: https://github.com/ollama/ollama/releases/tag/v0.4.0-rc3 The thing that caught my eye was the following statement: which includes improved vision model caching, model reliability, caching and **stop token detection**...

I am getting the same error with this command: `ollama run hf.co/bartowski/Replete-LLM-V2.5-Qwen-32b-GGUF:IQ4_NL` The IQ4_NL quant does exist in the repo, though, and is a valid, standard quant option: https://huggingface.co/bartowski/Replete-LLM-V2.5-Qwen-32b-GGUF...
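
As a quick sanity check that the quant file really is present in the repo, a short sketch using `huggingface_hub` (a tool not mentioned in the thread) would be:

```python
from huggingface_hub import list_repo_files

# Sketch: list the files in the repo and check that an IQ4_NL quant is
# among them; the repo id is taken from the ollama command above.
files = list_repo_files("bartowski/Replete-LLM-V2.5-Qwen-32b-GGUF")
iq4_nl = [f for f in files if "IQ4_NL" in f]
print(iq4_nl)  # non-empty => the quant exists, so the pull error is on ollama's side
```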

Not sure, is it? I have opened an issue about it here: https://github.com/ollama/ollama/issues/7365 Someone else is having the same issue. IQ4_NL is a quant that should be supported, but it...

Actually, it might not be related to MoE models, but to gpt-oss-120b specifically (either the model architecture or its special quant). If I run Qwen3-30B-A3B q8_0 I get the following...

> [@Mushoz](https://github.com/Mushoz) could you check if [#15363](https://github.com/ggml-org/llama.cpp/pull/15363) give you better speed? I see the same negative scaling at batch sizes 2 & 3, and overall performance is ever so slightly lower...