Abdullah Malik
This looks really interesting! Having TP (tensor parallelism) support like vLLM does would bring some great speed-ups!
Any updates on this? This seems like a great way to get a few more % for both ROCm and CUDA!
The command is `./llama-server -m /home/ultimis/LLM/Models/ggml-org/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -c 131072 -ngl 999 -b 2048 -ub 2048 -fa --reasoning-format none --jinja --chat-template-kwargs '{"reasoning_effort":"high"}' --host 0.0.0.0 --port 8081 -lv 1`. With `-lv 1`, it is spitting out...
Same issue after removing `--reasoning-format none`; the command is now: `./llama-server -m /home/ultimis/LLM/Models/ggml-org/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -c 131072 -ngl 999 -b 2048 -ub 2048 -fa --jinja --chat-template-kwargs '{"reasoning_effort":"high"}' --host 0.0.0.0 --port 8081 -lv 1`. Started...
Exact same issue on Vulkan: https://gist.github.com/AbdullahMPrograms/15e1ba6a43c26974e97f7a1b897bab2f
Given these missing kernels in hipBLASLt, does that mean this is not fixable? I've begun to notice this stalled-generation issue more and more while using GPT-OSS.
@ggerganov this fixes the issue for Vulkan! Vulkan is still not as performant as ROCm for text generation, but at least it works!
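For anyone else verifying this, a minimal rebuild sketch to pick up the fix with the Vulkan backend (this assumes a source checkout of llama.cpp and an installed Vulkan SDK; the build directory name is just a placeholder for your own setup):

```sh
# Reconfigure with the Vulkan backend enabled, then rebuild
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```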
Recompiling with `-DGGML_CUDA_FORCE_MMQ=ON`, however, has solved the issue for me. I have not yet done any speed testing, but performance seems comparable.
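For reference, a minimal sketch of that rebuild, assuming a CUDA source build of llama.cpp (the build directory and any extra flags are placeholders for your own configuration):

```sh
# Force the quantized matmul (MMQ) kernels instead of the cuBLAS paths
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON
cmake --build build --config Release -j
```

After rebuilding, the same `./llama-server` command from above can be rerun to check whether the stalled generation is gone.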
It's dual Xeon 4110s. I also tried MMTool 5.007 and it was the same thing. I see you were able to see the file names; can you confirm which version...