Abdullah Malik

23 comments of Abdullah Malik

This looks really interesting! Having tensor parallel (TP) support like vLLM does would bring some great speedups!
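For context, this is roughly what TP looks like on the vLLM side (a sketch; the model name is just an example):

```sh
# vLLM shards each layer's weights across GPUs via --tensor-parallel-size,
# so all GPUs work on the same token at once instead of splitting layers.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4   # split weights across 4 GPUs
```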

Any updates on this? This seems like a great way to get a few more percent of performance on both ROCm and CUDA!

The command is `./llama-server -m /home/ultimis/LLM/Models/ggml-org/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -c 131072 -ngl 999 -b 2048 -ub 2048 -fa --reasoning-format none --jinja --chat-template-kwargs '{"reasoning_effort":"high"}' --host 0.0.0.0 --port 8081 -lv 1`; `-lv 1` is spitting out...

Same issue. I removed `--reasoning-format none`; the command is now `./llama-server -m /home/ultimis/LLM/Models/ggml-org/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -c 131072 -ngl 999 -b 2048 -ub 2048 -fa --jinja --chat-template-kwargs '{"reasoning_effort":"high"}' --host 0.0.0.0 --port 8081 -lv 1` Started...
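For anyone trying to reproduce, a minimal request against the server works like this (llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint; the prompt here is just an example):

```sh
# Send a streaming chat request to the server started above (port 8081).
curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
```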

Exact same issue on Vulkan: https://gist.github.com/AbdullahMPrograms/15e1ba6a43c26974e97f7a1b897bab2f

Given these missing kernels in hipBLASLt, does that mean this is not fixable? I've been noticing this stalled-generation issue more and more while using GPT-OSS.

@ggerganov this fixes the issue for Vulkan! Vulkan is still not as performant as ROCm for text generation, but at least it works!

Recompiling with `-DGGML_CUDA_FORCE_MMQ=ON` has solved the issue for me, however. I haven't done any speed testing yet, but performance seems comparable.
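For anyone hitting the same thing, a sketch of the rebuild (assuming a standard CMake CUDA build; adjust paths and generator to your setup):

```sh
# Reconfigure with CUDA enabled and force the MMQ (quantized matmul) kernels,
# bypassing the cuBLAS/hipBLASLt path that appears to trigger the stall.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON
cmake --build build --config Release -j
```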

It's dual Xeon 4110s. I also tried MMTool 5.007 and got the same thing. I see you were able to view the file names; can you confirm which version...