GPULlama3.java

RMS normalization kernel optimization by fusing the reduction kernel and context mapping kernel

Open · yrq0208 opened this issue 1 month ago · 4 comments

The main optimization is to use all the threads, instead of only the first global thread, to calculate the scaling factor. This avoids thread divergence and the need to synchronize threads across work groups: the context mapping already uses all threads rather than just global thread 0, and cross-work-group synchronization is difficult to implement in OpenCL since it involves atomic operations and locks (though it may be doable in CUDA with grid sync). As a result, the reduction kernel and the context mapping kernel can be merged, reducing kernel launch overhead.
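For readers unfamiliar with the change, here is a minimal sketch of what the fused step could look like using TornadoVM's KernelContext API. It is not the actual kernel from this PR; the class, method, and buffer names (FusedRmsNormSketch, fusedRmsNormApply, partialSums, numGroups, eps) are illustrative assumptions, and it presumes a preceding kernel has already produced one partial sum of squares per work group.

```java
import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class FusedRmsNormSketch {

    /**
     * Illustrative fused "final reduction + context mapping" step (not the PR's actual code).
     *
     * A preceding kernel is assumed to have written one partial sum of squares
     * per work group into partialSums (numGroups entries). Every thread then
     * redundantly sums that small array and computes the scaling factor itself,
     * so there is no "only global thread 0 does the work" branch, no divergence,
     * and no cross-work-group synchronization. The normalization is applied in
     * the same kernel, saving one kernel launch.
     */
    public static void fusedRmsNormApply(KernelContext context,
                                         FloatArray x,           // input activations
                                         FloatArray weight,      // RMSNorm weights
                                         FloatArray out,         // normalized output
                                         FloatArray partialSums, // one partial sum per work group
                                         int numGroups,
                                         int size,
                                         float eps) {
        int i = context.globalIdx;
        if (i >= size) {
            return;
        }

        // Every thread walks the per-work-group partials (a handful of entries)
        // instead of waiting for global thread 0 to do it.
        float sumOfSquares = 0.0f;
        for (int g = 0; g < numGroups; g++) {
            sumOfSquares += partialSums.get(g);
        }

        // RMSNorm scaling factor: 1 / sqrt(mean(x^2) + eps).
        float scale = 1.0f / (float) Math.sqrt(sumOfSquares / size + eps);

        // Context mapping: apply the scaling factor element-wise.
        out.set(i, weight.get(i) * (scale * x.get(i)));
    }
}
```

The redundant summation over partialSums is cheap because the number of work groups is small relative to the hidden size, so this trades a little duplicated arithmetic for one fewer kernel launch and no special-casing of global thread 0.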

Tested with the following models: beehive-llama-3.2-1b-instruct-fp16.gguf, beehive-llama-3.2-3b-instruct-fp16.gguf, DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf, Qwen2.5-0.5B-Instruct-f16.gguf, qwen2.5-1.5b-instruct-fp16.gguf, Qwen3-0.6B-f16.gguf, Qwen3-1.7B-f16.gguf, Qwen3-4B-f16.gguf. The 7B and 8B models were not tested due to hardware limitations (the 5080 has only 16 GB of VRAM).

End-to-end (E2E) performance improvement, in tok/s, ranges from 5% to 22% on a 5080 mobile GPU with the TornadoVM OpenCL backend.

yrq0208 · Dec 01 '25

CLA assistant check
All committers have signed the CLA.

CLAassistant · Dec 01 '25

\rerun

mikepapadim · Dec 01 '25

Current issues:
OpenCL: Qwen3-4B-f16.gguf, Mistral-7B-Instruct-v0.3.fp16.gguf, Phi-3-mini-4k-instruct-fp16.gguf, Phi-3-mini-4k-instruct-Q8_0.gguf, Mistral-7B-Instruct-v0.3.Q8_0.gguf
PTX: Qwen3-4B-f16.gguf, Phi-3-mini-4k-instruct-fp16.gguf, Qwen3-0.6B-Q8_0.gguf, Phi-3-mini-4k-instruct-Q8_0.gguf

yrq0208 · Dec 01 '25

Update: I have tested both the FP16 and Q8 models with both the OpenCL and PTX backends. The only models I haven't tested are the llama3.2 and qwen3 8B FP16 models, since there isn't enough VRAM on the 5080 for them.

I am unable to reproduce the error/gibberish output from the Mistral 7B FP16/Q8 models (they work on my side), and I am also unable to reproduce the gibberish output from the Phi3 Q8 model (it works on my side). The Phi3 FP16 model still seems bugged in the baseline, without my modifications. I am using the latest TornadoVM and Tornado llama builds.

yrq0208 · Dec 02 '25

The gibberish output from the qwen3 4B FP16 and Q8 models seems inconsistent on my side: sometimes I can reproduce it, sometimes I cannot. I need to take a closer look.

yrq0208 · Dec 02 '25

Looks like most of the CI checks are fine apart from Phi-3-mini-4k-instruct-fp16.gguf running with PTX? Though this seems to be an ongoing issue with the baseline (https://github.com/beehive-lab/GPULlama3.java/actions/runs/19747335165/job/56616800315?pr=75)?

yrq0208 · Dec 10 '25

\rerun

yrq0208 · Dec 15 '25

/rerun all

mikepapadim · Dec 15 '25

🚀 Workflow rerun started

Mode: all · Triggered by: @mikepapadim

View Actions

github-actions[bot] · Dec 15 '25

Workflow rerun success

View Actions

github-actions[bot] · Dec 15 '25

Can you remove the external/tornadovm path? This directory has been removed.

done.

yrq0208 · Dec 15 '25