GPULlama3.java

RMS normalization kernel optimization by fusing the reduction kernel and context mapping kernel

Open · yrq0208 opened this issue 1 month ago · 4 comments

The main optimization is to use all the threads, instead of only the first global thread, to calculate the scaling factor. This avoids thread divergence and the need to synchronize threads across work groups: the context mapping already uses all threads rather than just global thread 0, and cross-work-group synchronization is difficult to implement in OpenCL since it involves atomic operations and locks (though it may be doable in CUDA with grid sync). As a result, the reduction kernel and the context mapping kernel can be merged, reducing kernel launch overhead.
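For readers unfamiliar with the change, here is a minimal sketch of what the fused step could look like using TornadoVM's KernelContext API. It is not the actual kernel from this PR; the class, method, and buffer names (FusedRmsNormSketch, fusedRmsNormApply, partialSums, numGroups, eps) are illustrative assumptions, and it presumes a preceding kernel has already produced one partial sum of squares per work group.

```java
import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class FusedRmsNormSketch {

    /**
     * Illustrative fused "final reduction + context mapping" step (not the PR's actual code).
     *
     * A preceding kernel is assumed to have written one partial sum of squares
     * per work group into partialSums (numGroups entries). Every thread then
     * redundantly sums that small array and computes the scaling factor itself,
     * so there is no "only global thread 0 does the work" branch, no divergence,
     * and no cross-work-group synchronization. The normalization is applied in
     * the same kernel, saving one kernel launch.
     */
    public static void fusedRmsNormApply(KernelContext context,
                                         FloatArray x,           // input activations
                                         FloatArray weight,      // RMSNorm weights
                                         FloatArray out,         // normalized output
                                         FloatArray partialSums, // one partial sum per work group
                                         int numGroups,
                                         int size,
                                         float eps) {
        int i = context.globalIdx;
        if (i >= size) {
            return;
        }

        // Every thread walks the per-work-group partials (a handful of entries)
        // instead of waiting for global thread 0 to do it.
        float sumOfSquares = 0.0f;
        for (int g = 0; g < numGroups; g++) {
            sumOfSquares += partialSums.get(g);
        }

        // RMSNorm scaling factor: 1 / sqrt(mean(x^2) + eps).
        float scale = 1.0f / (float) Math.sqrt(sumOfSquares / size + eps);

        // Context mapping: apply the scaling factor element-wise.
        out.set(i, weight.get(i) * (scale * x.get(i)));
    }
}
```

The redundant summation over partialSums is cheap because the number of work groups is small relative to the hidden size, so this trades a little duplicated arithmetic for one fewer kernel launch and no special-casing of global thread 0.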

Tested with the following models: beehive-llama-3.2-1b-instruct-fp16.gguf, beehive-llama-3.2-3b-instruct-fp16.gguf, DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf, Qwen2.5-0.5B-Instruct-f16.gguf, qwen2.5-1.5b-instruct-fp16.gguf, Qwen3-0.6B-f16.gguf, Qwen3-1.7B-f16.gguf, Qwen3-4B-f16.gguf. The 7B and 8B models were not tested due to hardware limitations (the 5080 has only 16 GB of VRAM).

End-to-end (E2E) performance improvement, in tok/s, ranges from 5% to 22% on a 5080 mobile GPU with the TornadoVM OpenCL backend.

yrq0208 · Dec 01 '25

CLA assistant check
All committers have signed the CLA.

CLAassistant · Dec 01 '25

\rerun

mikepapadim · Dec 01 '25

Current issues:
OpenCL: Qwen3-4B-f16.gguf, Mistral-7B-Instruct-v0.3.fp16.gguf, Phi-3-mini-4k-instruct-fp16.gguf, Phi-3-mini-4k-instruct-Q8_0.gguf, Mistral-7B-Instruct-v0.3.Q8_0.gguf
PTX: Qwen3-4B-f16.gguf, Phi-3-mini-4k-instruct-fp16.gguf, Qwen3-0.6B-Q8_0.gguf, Phi-3-mini-4k-instruct-Q8_0.gguf

yrq0208 · Dec 01 '25

Update: I have tested both the FP16 and Q8 models with both the OpenCL and PTX backends. The only models I haven't tested are the llama3.2 and qwen3 8B FP16 models, since there isn't enough VRAM on the 5080 for them.

I am unable to reproduce the error/gibberish output from the Mistral 7B FP16/Q8 models (they work on my side), and I am also unable to reproduce the gibberish output from the Phi3 Q8 model (it works on my side). The Phi3 FP16 model still seems bugged in the baseline, without my modifications. I am using the latest TornadoVM and Tornado llama builds.

yrq0208 · Dec 02 '25

The gibberish output from the qwen3 4B FP16 and Q8 models seems inconsistent on my side: sometimes I can reproduce it, sometimes I cannot. I need to take a closer look.

yrq0208 · Dec 02 '25

Looks like most of the CI checks are fine apart from Phi-3-mini-4k-instruct-fp16.gguf running with PTX? Though this seems to be an ongoing issue with the baseline (https://github.com/beehive-lab/GPULlama3.java/actions/runs/19747335165/job/56616800315?pr=75)?

yrq0208 · Dec 10 '25

\rerun

yrq0208 · Dec 15 '25

/rerun all

mikepapadim · Dec 15 '25

🚀 Workflow rerun started

Mode: all · Triggered by: @mikepapadim

View Actions

github-actions[bot] · Dec 15 '25

Workflow rerun success

View Actions

github-actions[bot] · Dec 15 '25

Can you remove the external/tornadovm path? This directory has been removed.

done.

yrq0208 · Dec 15 '25