
Add Q4_0 quantization support for all models in TornadoVM path

Open · mikepapadim opened this issue 3 weeks ago

Implement complete Q4_0 quantization support in the TornadoVM path, following the same pattern as the existing Q8_0 support:

Core Q4_0 Infrastructure:

  • Add Q4_0TornadoTensor for GPU tensor representation with 4-bit quantization (see the block-layout sketch after this list)
  • Implement Q4_0LayerPlanner base class for all Q4_0 planners
  • Add LogitsQ4_0Layer shared across all models
  • Update ModelLoader to handle Q4_0 tensor creation and loading
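
For reference, here is a minimal CPU-side sketch of the Q4_0 block layout (GGML/GGUF convention) that the new tensor type would wrap. Class, field, and method names are assumptions for illustration, not the actual GPULlama3.java API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: each Q4_0 block covers 32 elements in 18 bytes.
public final class Q4_0BlockSketch {
    static final int BLOCK_SIZE = 32; // elements per quantization block
    static final int TYPE_SIZE = 18;  // 2 bytes FP16 scale + 16 bytes packed nibbles

    private final ByteBuffer blocks;

    Q4_0BlockSketch(ByteBuffer raw) {
        this.blocks = raw.order(ByteOrder.LITTLE_ENDIAN); // GGUF stores little-endian
    }

    // CPU reference dequantization of one element; a TornadoVM kernel would
    // perform the same computation per work-item on the GPU.
    float get(int index) {
        int base = (index / BLOCK_SIZE) * TYPE_SIZE;
        int within = index % BLOCK_SIZE;
        float scale = Float.float16ToFloat(blocks.getShort(base)); // requires Java 20+
        // The 16 data bytes hold 32 nibbles: low nibbles are elements 0..15,
        // high nibbles are elements 16..31 of the block.
        int b = blocks.get(base + 2 + (within & 15)) & 0xFF;
        int q = (within < 16) ? (b & 0x0F) : (b >>> 4);
        return (q - 8) * scale; // Q4_0 values are stored with a +8 offset
    }
}
```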

Model-Specific Q4_0 Implementations:

  • Add LlamaQ4_0LayerPlanner and LlamaQ4_0FFNLayers (also supports Mistral; see the hierarchy sketch after this list)
  • Add Qwen2Q4_0LayerPlanner and Qwen2Q4_0FFNLayers (also supports DeepSeek R1 Distill)
  • Add Qwen3Q4_0LayerPlanner and Qwen3Q4_0FFNLayers
  • Add Phi3Q4_0LayerPlanner and Phi3Q4_0FFNLayers
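
A rough sketch of how the proposed hierarchy could hang together; the class names mirror the lists above, but the method shape is an assumption based on the described Q8_0 pattern:

```java
// Assumed shape only, not the project's real interfaces.
abstract class Q4_0LayerPlannerSketch {
    // Each model planner schedules its own FFN layers plus the shared logits layer.
    abstract void planFFNLayers();
}

final class LlamaQ4_0LayerPlannerSketch extends Q4_0LayerPlannerSketch {
    @Override
    void planFFNLayers() {
        // Llama and Mistral share FFN wiring, so one planner serves both.
        System.out.println("scheduling LlamaQ4_0FFNLayers");
    }
}

final class Qwen2Q4_0LayerPlannerSketch extends Q4_0LayerPlannerSketch {
    @Override
    void planFFNLayers() {
        // Qwen2 planners also cover DeepSeek R1 Distill checkpoints.
        System.out.println("scheduling Qwen2Q4_0FFNLayers");
    }
}
```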

Factory and Loader Updates:

  • Update QuantizationPlannerFactory to route Q4_0 requests to the appropriate planners (sketched after this list)
  • Update all model loaders (Llama, Qwen2, Qwen3, Phi3, Mistral) to accept Q4_0
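
A hedged sketch of the requested routing; the enums and string return values are illustrative stand-ins (the real factory presumably returns planner instances), but the model-to-planner mapping follows the lists above:

```java
final class QuantizationPlannerFactorySketch {
    enum GGMLType { Q8_0, Q4_0 }
    enum Model { LLAMA, MISTRAL, QWEN2, DEEPSEEK_R1_DISTILL, QWEN3, PHI3 }

    static String plannerFor(Model model, GGMLType type) {
        if (type != GGMLType.Q4_0) {
            throw new UnsupportedOperationException("only Q4_0 sketched here");
        }
        return switch (model) {
            case LLAMA, MISTRAL -> "LlamaQ4_0LayerPlanner";             // Mistral reuses the Llama planner
            case QWEN2, DEEPSEEK_R1_DISTILL -> "Qwen2Q4_0LayerPlanner"; // DeepSeek R1 Distill reuses Qwen2
            case QWEN3 -> "Qwen3Q4_0LayerPlanner";
            case PHI3 -> "Phi3Q4_0LayerPlanner";
        };
    }
}
```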

Q4_0 achieves roughly 4x memory compression vs FP16 (64 vs. 18 bytes per 32-element block) and roughly 2x vs Q8_0 (34 vs. 18 bytes), while maintaining inference accuracy through per-block quantization with FP16 scale factors. Block size: 32 elements; type size: 18 bytes (2-byte FP16 scale + 16 bytes of packed 4-bit values).
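
The ratios can be sanity-checked from the block sizes alone (a standalone sketch, not project code):

```java
public class Q4_0SizeCheck {
    public static void main(String[] args) {
        final int block = 32;                   // elements per quantization block
        final double fp16 = 2.0 * block;        // 64 bytes: 2 bytes per element
        final double q8_0 = 2.0 + block;        // 34 bytes: FP16 scale + 32 int8 values
        final double q4_0 = 2.0 + block / 2.0;  // 18 bytes: FP16 scale + 16 packed bytes
        System.out.printf("Q4_0 vs FP16: %.2fx%n", fp16 / q4_0); // ~3.56x
        System.out.printf("Q4_0 vs Q8_0: %.2fx%n", q8_0 / q4_0); // ~1.89x
    }
}
```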
