GPULlama3.java
Add Q4_0 quantization support for all models in TornadoVM path
Implement complete Q4_0 quantization support following the same pattern as Q8_0:
Core Q4_0 Infrastructure:
- Add Q4_0TornadoTensor for GPU tensor representation with 4-bit quantization (a dequantization sketch follows this list)
- Implement Q4_0LayerPlanner base class for all Q4_0 planners
- Add LogitsQ4_0Layer shared across all models
- Update ModelLoader to handle Q4_0 tensor creation and loading
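For a concrete view of what Q4_0TornadoTensor and the loader have to decode, here is a minimal dequantization sketch. It assumes the standard GGUF Q4_0 block layout (FP16 scale first, then 16 bytes of nibbles, with low nibbles holding elements 0-15 and high nibbles elements 16-31); the class and method names here are illustrative, not the project's actual API.

```java
// Minimal Q4_0 block decoder, assuming the standard GGUF layout.
final class Q4_0BlockSketch {
    static final int BLOCK_SIZE = 32;  // elements per block
    static final int BLOCK_BYTES = 18; // 2-byte FP16 scale + 16 packed bytes

    // Dequantize element i (0..31) of the block starting at `offset` in `raw`.
    static float dequantize(byte[] raw, int offset, int i) {
        // FP16 scale, little-endian, in the first two bytes of the block.
        short scaleBits = (short) ((raw[offset] & 0xFF) | ((raw[offset + 1] & 0xFF) << 8));
        float scale = Float.float16ToFloat(scaleBits); // Java 20+
        // Elements 0..15 live in the low nibbles, 16..31 in the high nibbles.
        int packed = raw[offset + 2 + (i & 15)] & 0xFF;
        int nibble = (i < 16) ? (packed & 0x0F) : (packed >>> 4);
        return (nibble - 8) * scale; // stored values are offset by 8
    }
}
```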
Model-Specific Q4_0 Implementations:
- Add LlamaQ4_0LayerPlanner and LlamaQ4_0FFNLayers (also supports Mistral)
- Add Qwen2Q4_0LayerPlanner and Qwen2Q4_0FFNLayers (also supports DeepSeek R1 Distill)
- Add Qwen3Q4_0LayerPlanner and Qwen3Q4_0FFNLayers
- Add Phi3Q4_0LayerPlanner and Phi3Q4_0FFNLayers
Factory and Loader Updates:
- Update QuantizationPlannerFactory to route Q4_0 requests to the appropriate planners (a routing sketch follows this list)
- Update all model loaders (Llama, Qwen2, Qwen3, Phi3, Mistral) to accept Q4_0
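As a sketch of the routing this adds (the planner class names come from this PR, but the ModelType enum, method signature, and no-arg constructors are assumptions):

```java
// Hypothetical factory routing for Q4_0; the real planners likely take
// model configuration in their constructors.
enum ModelType { LLAMA, MISTRAL, QWEN_2, DEEPSEEK_R1_DISTILL, QWEN_3, PHI_3 }

static Q4_0LayerPlanner q4_0PlannerFor(ModelType model) {
    return switch (model) {
        case LLAMA, MISTRAL              -> new LlamaQ4_0LayerPlanner(); // Llama planner also serves Mistral
        case QWEN_2, DEEPSEEK_R1_DISTILL -> new Qwen2Q4_0LayerPlanner(); // Qwen2 planner also serves DeepSeek R1 Distill
        case QWEN_3                      -> new Qwen3Q4_0LayerPlanner();
        case PHI_3                       -> new Phi3Q4_0LayerPlanner();
    };
}
```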
Q4_0 achieves roughly 4x memory compression vs FP16 and roughly 2x vs Q8_0 (exact per-block arithmetic below), while maintaining inference accuracy through per-block quantization with FP16 scale factors. Block size: 32 elements; type size: 18 bytes (2-byte FP16 scale + 16 bytes of packed 4-bit values).
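For reference, the exact per-block numbers behind those ratios (Q8_0 assumed to use the standard 34-byte GGUF block: one FP16 scale plus 32 int8 values):

```
FP16:  32 elements x 2 bytes               = 64 bytes/block
Q8_0:  2-byte FP16 scale + 32 x 1 byte     = 34 bytes/block
Q4_0:  2-byte FP16 scale + 16 packed bytes = 18 bytes/block

64 / 18 ≈ 3.6x vs FP16        34 / 18 ≈ 1.9x vs Q8_0
```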