
Add Q4_0 quantization support for all models in TornadoVM path

Open · mikepapadim opened this issue 3 weeks ago

Implement complete Q4_0 quantization support in the TornadoVM path, following the same pattern as the existing Q8_0 support:

Core Q4_0 Infrastructure:

  • Add Q4_0TornadoTensor for GPU tensor representation with 4-bit quantization (see the block-layout sketch after this list)
  • Implement Q4_0LayerPlanner base class for all Q4_0 planners
  • Add LogitsQ4_0Layer shared across all models
  • Update ModelLoader to handle Q4_0 tensor creation and loading
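
For reference, here is a minimal CPU-side sketch of the Q4_0 block layout (GGML/GGUF convention) that the new tensor type would wrap. Class, field, and method names are assumptions for illustration, not the actual GPULlama3.java API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: each Q4_0 block covers 32 elements in 18 bytes.
public final class Q4_0BlockSketch {
    static final int BLOCK_SIZE = 32; // elements per quantization block
    static final int TYPE_SIZE = 18;  // 2 bytes FP16 scale + 16 bytes packed nibbles

    private final ByteBuffer blocks;

    Q4_0BlockSketch(ByteBuffer raw) {
        this.blocks = raw.order(ByteOrder.LITTLE_ENDIAN); // GGUF stores little-endian
    }

    // CPU reference dequantization of one element; a TornadoVM kernel would
    // perform the same computation per work-item on the GPU.
    float get(int index) {
        int base = (index / BLOCK_SIZE) * TYPE_SIZE;
        int within = index % BLOCK_SIZE;
        float scale = Float.float16ToFloat(blocks.getShort(base)); // requires Java 20+
        // The 16 data bytes hold 32 nibbles: low nibbles are elements 0..15,
        // high nibbles are elements 16..31 of the block.
        int b = blocks.get(base + 2 + (within & 15)) & 0xFF;
        int q = (within < 16) ? (b & 0x0F) : (b >>> 4);
        return (q - 8) * scale; // Q4_0 values are stored with a +8 offset
    }
}
```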

Model-Specific Q4_0 Implementations:

  • Add LlamaQ4_0LayerPlanner and LlamaQ4_0FFNLayers (also supports Mistral; see the hierarchy sketch after this list)
  • Add Qwen2Q4_0LayerPlanner and Qwen2Q4_0FFNLayers (also supports DeepSeek R1 Distill)
  • Add Qwen3Q4_0LayerPlanner and Qwen3Q4_0FFNLayers
  • Add Phi3Q4_0LayerPlanner and Phi3Q4_0FFNLayers
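
A rough sketch of how the proposed hierarchy could hang together; the class names mirror the lists above, but the method shape is an assumption based on the described Q8_0 pattern:

```java
// Assumed shape only, not the project's real interfaces.
abstract class Q4_0LayerPlannerSketch {
    // Each model planner schedules its own FFN layers plus the shared logits layer.
    abstract void planFFNLayers();
}

final class LlamaQ4_0LayerPlannerSketch extends Q4_0LayerPlannerSketch {
    @Override
    void planFFNLayers() {
        // Llama and Mistral share FFN wiring, so one planner serves both.
        System.out.println("scheduling LlamaQ4_0FFNLayers");
    }
}

final class Qwen2Q4_0LayerPlannerSketch extends Q4_0LayerPlannerSketch {
    @Override
    void planFFNLayers() {
        // Qwen2 planners also cover DeepSeek R1 Distill checkpoints.
        System.out.println("scheduling Qwen2Q4_0FFNLayers");
    }
}
```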

Factory and Loader Updates:

  • Update QuantizationPlannerFactory to route Q4_0 requests to the appropriate planners (sketched after this list)
  • Update all model loaders (Llama, Qwen2, Qwen3, Phi3, Mistral) to accept Q4_0
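
A hedged sketch of the requested routing; the enums and string return values are illustrative stand-ins (the real factory presumably returns planner instances), but the model-to-planner mapping follows the lists above:

```java
final class QuantizationPlannerFactorySketch {
    enum GGMLType { Q8_0, Q4_0 }
    enum Model { LLAMA, MISTRAL, QWEN2, DEEPSEEK_R1_DISTILL, QWEN3, PHI3 }

    static String plannerFor(Model model, GGMLType type) {
        if (type != GGMLType.Q4_0) {
            throw new UnsupportedOperationException("only Q4_0 sketched here");
        }
        return switch (model) {
            case LLAMA, MISTRAL -> "LlamaQ4_0LayerPlanner";             // Mistral reuses the Llama planner
            case QWEN2, DEEPSEEK_R1_DISTILL -> "Qwen2Q4_0LayerPlanner"; // DeepSeek R1 Distill reuses Qwen2
            case QWEN3 -> "Qwen3Q4_0LayerPlanner";
            case PHI3 -> "Phi3Q4_0LayerPlanner";
        };
    }
}
```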

Q4_0 achieves roughly 4x memory compression vs FP16 (64 vs. 18 bytes per 32-element block) and roughly 2x vs Q8_0 (34 vs. 18 bytes), while maintaining inference accuracy through per-block quantization with FP16 scale factors. Block size: 32 elements; type size: 18 bytes (2-byte FP16 scale + 16 bytes of packed 4-bit values).
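
The ratios can be sanity-checked from the block sizes alone (a standalone sketch, not project code):

```java
public class Q4_0SizeCheck {
    public static void main(String[] args) {
        final int block = 32;                   // elements per quantization block
        final double fp16 = 2.0 * block;        // 64 bytes: 2 bytes per element
        final double q8_0 = 2.0 + block;        // 34 bytes: FP16 scale + 32 int8 values
        final double q4_0 = 2.0 + block / 2.0;  // 18 bytes: FP16 scale + 16 packed bytes
        System.out.printf("Q4_0 vs FP16: %.2fx%n", fp16 / q4_0); // ~3.56x
        System.out.printf("Q4_0 vs Q8_0: %.2fx%n", q8_0 / q4_0); // ~1.89x
    }
}
```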
