
Feature request: Add ASTC weight compression + hardware decoding support

Open 123abcaaa123 opened this issue 5 months ago • 0 comments


TL;DR
Recent Apple documentation (“Apple Intelligence Foundation Language Models Tech Report 2025”) shows that storing LLM weights as 6×6 ASTC blocks (HDR‑ch mode) shrinks storage to ≈3.6 bits/weight, while inference runs at virtually zero extra latency thanks to the fixed‑function ASTC decoder present in every Apple GPU since the A7.
Integrating this into mlx‑lm would give a 4–5× memory and bandwidth saving on Apple devices with little accuracy loss (<1 pp on MMLU after a small LoRA‑based recovery pass).
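The storage figure above can be sanity‑checked with a little arithmetic: every ASTC block occupies 128 bits regardless of its footprint, so a 6×6 block gives 128/36 ≈ 3.56 bits per weight. A quick sketch (the ~8e9 parameter count used for the model‑level estimate is an assumption for illustration):

```python
# Every ASTC block is 128 bits, whatever its footprint.
BLOCK_BITS = 128
weights_per_block = 6 * 6                  # 6x6 footprint from the Tech Report
bits_per_weight = BLOCK_BITS / weights_per_block
print(round(bits_per_weight, 2))           # 3.56

# Rough model-level saving (assuming ~8e9 parameters for an 8B model)
params = 8e9
fp16_gb = params * 16 / 8 / 1e9            # fp16 baseline, 16.0 GB
astc_gb = params * bits_per_weight / 8 / 1e9
print(round(fp16_gb / astc_gb, 1))         # 4.5
```

This lines up with the 4–5× figure claimed above.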


1 Background & motivation

  • Memory & bandwidth bottlenecks
    mlx‑lm already offers 4‑bit / 2‑bit quantisation and KV‑cache sharing, which is enough for ~3 B‑parameter on‑device models, but larger dense or MoE models remain bandwidth‑limited.
    ASTC turns weights into textures, which the Apple GPU pipeline can decode “for free” on every fetch.

  • Hardware availability
    The ASTC decoder is already baked into the texture‑sampling path on iPhone, iPad and Apple Silicon Macs. Metal / MPS can sample .astc textures with one line of shader code.
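Since the GPU sees weights as textures, each 2‑D weight matrix would first need to be padded and tiled into 6×6 blocks before encoding. A minimal NumPy sketch of that reshaping step (`tile_for_astc` is a hypothetical helper, not an mlx API):

```python
import numpy as np

def tile_for_astc(w: np.ndarray, block: int = 6) -> np.ndarray:
    """Zero-pad `w` so both dims are multiples of `block`, then
    return an array of shape (n_blocks, block, block)."""
    rows, cols = w.shape
    pad_r = (-rows) % block
    pad_c = (-cols) % block
    w = np.pad(w, ((0, pad_r), (0, pad_c)))
    r, c = w.shape
    return (w.reshape(r // block, block, c // block, block)
             .transpose(0, 2, 1, 3)       # gather each 6x6 tile together
             .reshape(-1, block, block))

# A 10x14 matrix pads to 12x18, i.e. (12/6)*(18/6) = 6 blocks of 6x6.
blocks = tile_for_astc(np.ones((10, 14), dtype=np.float16))
print(blocks.shape)   # (6, 6, 6)
```

A real encoder would additionally have to pick a channel layout and handle the per‑block scaling that ASTC's HDR endpoints require.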


2 Proposed API (draft)

import mlx
from mlx.experimental import astc

# ① Offline compression
astc.encode_weights(
    model_path="qwen3-8b.safetensors",
    out_dir="qwen3-8b-astc/",
    block_size=(6, 6),          # matches Apple Tech Report
    mode="hdr-ch"               # positive‑only; stores per‑block offset
)

# ② Inference‑time loading (auto‑detect GPU capability)
model = astc.load_astc_weights(
    arch="qwen3-8b",
    astc_dir="qwen3-8b-astc/",
    fallback_dtype="float16"    # decode on CPU if ASTC not supported
)
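For reference, the per‑block offset mentioned in the mode="hdr-ch" comment is just a shift that makes each block non‑negative before encoding, stored separately so it can be undone at decode time. A toy illustration (`split_offset` is illustrative only, not part of the proposed API):

```python
# Toy illustration (not mlx code): shift a block so all values are
# non-negative, keeping the scalar offset needed to undo the shift.
def split_offset(block):
    off = min(block)
    return off, [v - off for v in block]

off, shifted = split_offset([-2.0, 0.5, 1.0])
print(off, shifted)   # -2.0 [0.0, 2.5, 3.0]
```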

123abcaaa123 · Jul 25 '25 11:07