🚀 Feature request: Add ASTC weight compression + hardware decoding support
TL;DR
Recent Apple documentation (“Apple Intelligence Foundation Language Models Tech Report 2025”) shows that storing LLM weights as 6×6 ASTC blocks (HDR‑ch mode) shrinks storage to ≈3.6 bits/weight, while inference runs at virtually zero extra latency thanks to the fixed‑function ASTC decoder present in every Apple GPU since the A7.
Integrating this into mlx‑lm would give us 4–5× memory‑and‑bandwidth savings on Apple devices with little accuracy loss (<1 pp on MMLU with a small LoRA‑based recovery pass).
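For context, the ≈3.6 bits/weight figure falls straight out of the ASTC format itself: every ASTC block occupies exactly 128 bits regardless of its footprint, so a 6×6 block amortises those bits over 36 weights. A quick back‑of‑the‑envelope check (plain Python, no mlx dependency):

```python
# Every ASTC block is exactly 128 bits, regardless of its footprint,
# so a 6x6 block amortises those bits over 36 weights.
ASTC_BLOCK_BITS = 128
block_w, block_h = 6, 6

bits_per_weight = ASTC_BLOCK_BITS / (block_w * block_h)
print(f"{bits_per_weight:.2f} bits/weight")        # 3.56 -> the ~3.6 above
print(f"vs float16: {16 / bits_per_weight:.1f}x")  # 4.5x -> the 4-5x above
```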
1 Background & motivation
- **Memory & bandwidth bottlenecks**
  mlx‑lm already offers 4‑bit / 2‑bit quantisation and KV‑cache sharing, which is enough for ~3 B on‑device models, but larger dense or MoE models are still bandwidth‑limited. ASTC turns weights into textures, and the Apple GPU pipeline can decode them “for free” on every fetch.
- **Hardware availability**
  The ASTC decoder is already baked into the texture‑sampling path on iPhone, iPad and Apple Silicon Macs. Metal / MPS can sample `.astc` textures with one line of shader code.
2 Proposed API (draft)
```python
import mlx
from mlx.experimental import astc

# ① Offline compression
astc.encode_weights(
    model_path="qwen3-8b.safetensors",
    out_dir="qwen3-8b-astc/",
    block_size=(6, 6),  # matches Apple Tech Report
    mode="hdr-ch",      # positive-only; stores per-block offset
)

# ② Inference-time loading (auto-detect GPU capability)
model = astc.load_astc_weights(
    arch="qwen3-8b",
    astc_dir="qwen3-8b-astc/",
    fallback_dtype="float16",  # decode on CPU if ASTC not supported
)
```