🚀 Feature request: Add ASTC weight compression + hardware decoding support
TL;DR
Recent Apple documentation (“Apple Intelligence Foundation Language Models Tech Report 2025”) shows that storing LLM weights as 6×6 ASTC blocks (HDR‑ch mode) shrinks storage to ≈3.6 bits/weight, while inference runs at virtually zero extra latency thanks to the fixed‑function ASTC decoder present in every Apple GPU since the A7.
Integrating this into mlx‑lm would give us 4–5× memory‑and‑bandwidth savings on Apple devices with little accuracy loss (<1 pp on MMLU with a small LoRA‑based recovery pass).
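For context, the ≈3.6 bits/weight figure falls straight out of the ASTC format itself: every ASTC block occupies exactly 128 bits regardless of its footprint, so a 6×6 block amortises those bits over 36 weights. A quick back‑of‑the‑envelope check (plain Python, no mlx dependency):

```python
# Every ASTC block is exactly 128 bits, regardless of its footprint,
# so a 6x6 block amortises those bits over 36 weights.
ASTC_BLOCK_BITS = 128
block_w, block_h = 6, 6

bits_per_weight = ASTC_BLOCK_BITS / (block_w * block_h)
print(f"{bits_per_weight:.2f} bits/weight")        # 3.56 -> the ~3.6 above
print(f"vs float16: {16 / bits_per_weight:.1f}x")  # 4.5x -> the 4-5x above
```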
1 Background & motivation
- **Memory & bandwidth bottlenecks**
  mlx‑lm already offers 4‑bit / 2‑bit quantisation and KV‑cache sharing, which is enough for ~3 B on‑device models, but larger dense or MoE models are still bandwidth‑limited. ASTC turns weights into textures, and the Apple GPU pipeline can decode them “for free” on every fetch.
- **Hardware availability**
  The ASTC decoder is already baked into the texture‑sampling path on iPhone, iPad and Apple Silicon Macs. Metal / MPS can sample `.astc` textures with one line of shader code.
2 Proposed API (draft)
```python
import mlx
from mlx.experimental import astc

# ① Offline compression
astc.encode_weights(
    model_path="qwen3-8b.safetensors",
    out_dir="qwen3-8b-astc/",
    block_size=(6, 6),  # matches Apple Tech Report
    mode="hdr-ch",      # positive-only; stores per-block offset
)

# ② Inference-time loading (auto-detect GPU capability)
model = astc.load_astc_weights(
    arch="qwen3-8b",
    astc_dir="qwen3-8b-astc/",
    fallback_dtype="float16",  # decode on CPU if ASTC not supported
)
```