
Grouping/Blocking/Per-Channel Palettization Options

Open smpanaro opened this issue 1 year ago • 0 comments

❓Question

I see that the Linear 8-bit quantization supports per-channel scales. Is there any way to achieve something similar with palettization?
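For context, "per-channel scales" means each output channel of the weight matrix gets its own quantization scale, instead of one scale for the whole tensor. A minimal numpy sketch of symmetric per-channel 8-bit quantization (illustrative only; function names here are not coremltools API):

```python
import numpy as np

def quantize_per_channel(w, nbits=8):
    """Symmetric linear quantization with one scale per row (output channel)."""
    qmax = 2 ** (nbits - 1) - 1                  # e.g. 127 for 8-bit
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                    # guard against all-zero rows
    q = np.round(w / scales).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((768, 768)).astype(np.float32)
q, scales = quantize_per_channel(w)
print("max abs error:", np.abs(w - dequantize(q, scales)).max())
```

Basic palettization, by contrast, learns a single lookup table shared by the whole tensor, which is what the question below is trying to work around.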

I have a large weight tensor and being able to subdivide it (either per-channel, into groups of channels, or blocks of elements) would help reduce the quantization error for only a very minor increase in the number of bits-per-weight (say 4-bit → 4.1 bit).
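The error-vs-overhead tradeoff described above can be sketched with plain numpy, using toy shapes and a tiny hand-rolled 1-D k-means as the palettizer (none of this is coremltools API):

```python
import numpy as np

def palettize_1d(x, nbits=4, iters=10):
    """Tiny 1-D Lloyd's k-means; returns x reconstructed from its nearest centroid."""
    k = 2 ** nbits
    centroids = np.quantile(x, np.linspace(0.0, 1.0, k))   # spread initial centroids
    for _ in range(iters):
        idx = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            members = x[idx == j]
            if members.size:                               # skip empty clusters
                centroids[j] = members.mean()
    idx = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids[idx]

rng = np.random.default_rng(0)
# toy weight: each output channel gets its own scale, as in real linear layers
ch_scale = rng.uniform(0.1, 2.0, (256, 1)).astype(np.float32)
w = rng.standard_normal((256, 768)).astype(np.float32) * ch_scale

whole = palettize_1d(w.ravel()).reshape(w.shape)           # one LUT for everything
per_chan = np.stack([palettize_1d(row) for row in w])      # one LUT per channel

err = lambda a: float(np.abs(w - a).mean())
print(f"whole-tensor error: {err(whole):.5f}, per-channel error: {err(per_chan):.5f}")

# storage cost: 16 extra fp16 LUT entries per 768-weight channel
print(f"bits per weight: 4 -> {4 + 16 * 16 / 768:.2f}")
```

Per-channel lookup tables cut the error substantially whenever channel scales vary, while the extra LUT storage stays a small fraction of a bit per weight (and smaller still for grouped channels).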

Two things I've tried to achieve this, specifically for linear layers:

  • Split the weight tensor and palettize each split separately, then concat them back together at runtime. This fails because the linear layer doesn't accept non-const inputs for its weights.
  • Split the weight tensor along the output dimension, create a linear layer for each split, pass the input tensor into each, and then concatenate the outputs. This actually works, but it adds a lot of extra ops and makes compiling the model slow when there are many splits. (Visual below, if that helps.) The split version computes the same thing as the original tensor(1,512,768) → Linear(768,768) → tensor(1,512,768)
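The second workaround can be sketched numerically with the shapes from the example above (this shows the split-and-concat math only, not the Core ML graph; each split's weight could then be palettized with its own lookup table):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512, 768)).astype(np.float32)
W = rng.standard_normal((768, 768)).astype(np.float32)   # (out, in) layout
b = rng.standard_normal(768).astype(np.float32)

# split along the output dimension, one "linear layer" per split
n_splits = 8
outs = []
for Wi, bi in zip(np.split(W, n_splits, axis=0), np.split(b, n_splits)):
    outs.append(x @ Wi.T + bi)                           # (1, 512, 768 / n_splits)
y_split = np.concatenate(outs, axis=-1)                  # (1, 512, 768)

y_full = x @ W.T + b
print("max difference:", np.abs(y_full - y_split).max())
```

Splitting along the output dimension only partitions which rows of W each matmul handles, so the concatenated result matches the single large linear layer up to floating-point noise.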

I'm curious if there is either a feature of palettization that I've missed or a different way to use coremltools to achieve this.

smpanaro · Dec 19 '23 03:12