quanto
Quanto scale values seem unpopulated in quantized model
When loading a Mistral model, I noticed that the output_scale and input_scale values associated with the quantized tensors were just tensors with the value 1, i.e. tensor(1., device='cuda:0'). This seems incorrect: the model appears to be quantized correctly, so I would expect these variables to hold the scaling factors that were actually used. Is there a reason for this behavior?
Here is the code I used:
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from quanto import quantize  # freeze is also available

seed = 1
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
device = None
weights_type = "int8"
activations_type = "none"

torch.manual_seed(seed)
if device is None:
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("Using cuda device")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")
else:
    device = torch.device(device)

# keyword_to_qtype is a helper (defined elsewhere) that maps strings
# like "int8" / "none" to quanto qtypes or None
weights = keyword_to_qtype(weights_type)
activations = keyword_to_qtype(activations_type)

dtype = torch.float32 if device.type == "cpu" else torch.float16
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=dtype, low_cpu_mem_usage=True
).to(device)

if weights is not None or activations is not None:
    print("Quantizing")
    start = time.time()
    quantize(model, weights=weights, activations=activations)
    # freeze(model)
    print(f"Finished: {time.time() - start:.2f}")
The input and output scales are only used when activations is not None. They default to 1.0 and are only updated by going through a calibration phase. See https://github.com/huggingface/quanto?tab=readme-ov-file#quantization-workflow
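To illustrate what calibration computes, here is a minimal pure-PyTorch sketch (not quanto's actual implementation) of how an activation scale can be derived: before calibration the scale is a placeholder of 1.0, and running sample inputs through an observer replaces it with a value based on the observed range, e.g. a symmetric absmax scale for int8.

```python
import torch

def absmax_scale(x: torch.Tensor, qmax: int = 127) -> torch.Tensor:
    # symmetric int8 scale: map the largest observed magnitude to qmax
    return x.abs().max() / qmax

# before calibration the scale is just a placeholder, as in the issue
scale = torch.tensor(1.0)

# "calibration": observe sample activations and record their range
sample = torch.tensor([-6.35, 2.0, 5.08])
scale = absmax_scale(sample)
print(scale)  # tensor(0.0500)
```

In quanto this observation happens while running representative inputs through the model inside its calibration context, after which input_scale and output_scale are no longer 1.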