Quantization Compressor Support
Requires this compressed-tensors branch: https://github.com/neuralmagic/compressed-tensors/pull/45
- Adds support for saving compressed quantized models within SparseAutoModel saving. Compression type can be passed in via quantization_format or inferred from the model itself
- Simplified a lot of the save/load logic by moving it to helper classes in compressed-tensors
Examples
Very little UX change: similar to sparsity, we just pass save_compressed=True to enable compression. By default, we save weights in the fake_quant format if save_compressed isn't set.
from sparseml.transformers import SparseAutoModelForCausalLM, oneshot
recipe="tests/sparseml/transformers/compression/recipes/new_quant_full.yaml"
model_stub = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
dataset = "open_platypus"
max_seq_length = 512
num_calibration_samples = 512
output_dir = "./test_updated_llama1.1b_quant_compressed"
model = SparseAutoModelForCausalLM.from_pretrained(model_stub, device_map="cuda:0")
oneshot(
model=model,
dataset=dataset,
overwrite_output_dir=True,
output_dir=output_dir,
max_seq_length=max_seq_length,
num_calibration_samples=num_calibration_samples,
recipe=recipe,
pad_to_max_length=False,
save_compressed=True
)
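Continuing from the snippet above, one way to sanity check what was written is to look at the saved config.json for compression metadata. The key names checked below ("compression_config", "quantization_config") are assumptions about what compressed-tensors records and may differ by version, so treat this as a sketch:
import json
import os
# Inspect the serialized HF config from the run above and print any
# compression-related entries. The key names here are assumptions and
# may not match what compressed-tensors actually writes in every version.
with open(os.path.join(output_dir, "config.json")) as config_file:
    config = json.load(config_file)
for key in ("compression_config", "quantization_config"):
    if key in config:
        print(key, "->", config[key])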
Reloading a fake_quant model and then compressing it:
from sparseml.transformers import SparseAutoModelForCausalLM
output_dir_fake = "./test_updated_llama1.1b_quant"
output_dir_compressed = "./test_updated_llama1.1b_quant_compressed"
model_reloaded = SparseAutoModelForCausalLM.from_pretrained(output_dir_fake)
model_reloaded.save_pretrained(output_dir_compressed, save_compressed=True)
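Assuming the load path goes through the same compressed-tensors helpers, reloading the compressed checkpoint should be a plain from_pretrained call. This is a sketch of the expected round trip, not something verified here:
from sparseml.transformers import SparseAutoModelForCausalLM
# Reload the compressed checkpoint; decompression back to dense tensors is
# assumed to happen inside from_pretrained via the compressed-tensors helpers.
model_decompressed = SparseAutoModelForCausalLM.from_pretrained(
    "./test_updated_llama1.1b_quant_compressed"
)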
You can also specify a quantization compression format by name. Right now we only support unpacked int quantization, but this becomes more relevant as we add additional compression formats for quantization.
from sparseml.transformers import SparseAutoModelForCausalLM
output_dir_fake = "./test_updated_llama1.1b_quant"
output_dir_compressed = "./test_updated_llama1.1b_quant_compressed"
model_reloaded = SparseAutoModelForCausalLM.from_pretrained(output_dir_fake)
model_reloaded.save_pretrained(output_dir_compressed, quantization_format="int_quantized")
What would happen in the following case?
from sparseml.transformers import SparseAutoModelForCausalLM, oneshot
recipe="tests/sparseml/transformers/compression/recipes/old_quant_full.yaml"
model_stub = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
dataset = "open_platypus"
max_seq_length = 512
num_calibration_samples = 512
output_dir = "./test_updated_llama1.1b_quant_compressed"
model = SparseAutoModelForCausalLM.from_pretrained(model_stub, device_map="cuda:0")
oneshot(
model=model,
dataset=dataset,
max_seq_length=max_seq_length,
num_calibration_samples=num_calibration_samples,
recipe=recipe,
pad_to_max_length=False,
)
model.save_pretrained(output_dir, save_compressed=True)
What would be a case where a user would want to specify a quant_format in save_pretrained rather than compressed=True?
I think this looks very good. I like how it comes through via the save_pretrained method, which is very HF-native!
Thoughts
- I think we should have save_compressed=True be the default. With quantization, we are going to need SparseML or vLLM to be able to consume these models anyway; I don't really see an advantage to serializing to fakequant.
> What would happen in the following case? [the oneshot example just above that omits save_compressed and then calls save_pretrained with save_compressed=True]
This will save an uncompressed model to output_dir, then overwrite it with the compressed version in the save_pretrained call.
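If only the compressed copy is wanted, the flag can instead be passed to oneshot directly, as in the first example. A minimal sketch, reusing the variables from the snippet quoted above:
# Same run, but asking oneshot to save the compressed checkpoint itself
# rather than calling save_pretrained afterwards.
oneshot(
    model=model,
    dataset=dataset,
    max_seq_length=max_seq_length,
    num_calibration_samples=num_calibration_samples,
    recipe=recipe,
    pad_to_max_length=False,
    output_dir=output_dir,
    save_compressed=True,
)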
> What would be a case where a user would want to specify a quant_format in save_pretrained rather than compressed=True?
Right now there isn't a use case. Once we add more compression formats, this would allow a user to specify how they want to compress the model, for instance saving the model with one integer per weight vs. packing the weights.
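For illustration, once a packed format exists the choice could look like the sketch below; "pack_quantized" is a hypothetical placeholder name, and int_quantized is the only format supported today:
from sparseml.transformers import SparseAutoModelForCausalLM
model_reloaded = SparseAutoModelForCausalLM.from_pretrained("./test_updated_llama1.1b_quant")
# Supported today: store one (unpacked) integer per quantized weight.
model_reloaded.save_pretrained(
    "./test_updated_llama1.1b_int_quantized", quantization_format="int_quantized"
)
# Hypothetical future format that packs several low-bit weights into one
# integer; "pack_quantized" is a placeholder, not a format that exists yet.
model_reloaded.save_pretrained(
    "./test_updated_llama1.1b_pack_quantized", quantization_format="pack_quantized"
)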
> I think we should have save_compressed=True be the default. With quantization, we are going to need SparseML or vLLM to be able to consume these models anyway; I don't really see an advantage to serializing to fakequant.
My hesitation with this is that once we compress the model we lose information about the original weights. In fakequant we retain the original uncompressed weights and can continue to recalibrate or run GPTQ on the model; once it's compressed we can't do this.
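As a sketch of what the fakequant default preserves, the uncompressed checkpoint can be reloaded and calibrated again; this reuses the recipe and dataset from the examples above, and the output path is just an illustration:
from sparseml.transformers import SparseAutoModelForCausalLM, oneshot
# The fake_quant checkpoint still carries the original uncompressed weights,
# so it can be reloaded and run through another calibration pass (e.g. GPTQ).
# A compressed checkpoint could not be recalibrated this way because the
# original weights are gone.
model = SparseAutoModelForCausalLM.from_pretrained("./test_updated_llama1.1b_quant")
oneshot(
    model=model,
    dataset="open_platypus",
    recipe="tests/sparseml/transformers/compression/recipes/new_quant_full.yaml",
    max_seq_length=512,
    num_calibration_samples=512,
    output_dir="./test_updated_llama1.1b_quant_recalibrated",
)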