
Quantization Compressor Support

Satrat opened this issue 1 year ago · 6 comments

Requires this compressed-tensors branch: https://github.com/neuralmagic/compressed-tensors/pull/45

  • Adds support for saving compressed quantized models through SparseAutoModel saving. The compression type can be passed in via quantization_format or inferred from the model itself.
  • Simplifies much of the save/load logic by moving it to helper classes in compressed-tensors.

Examples

Very little UX change: as with sparsity, we just pass save_compressed=True to enable compression. By default, we save weights in the fake_quant format if save_compressed isn't set.

from sparseml.transformers import SparseAutoModelForCausalLM, oneshot

recipe = "tests/sparseml/transformers/compression/recipes/new_quant_full.yaml"
model_stub = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
dataset = "open_platypus"
max_seq_length = 512
num_calibration_samples = 512
output_dir = "./test_updated_llama1.1b_quant_compressed"

model = SparseAutoModelForCausalLM.from_pretrained(model_stub, device_map="cuda:0")

# One-shot quantization run; save_compressed=True writes the compressed
# checkpoint (rather than fake_quant) to output_dir
oneshot(
    model=model,
    dataset=dataset,
    overwrite_output_dir=True,
    output_dir=output_dir,
    max_seq_length=max_seq_length,
    num_calibration_samples=num_calibration_samples,
    recipe=recipe,
    pad_to_max_length=False,
    save_compressed=True,
)

Reloading a fake_quant model and then compressing it:

from sparseml.transformers import SparseAutoModelForCausalLM

output_dir_fake = "./test_updated_llama1.1b_quant"
output_dir_compressed = "./test_updated_llama1.1b_quant_compressed"

model_reloaded = SparseAutoModelForCausalLM.from_pretrained(output_dir_fake)
model_reloaded.save_pretrained(output_dir_compressed, save_compressed=True)

You can also specify a quantization compression format by name. Right now we only support unpacked int quantization, but this becomes more relevant as we add additional compression formats for quantization.

from sparseml.transformers import SparseAutoModelForCausalLM

output_dir_fake = "./test_updated_llama1.1b_quant"
output_dir_compressed = "./test_updated_llama1.1b_quant_compressed"

model_reloaded = SparseAutoModelForCausalLM.from_pretrained(output_dir_fake)
model_reloaded.save_pretrained(output_dir_compressed, quantization_format="int_quantized")

Satrat · Apr 30 '24 18:04

What would happen in the following case?

from sparseml.transformers import SparseAutoModelForCausalLM, oneshot

recipe="tests/sparseml/transformers/compression/recipes/old_quant_full.yaml"
model_stub = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
dataset = "open_platypus"
max_seq_length = 512
num_calibration_samples = 512
output_dir = "./test_updated_llama1.1b_quant_compressed"

model = SparseAutoModelForCausalLM.from_pretrained(model_stub, device_map="cuda:0")
oneshot(
    model=model,
    dataset=dataset,
    max_seq_length=max_seq_length,
    num_calibration_samples=num_calibration_samples,
    recipe=recipe,
    pad_to_max_length=False,
)

model.save_pretrained(output_dir, save_compressed=True)

robertgshaw2-redhat · May 01 '24 21:05

What would be a case where a user would want to specify a quantization_format in save_pretrained rather than save_compressed=True?

robertgshaw2-redhat · May 01 '24 21:05

I think this looks very good. I like how it comes through via the save_pretrained method, which is very HF-native!

Thoughts

  • I think we should have save_compressed=True be the default. With quantization, we are going to need SparseML or vLLM to be able to consume these models anyway, so I do not really see an advantage to serializing to fake_quant.

robertgshaw2-redhat · May 01 '24 21:05

What would happen in the following case?

from sparseml.transformers import SparseAutoModelForCausalLM, oneshot

recipe="tests/sparseml/transformers/compression/recipes/old_quant_full.yaml"
model_stub = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
dataset = "open_platypus"
max_seq_length = 512
num_calibration_samples = 512
output_dir = "./test_updated_llama1.1b_quant_compressed"

model = SparseAutoModelForCausalLM.from_pretrained(model_stub, device_map="cuda:0")
oneshot(
    model=model,
    dataset=dataset,
    max_seq_length=max_seq_length,
    num_calibration_samples=num_calibration_samples,
    recipe=recipe,
    pad_to_max_length=False,
)

model.save_pretrained(output_dir, save_compressed=True)

This will save an uncompressed model to output_dir, then overwrite it with the compressed version in the call to save_pretrained.
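
For reference, a minimal sketch of the single-write path (reusing the variables from the snippet above, same pattern as the first example in this issue): passing save_compressed=True to oneshot itself should write the compressed checkpoint to output_dir directly, so the follow-up save_pretrained call is not needed.

# Sketch: request compression up front so output_dir only ever holds the
# compressed checkpoint (no uncompressed intermediate to overwrite)
oneshot(
    model=model,
    dataset=dataset,
    output_dir=output_dir,
    max_seq_length=max_seq_length,
    num_calibration_samples=num_calibration_samples,
    recipe=recipe,
    pad_to_max_length=False,
    save_compressed=True,
)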

Satrat · May 02 '24 14:05

What would be a case where a user would want to specify a quantization_format in save_pretrained rather than save_compressed=True?

Right now there isn't a use case; once we add more compression formats, this will allow a user to specify how they want to compress the model, for instance saving one integer per weight vs. packing multiple weights into each stored value.
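
A minimal sketch of what that could look like; only "int_quantized" is supported today, and the packed format name below is purely a hypothetical placeholder for a future format:

from sparseml.transformers import SparseAutoModelForCausalLM

model = SparseAutoModelForCausalLM.from_pretrained("./test_updated_llama1.1b_quant")

# Supported today: store one (unpacked) integer per quantized weight
model.save_pretrained("./llama1.1b_int_quantized", quantization_format="int_quantized")

# Hypothetical future format: pack several low-bit weights into each stored value
model.save_pretrained("./llama1.1b_packed", quantization_format="pack_quantized")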

Satrat · May 02 '24 14:05

I think this looks very good. I like how it comes through via the save_pretrained method, which is very HF-native!

Thoughts

  • I think we should have save_compressed=True be the default. With quantization, we are going to need SparseML or vLLM to be able to consume these models anyway, so I do not really see an advantage to serializing to fake_quant.

My hesitation with this is that once we compress the model, we lose information about the original weights. In fake_quant we retain the original uncompressed weights and can continue to recalibrate or run GPTQ on the model; once it's compressed, we can't do this.
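
For illustration, a minimal sketch of the workflow this preserves, using only calls already shown in this thread (model is a quantized model produced by a oneshot run like the ones above; directory names are made up):

from sparseml.transformers import SparseAutoModelForCausalLM

# Save in the default fake_quant format: the original uncompressed weights are
# retained alongside the quantization parameters
model.save_pretrained("./llama1.1b_quant_fake")

# Later: reload the fake_quant checkpoint and keep working on it
# (e.g. recalibrate or run GPTQ), which is no longer possible once compressed
model_reloaded = SparseAutoModelForCausalLM.from_pretrained("./llama1.1b_quant_fake")

# When finished, compress for deployment
model_reloaded.save_pretrained("./llama1.1b_quant_compressed", save_compressed=True)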

Satrat · May 02 '24 14:05