GPTQ UX config groups support
This PR enhances the user experience of the GPTQModifier by allowing it to directly accept quantization-related arguments, such as config_groups. This change simplifies the configuration process, enabling users to specify a single GPTQModifier instead of combining both a QuantizationModifier and a GPTQModifier into a recipe.
Key Changes
- Direct Argument Acceptance: GPTQModifier now accepts quantization-related arguments directly, facilitating easier and more direct configuration.
- Enhanced Control: This update exposes more fine-grained control of quantization settings to users, improving usability and customization.
Implementation Details
Under the hood, a vLLMQuantizationModifier is initialized with:
- config_groups
- ignore
- num_calibration_samples
- disable_observer_epoch
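As a rough illustration only (the import path and the helper name _build_quant_modifier below are assumptions, not the exact SparseML source), the GPTQModifier can hold on to these quantization arguments and forward them when it constructs the internal vLLMQuantizationModifier:

# Hedged sketch of the forwarding idea; import path and helper name are assumptions.
from sparseml.modifiers.quantization import vLLMQuantizationModifier  # assumed path

class GPTQModifier:
    def __init__(
        self,
        config_groups=None,
        ignore=None,
        num_calibration_samples=None,
        disable_observer_epoch=None,
        **gptq_kwargs,  # e.g. block_size, dampening_frac, sequential_update
    ):
        # quantization-related settings live on the GPTQModifier itself instead
        # of in a separate, user-declared quantization modifier
        self.config_groups = config_groups
        self.ignore = ignore or []
        self.num_calibration_samples = num_calibration_samples
        self.disable_observer_epoch = disable_observer_epoch
        self.gptq_kwargs = gptq_kwargs
        self._quantization_modifier = None

    def _build_quant_modifier(self):
        # forward only the arguments the user actually set
        quant_args = {
            "config_groups": self.config_groups,
            "ignore": self.ignore,
            "num_calibration_samples": self.num_calibration_samples,
            "disable_observer_epoch": self.disable_observer_epoch,
        }
        quant_args = {key: value for key, value in quant_args.items() if value is not None}
        self._quantization_modifier = vLLMQuantizationModifier(**quant_args)
        return self._quantization_modifier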
Example Configurations
Old Configuration:
# Example of the previous complex setup
test_stage:
  obcq_modifiers:
    vLLMQuantizationModifier:
      ignore: [...]
      config_groups:
        group_0:
          targets: ["Linear"]
          # Further settings...
    GPTQModifier:
      # Additional settings...
New Simplified Configuration:
# Simplified setup with integrated quantization settings
test_stage:
  obcq_modifiers:
    GPTQModifier:
      ignore: [...]
      config_groups:
        group_0:
          targets: ["Linear"]
          # Further settings...
      # Additional simplified settings...
End-to-End Script Example
Recipe:
# local/feature/gptq_ux/recipes/recipe_config_groups.yaml
test_stage:
  obcq_modifiers:
    GPTQModifier:
      ignore: ["LlamaRotaryEmbedding", "LlamaRMSNorm", "SiLUActivation", "MatMulLeftInput_QK", "MatMulRightInput_QK", "MatMulLeftInput_PV", "MatMulRightInput_PV", "MatMulOutput_QK", "MatMulOutput_PV", "lm_head", "Embedding"]
      sequential_update: True
      dampening_frac: 0.001
      block_size: 128
      config_groups:
        group_0:
          targets: ["Linear"]
          input_activations: null
          output_activations: null
          weights:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "tensor"
            group_size: 128
# local/feature/get_quant_model.py
from pathlib import Path
from sparseml.transformers import SparseAutoModelForCausalLM, oneshot
import argparse
from datetime import datetime
tinyllama_stub = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tiny_random_llama_stub = "HuggingFaceH4/tiny-random-LlamaForCausalLM"
parser = argparse.ArgumentParser(description="Get Quant Model")
parser.add_argument('--recipe', default="/root/projects/sparseml/local/feature/recipe.yaml", help='Path to the recipe')
parser.add_argument('--model_stub', default=tinyllama_stub, help='Model stub')
parser.add_argument('--dataset', default="open_platypus", help='Dataset name')
parser.add_argument('--max_seq_length', type=int, default=512, help='Maximum sequence length')
parser.add_argument('--output_dir', default=None, help='Output directory')
parser.add_argument('--num_calibration_samples', type=int, default=512, help='Number of calibration samples')
parser.add_argument('--overwrite_output_dir', action='store_true', help='Overwrite output directory')
parser.add_argument('--small', action='store_true', help='Use a small model')
args = parser.parse_args()
def get_save_dir_name(model_stub):
    dir_name = f"{model_stub.split('/')[-1]}_{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    return str(Path("output") / dir_name)
recipe = args.recipe
model_stub = tiny_random_llama_stub if args.small else args.model_stub
dataset = args.dataset
max_seq_length = args.max_seq_length
output_dir = args.output_dir or get_save_dir_name(model_stub)
num_calibration_samples = args.num_calibration_samples
device = "cuda"
oneshot(
    model=model_stub,
    dataset=dataset,
    overwrite_output_dir=True,
    output_dir=output_dir,
    max_seq_length=max_seq_length,
    num_calibration_samples=num_calibration_samples,
    recipe=recipe,
    oneshot_device=device,
)
# try reloading the model
model_new = SparseAutoModelForCausalLM.from_pretrained(output_dir)
print("Model reloaded successfully!")
Output
Command
python local/feature/get_quant_model.py --small \
--recipe local/feature/gptq_ux/recipes/recipe_config_groups.yaml
STDOUT
# Output from running the example command
2024-05-09 20:45:40 sparseml.transformers.finetune.session_mixin INFO ...
Model reloaded successfully!
Do we still need the ignore list if we have a targets list? It would be great if we didn't need architecture-specific ignores like LlamaRMSNorm.
Side note: vLLMQuantizationModifier is a dangerous name to keep around; I would prefer if we didn't keep this as a modifier.
Yeah, we can safely delete the ignore list; we only need to add a module to the ignore list if it would otherwise be covered by one of the config groups. For example, LlamaRMSNorm is never matched by a group targeting "Linear", so listing it has no effect.
The vLLMQuantizationModifier vs. regular QuantizationModifier naming is just to differentiate between the old and new quantization frameworks for now. We're going to get rid of the old framework soon, and at that point we can rename the modifier. But if the name itself is an immediate problem, sure, we can change it.