DeepSpeed
Support AutoTP with weight-only quantization in DS inference path
This PR makes weight-only quantization (WOQ) work with AutoTP.
Sample code:
```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from deepspeed.inference.quantization.quantization import _init_group_wise_weight_quantization

# Load the model in fp16, then shard it with AutoTP (no kernel injection).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
ds_model = deepspeed.init_inference(model,
                                    mp_size=world_size,
                                    dtype=torch.float16,
                                    replace_with_kernel_inject=False)
model = ds_model.module

# Apply group-wise weight-only quantization to the sharded model.
ds_config = {
    "weight_quantization": {
        "post_init_quant": {
            '*': {
                'num_bits': 4,
                'group_size': 32,
                'group_dim': 1,
                'symmetric': False
            },
        }
    }
}
model = _init_group_wise_weight_quantization(model, ds_config)
```
In this way, users can enable WOQ on multiple cards.
@ftian1 if accelerator other than CUDA want to support AutoTP WOQ, which set of OpBuilder/kernels needs to be implemented? Can you provide a link to kernel usage in the code?
Here is the link: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/quantization/utils.py#L115-L124, which dispatches to deepspeed.ops.op_builder.QuantizerBuilder() if the device is CUDA and group_size is divisible by 8. This implementation could be enhanced to support other hardware's OpBuilders.
It would be better to detect whether a custom kernel exists by checking the attributes of the loaded ops and calling the custom kernel accordingly, so that any accelerator implementing these kernels could be plugged in.
@ftian1 Is the usage of WOQ with AutoTP similar to its usage with kernel injection? Can you post sample code showing what WOQ in DeepSpeed looks like with kernel injection?