DeepSpeed
Support AutoTP with weight-only quantization in DS inference path
This PR makes weight-only quantization (WOQ) work with AutoTP.
Sample code:
```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from deepspeed.inference.quantization.quantization import _init_group_wise_weight_quantization

# Load the model in fp16, then shard it with AutoTP (no kernel injection).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
ds_model = deepspeed.init_inference(model,
                                    mp_size=world_size,
                                    dtype=torch.float16,
                                    replace_with_kernel_inject=False)
model = ds_model.module

# Apply group-wise weight-only quantization to the sharded model.
ds_config = {
    "weight_quantization": {
        "post_init_quant": {
            '*': {
                'num_bits': 4,
                'group_size': 32,
                'group_dim': 1,
                'symmetric': False
            },
        }
    }
}
model = _init_group_wise_weight_quantization(model, ds_config)
```
In this way, users can enable WOQ on multiple cards.
@ftian1 if accelerator other than CUDA want to support AutoTP WOQ, which set of OpBuilder/kernels needs to be implemented? Can you provide a link to kernel usage in the code?
Here is the link: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/quantization/utils.py#L115-L124, which dispatches to deepspeed.ops.op_builder.QuantizerBuilder() if the device is CUDA and group_size is divisible by 8. This implementation could be enhanced to support other hardware's OpBuilders.
It would be better to detect whether a custom kernel exists by checking the attributes of the loaded ops and calling the custom kernel accordingly, so that any accelerator implementing these kernels could be plugged in.
@ftian1 Is the usage of WOQ with AutoTP similar to its usage with kernel injection? Can you post sample code showing what WOQ in DeepSpeed looks like with kernel injection?