
Support AutoTP with weight-only quantization in the DS inference path

ftian1 opened this issue on Nov 29, 2023 · 4 comments

This PR makes weight-only quantization (WOQ) work with AutoTP.

Sample code is shown below:

    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM
    from deepspeed.inference.quantization.quantization import _init_group_wise_weight_quantization

    # model_id, device, and world_size are assumed to be set by the caller.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)

    # Shard the model with AutoTP; kernel injection stays disabled.
    ds_model = deepspeed.init_inference(model,
                                        mp_size=world_size,
                                        dtype=torch.float16,
                                        replace_with_kernel_inject=False)
    model = ds_model.module

    # Post-init 4-bit group-wise quantization; '*' applies to all matched modules.
    ds_config = {
        "weight_quantization": {
            "post_init_quant": {
                '*': {
                    'num_bits': 4,
                    'group_size': 32,
                    'group_dim': 1,
                    'symmetric': False
                },
            }
        }
    }
    model = _init_group_wise_weight_quantization(model, ds_config)

In this way, users can enable WOQ on multiple cards.
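
Such a script is then launched with the DeepSpeed launcher, e.g. on two cards (the script name here is a placeholder):

    deepspeed --num_gpus 2 woq_autotp_example.py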

ftian1 · Nov 29 '23

@ftian1 if an accelerator other than CUDA wants to support AutoTP WOQ, which set of OpBuilder/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?

delock · Nov 30 '23

> @ftian1 if an accelerator other than CUDA wants to support AutoTP WOQ, which set of OpBuilder/kernels needs to be implemented? Can you provide a link to the kernel usage in the code?

Here is the link: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/quantization/utils.py#L115-L124. It dispatches to deepspeed.ops.op_builder.QuantizerBuilder() if the device is CUDA and group_size is divisible by 8. Such an implementation could be enhanced to support other hardware's OpBuilders.
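
A minimal sketch of that dispatch (the function below is illustrative, assuming a simplified signature; the actual logic is in the linked utils.py):

    import torch

    def quantize_group_wise(weight: torch.Tensor, num_bits: int = 4, group_size: int = 32):
        if weight.device.type == 'cuda' and group_size % 8 == 0:
            # The linked code takes a fast path here, dispatching to the op
            # compiled by deepspeed.ops.op_builder.QuantizerBuilder().load();
            # this sketch falls through to the reference math instead.
            pass
        # Reference path: plain PyTorch asymmetric group-wise quantization,
        # runs on any device. Assumes weight.numel() % group_size == 0.
        w = weight.reshape(-1, group_size).float()
        w_min = w.min(dim=1, keepdim=True).values
        w_max = w.max(dim=1, keepdim=True).values
        scale = (w_max - w_min).clamp_min(1e-8) / (2 ** num_bits - 1)
        q = torch.clamp(torch.round((w - w_min) / scale), 0, 2 ** num_bits - 1)
        return q.to(torch.uint8), scale, w_min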

ftian1 · Dec 01 '23

It would be better to detect custom kernel existence by checking an attribute of the loaded ops, and call the custom kernel accordingly, so any accelerator that implements these kernels would be plugged in.
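
Something along these lines, as a sketch (assuming the accelerator interface's create_op_builder hook; the 'quantize' attribute name is a placeholder, not an actual DeepSpeed symbol):

    from deepspeed.accelerator import get_accelerator

    def load_quantizer_kernels_or_none():
        # Ask the active accelerator for its quantizer builder; each backend
        # (CUDA, XPU, CPU, ...) can supply its own implementation.
        builder = get_accelerator().create_op_builder('QuantizerBuilder')
        if builder is None:
            return None
        try:
            ops = builder.load()  # JIT-build/load the backend's kernels
        except Exception:
            return None
        # Detect the custom kernel by attribute; any accelerator that
        # exposes it gets the fast path, everything else falls back.
        return ops if hasattr(ops, 'quantize') else None

With that, the check in utils.py could test the loaded ops instead of hard-coding device == 'cuda'.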

delock · Dec 02 '23

@ftian1 Is the usage of WoQ with AutoTP similar to that with kernel injection? Can you post sample code showing what WoQ in DeepSpeed looks like with kernel injection?

delock · Dec 19 '23