
Question: how to run inference on Int8 models (GPT) supported through ZeroQuant technology?


I just used DeepSpeed ZeroQuant to compress my model, but I don't know how to use DeepSpeed to run inference on it. Is there any documentation describing this?

xk503775229 avatar Sep 07 '22 05:09 xk503775229

Is there any guide to running inference on compressed models (especially ZeroQuant)? Any help would be appreciated.

xk503775229 avatar Sep 07 '22 06:09 xk503775229

@xk503775229 https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/339 This PR adds support for BLOOM ds-inference with fp16 and int8. The README is not up-to-date yet. I will work on fixing that.

mayank31398 avatar Sep 07 '22 19:09 mayank31398

@mayank31398 When I use the BLOOM way to load my checkpoint, I get the following error (screenshot attached):

GPT2 checkpoint type is not supported

I was trying out the compression library for ZeroQuant quantization (for a GPT-2 model). While I was able to compress the model, I didn't see any throughput/latency gain from the quantization during inference. Any help would be appreciated.
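For context, a minimal sketch of the ZeroQuant-style compression flow from the DeepSpeedExamples model_compression recipes might look roughly like this; the config file name, model, and the skipped fine-tuning loop are assumptions, not the exact script used above:

```python
# Hypothetical sketch of ZeroQuant-style (W8A8) compression for GPT-2.
# The config file name and paths below are assumptions.
import torch
from transformers import GPT2LMHeadModel
from deepspeed.compression.compress import init_compression, redundancy_clean

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Wrap the model with the quantization modules described by the compression config.
model = init_compression(model, "ds_config_W8A8.json")

# ... run the usual (lightweight) fine-tuning / calibration loop here ...

# Fold quantization parameters into the weights and remove helper modules,
# then save the compressed checkpoint that checkpoint.json later points to.
model = redundancy_clean(model, "ds_config_W8A8.json")
torch.save(model.state_dict(), "output/W8A8/pytorch_model.bin")
```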

xk503775229 avatar Sep 16 '22 04:09 xk503775229

Can you share the code snippet you used for loading GPT? Also, currently, DS-inference uses specialized fp16 CUDA kernels for inference, which is not yet the case for int8. The int8 CUDA kernels, which are much faster than fp16, will be released later.

mayank31398 avatar Sep 16 '22 06:09 mayank31398

Many thanks. The following is the code snippet I used for loading GPT (screenshot attached).

checkpoint.json:

{"type": "GPT2", "checkpoints": ["/root/DeepSpeedExamples/model_compression/gpt2/output/W8A8/pytorch_model.bin"], "version": 1.0}

xk503775229 avatar Sep 16 '22 06:09 xk503775229

In general, the code is only supposed to work with Megatron checkpoints, but there is an exception for BLOOM; I am not sure of the reason. @jeffra, can you comment? I see the following in the DeepSpeed code: https://github.com/microsoft/DeepSpeed/blob/cf638be99803682933cb4040850765d46832ee78/deepspeed/runtime/state_dict_factory.py#L22-L46

mayank31398 avatar Sep 16 '22 11:09 mayank31398

Hi @xk503775229,

Thanks for your interest in trying Int8 with other models. In general you should be able to do so; however, one issue here is that you want to combine it with checkpoint loading, which is not currently supported for all models. Regarding Int8 inference, have you tried calling init_inference and simply passing int8 without a checkpoint JSON (i.e., letting the model be loaded as it originally was in fp16)? Thanks, Reza
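For illustration, a minimal sketch of that suggestion (int8 via init_inference, no checkpoint JSON), assuming a Hugging Face GPT-2 model; the exact arguments are assumptions:

```python
# Sketch of the suggestion: keep the original (uncompressed) HF weights and
# request int8 from init_inference directly, with no `checkpoint=` JSON.
import torch
import deepspeed
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.int8,               # ask DS-inference for int8
    replace_with_kernel_inject=True,
)
```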

RezaYazdaniAminabadi avatar Sep 19 '22 16:09 RezaYazdaniAminabadi

> Can you share the code snippet you used for loading GPT? Also, currently, DS-inference uses specialized fp16 CUDA kernels for inference, which is not yet the case for int8. The int8 CUDA kernels, which are much faster than fp16, will be released later.

Hi, is there a timeline for the release of the int8 CUDA kernels?

david-macleod avatar Oct 20 '22 18:10 david-macleod

Hi, this will be released later as part of DeepSpeed-MII (MII-Azure): https://github.com/microsoft/DeepSpeed-MII
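For anyone following up, a rough sketch of the MII usage pattern documented in that repository's README around this time; the model name, config keys, and whether int8 becomes selectable this way once the kernels ship are assumptions:

```python
# Hypothetical MII deployment sketch; fp16 is the documented dtype here, and the
# assumption is that int8 would be selected via the same config once released.
import mii

mii.deploy(
    task="text-generation",
    model="gpt2",
    deployment_name="gpt2_deployment",
    mii_config={"dtype": "fp16"},
)

generator = mii.mii_query_handle("gpt2_deployment")
result = generator.query({"query": ["DeepSpeed is"]}, max_new_tokens=20)
print(result)
```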

yaozhewei avatar Nov 04 '22 21:11 yaozhewei

Closing for now. Please re-open it if you need further assistance.

yaozhewei avatar Nov 11 '22 19:11 yaozhewei