
Sharding a model checkpoint for deepspeed usage

CoderPat opened this issue 2 years ago · 3 comments

Hey! I'm using a custom version of this repo to run BLOOM-175B with DeepSpeed and it works great, thank you for this! I was thinking of exploring other large models (such as OPT-175B) and was wondering what the process is for creating a pre-sharded, int8 DeepSpeed checkpoint for them, similar to https://huggingface.co/microsoft/bloom-deepspeed-inference-int8. Is there any documentation or are there example scripts available for this?

CoderPat avatar Dec 05 '22 15:12 CoderPat

I am unsure about OPT's compatibility with DeepSpeed, but if it works, you can simply pass the save_mp_checkpoint_path parameter to the init_inference method. This will create a pre-sharded fp16 version (assuming it works :) ).
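For reference, a minimal sketch of what that could look like, assuming OPT is actually supported by DeepSpeed's kernel injection; the model name, world size handling, and output path below are placeholders, and loading is simplified (the real scripts in this repo avoid materializing the full fp32/fp16 model on a single host):

```python
# Sketch only: assumes OPT works with DeepSpeed kernel injection.
# Launch with e.g.: deepspeed --num_gpus 8 shard_checkpoint.py
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM

model_name = "facebook/opt-175b"  # placeholder: any causal LM checkpoint
world_size = int(os.getenv("WORLD_SIZE", "1"))

# Simplified load; for a 175B model this needs a lot of CPU RAM.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# save_mp_checkpoint_path tells DeepSpeed to write out a pre-sharded
# (tensor-parallel) fp16 checkpoint after injecting its inference kernels.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    save_mp_checkpoint_path="/path/to/opt-sharded-fp16",  # placeholder output dir
)
```

On later runs, the generated directory (which should also contain a DeepSpeed inference config JSON) can be pointed to via the checkpoint argument of init_inference so the sharding step doesn't have to be repeated.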

For generating int8 weights (pre-sharded), look at https://github.com/microsoft/DeepSpeedExamples/blob/master/model_compression/gpt2/bash_script/run_zero_quant.sh

This script generates a quantized version of GPT-2, but note that it uses quantization-aware training (QAT) and therefore requires a training step. I haven't personally tried it, though.

mayank31398 avatar Dec 05 '22 17:12 mayank31398

Also keep an eye on https://github.com/huggingface/transformers-bloom-inference/pull/37

mayank31398 avatar Dec 05 '22 17:12 mayank31398

If you don't have memory constraints (i.e. you have enough GPUs), I would encourage you to use fp16, since it is currently faster. int8/int4 will be much faster once DeepSpeed adds kernel support for them.

mayank31398 avatar Dec 05 '22 17:12 mayank31398