transformers-bloom-inference
Sharding a model checkpoint for DeepSpeed usage
Hey! I'm using a custom version of this repo to run BLOOM-175B with DeepSpeed, and it works great, thank you for this! I was thinking of exploring other large models (such as OPT-175B) and was wondering what the process is for creating a pre-sharded, int8 DeepSpeed checkpoint for them, similar to https://huggingface.co/microsoft/bloom-deepspeed-inference-int8. Is there any documentation or example script available for this?
I am unsure about OPT's compatibility with DeepSpeed.
But if it works, you can simply pass the save_mp_checkpoint_path
parameter to the init_inference method.
This will create a pre-sharded fp16 checkpoint (assuming it works :) )
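A minimal sketch of what that could look like, under the assumption that OPT works with DeepSpeed's kernel injection; the model name, output path, and GPU count below are placeholders, and for a 175B model you would normally construct the model on the meta device rather than materializing the full weights as done here:

```python
# Sketch: export a pre-sharded fp16 checkpoint via save_mp_checkpoint_path.
# Assumptions: OPT is supported by DeepSpeed kernel injection; model name,
# output path, and GPU count are placeholders.
# Launch with e.g.:  deepspeed --num_gpus 8 shard_opt.py
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM

MODEL_NAME = "facebook/opt-66b"          # placeholder: swap in your OPT checkpoint
OUTPUT_DIR = "/data/opt-deepspeed-fp16"  # where the sharded checkpoint is written

world_size = int(os.getenv("WORLD_SIZE", "1"))

# NOTE: for a 175B model you would normally build the model on the meta device
# (e.g. with deepspeed.OnDevice) instead of loading full weights on every rank.
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)

# save_mp_checkpoint_path tells DeepSpeed to dump the tensor-parallel
# (pre-sharded) checkpoint after its inference kernels have been injected.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    save_mp_checkpoint_path=OUTPUT_DIR,
)
```

The output directory should then contain the sharded weight files plus a checkpoint metadata json, which you can pass back to init_inference via its checkpoint argument on subsequent runs, the same way the microsoft/bloom-deepspeed-inference checkpoints are loaded.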
For generating pre-sharded int8 weights, take a look at https://github.com/microsoft/DeepSpeedExamples/blob/master/model_compression/gpt2/bash_script/run_zero_quant.sh
This script generates a quantized version of GPT-2, but it uses quantization-aware training (QAT) and so requires a training run. I haven't personally tried it, though.
Also keep an eye on https://github.com/huggingface/transformers-bloom-inference/pull/37
If you don't have memory constraints (i.e. you have enough GPUs), I would encourage you to use fp16 since it is faster. int8/int4 will become much faster once DeepSpeed starts supporting kernels for them.