
OPT in TP or PP mode

Open volkerha opened this issue 3 years ago • 4 comments

Is there a way to run inference on OPT models in TensorParallel or PipelineParallel mode?

As I understand:

  • BLOOM uses the `llm` provider, which first loads the model weights as meta tensors and then assigns devices during checkpoint loading in ds-inference.

  • OPT uses the `hf` provider with the 🤗 `pipeline` and loads checkpoint weights directly onto a specific device.

However, only MP is supported on the 🤗 side (using accelerate). Is there a way to run inference on OPT with the `llm` provider?

volkerha avatar Oct 17 '22 06:10 volkerha

Hi @volkerha, thanks for using MII! If you take a look here, you'll see that, regardless of the provider, the models are processed by the DeepSpeed Inference Engine. This allows any of the models to be run on multi-GPU setups (using TP). To enable this, just add `"tensor_parallel": 2` to the `mii_config` dict passed to `mii.deploy()`. Some of our examples demonstrate this: https://github.com/microsoft/DeepSpeed-MII/blob/main/examples/local/text-generation-bloom-example.py

mrwyattii avatar Oct 17 '22 16:10 mrwyattii

Here's an example for an OPT model that I just tested on 2 GPUs:

import mii

mii_config = {"dtype": "fp16", "tensor_parallel": 2}
name = "facebook/opt-1.3b"

mii.deploy(
    task="text-generation",
    model=name,
    deployment_name=name + "_deployment",
    mii_config=mii_config,
)

mrwyattii avatar Oct 17 '22 16:10 mrwyattii

I tested facebook/opt-6.7b on 8 GPUs with TP=8, FP16. It takes around 28GB per GPU, which looks like the full model parameters (6.7B * 4 bytes ≈ 27GB) are being loaded on every GPU in FP32 (maybe because fp16 is only applied after model loading?).
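A quick back-of-the-envelope check supports this reading (a sketch; the 6.7B parameter count and per-parameter byte sizes are the only inputs):

```python
# Rough per-GPU memory estimate for facebook/opt-6.7b.
params = 6.7e9  # approximate parameter count

# If the full model is loaded on every GPU in FP32 (4 bytes/param):
fp32_full_copy_gb = params * 4 / 1e9
print(round(fp32_full_copy_gb, 1))  # ~26.8 GB, matching the ~28GB observed

# What we'd expect if weights were instead sharded in FP16 across TP=8
# (2 bytes/param, split over 8 GPUs):
fp16_sharded_gb = params * 2 / 8 / 1e9
print(round(fp16_sharded_gb, 1))  # ~1.7 GB of weights per GPU
```

The gap between the two numbers is why the observed usage points at a full FP32 copy per GPU rather than sharded FP16 weights.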

volkerha avatar Oct 18 '22 08:10 volkerha

@volkerha you are correct, currently with the huggingface provider, we load the full model onto each GPU here. Once we call deepspeed.init_inference on this line, the model gets split across multiple GPUs.

I can see how this would be problematic if you don't have enough memory to load the full model on each GPU. We have a workaround that uses meta-tensors (like with the llm provider), but I don't think it's compatible with how we load other huggingface models. @jeffra thoughts on this?
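For reference, the meta-tensor trick can be sketched in plain PyTorch (a hypothetical illustration, not MII's actual loading code): modules created on the `meta` device carry only shape and dtype metadata, so no real memory is allocated until checkpoint weights are materialized on a target device.

```python
import torch

# Create a large layer on the "meta" device: no storage is allocated,
# only shape/dtype metadata is tracked.
layer = torch.nn.Linear(4096, 4096, device="meta")
print(layer.weight.is_meta)  # True
print(layer.weight.shape)    # torch.Size([4096, 4096])

# Real memory is only consumed later, when checkpoint shards are loaded
# directly onto their assigned GPU (as ds-inference does for BLOOM).
```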

mrwyattii avatar Oct 18 '22 21:10 mrwyattii

@mrwyattii Hi, I'm hitting CUDA OOM errors when loading an EleutherAI/gpt-neox-20b model onto 8 GPUs with TP=8, FP16. Each GPU has 23GB. Is this expected, and does this mean I should use the meta-tensors workaround you mentioned above to load this model? Thanks!
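The OOM is consistent with the full-copy loading behavior discussed earlier in this thread (a rough estimate, assuming a 20B parameter count):

```python
params = 20e9   # approximate parameter count for gpt-neox-20b
gpu_mem_gb = 23  # per-GPU memory

# Even in FP16 (2 bytes/param), a full copy of the model needs ~40GB
# per GPU, well over the 23GB available, so loading OOMs before the
# weights are ever sharded across GPUs.
fp16_full_copy_gb = params * 2 / 1e9
print(fp16_full_copy_gb)                # 40.0
print(fp16_full_copy_gb > gpu_mem_gb)   # True
```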

Tianwei-She avatar Nov 16 '22 22:11 Tianwei-She

@Tianwei-She I responded to your other issue with a solution (#99).

@volkerha I've made some changes to how we load models in #105. This doesn't completely address the issue of needing to load multiple copies of a model when using tensor parallelism, but we do have plans to address this further. I'll leave this issue open for now and file it under "Enhancement".

mrwyattii avatar Nov 21 '22 23:11 mrwyattii

#199 adds support for loading models other than BLOOM (including GPT-NeoX, GPT-J, and OPT) using meta tensors. This resolves the problem of loading the model into memory multiple times.

mrwyattii avatar Jun 01 '23 23:06 mrwyattii