[Question] How to preshard a model for tensor parallelism
Currently we are trying to run inference with a pretrained BLOOM model. However, loading takes very long because DeepSpeed shards the model at runtime. Since there is a pre-sharded version of BLOOM:
microsoft/bloom-deepspeed-inference-fp16
would it be possible to share the script that produced it, or any guidance on how to preshard a model to speed up loading?
Hi @lanking520, for loading this presharded version:

import os
import torch
import deepspeed
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "microsoft/bloom-deepspeed-inference-fp16"
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

# Download the presharded checkpoint files and point at the DS inference config
repo_root = snapshot_download(model_name)
checkpoints_json = os.path.join(repo_root, "ds_inference_config.json")

# Build the model skeleton on the meta device; the real weights are loaded
# from the sharded checkpoint inside init_inference
config = AutoConfig.from_pretrained(model_name)
with deepspeed.OnDevice(dtype=torch.half, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.half)

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    base_dir=repo_root,
    dtype=torch.half,
    checkpoint=checkpoints_json,
    replace_method="auto",
    replace_with_kernel_inject=True,
)
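This snippet reads LOCAL_RANK and WORLD_SIZE from the environment, which the deepspeed launcher sets for each process. Assuming the code is saved as bloom_inference.py (a placeholder filename), it could be launched with, for example:

deepspeed --num_gpus 8 bloom_inference.py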
@RezaYazdaniAminabadi would be able to give more details about creating presharded versions for other models.
@mrwyattii thanks for the reply. I know how to load the model, but I am wondering how we can pre-shard a model ourselves.
This also applies to other models like OPT. Loading them on CPU and sharding them takes a very long time. Maybe it could be done once: save the sharded model to disk, then skip loading it on CPU the next time. I would appreciate any instructions on how to save a sharded model with DeepSpeed.
@lanking520
Here is how I did it (if I correctly understand the issue):
say, 5-GB shards:
python -c 'from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained("model_path/model_name"); \
model.save_pretrained("/path/to/model_name", max_shard_size="5GB")'
This step will require about 2x the model size in CPU memory, and then a bit more.
Then use the resulting model like so:
python -c 'from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained("/path/to/model_name")'
Nope, this only shards the model based on its size. It doesn't determine which part of the model goes to which GPU in DeepSpeed. Tensor parallelism means vertical sharding, where the shards are defined by the TP layout as well as size. In fact, if you look at the INT8 BLOOM model, each GPU has 4 vertical shards (TP4), distributed across 8 GPUs. This cannot be done without using DeepSpeed itself.
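To make the distinction concrete, here is a minimal PyTorch sketch (purely illustrative, not DeepSpeed's actual partitioning code) of what vertical, tensor-parallel sharding of a single linear layer's weight looks like, versus size-based checkpoint sharding, which just groups whole tensors into files:

import torch

# One linear layer's weight from the full checkpoint
full_weight = torch.randn(4096, 4096)

tp_size = 4
# Vertical (tensor-parallel) split: each TP rank owns a slice of the output
# dimension and computes only its part of the matmul at inference time.
tp_shards = torch.chunk(full_weight, tp_size, dim=0)
for rank, shard in enumerate(tp_shards):
    print(f"TP rank {rank}: weight slice of shape {tuple(shard.shape)}")

# By contrast, save_pretrained(..., max_shard_size="5GB") only splits the
# checkpoint into ~5GB files of whole tensors; every GPU still ends up
# needing the complete weight.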
@RezaYazdaniAminabadi I found your PR here really helpful: https://github.com/microsoft/DeepSpeed/pull/2132 This is the key, but it is limited to the BLOOM model. Are other models also supported this way?
Hi @lanking520,
Thanks for your interest in this part. I am working on bringing this feature to the rest of the models. I will let you know once I create a PR for that. Best, Reza
@lanking520 can you try this again? #2547 should address your issue
nice, will test them today
@jeffra @RezaYazdaniAminabadi do you happen to have a code sample for an OPT model?
@lanking520 Here is a small code sample for saving a sharded OPT model:
import os
import argparse

import torch
import transformers
import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument("--save_ckpt", action="store_true")
parser.add_argument("--local_rank", type=int, default=0)
parser.add_argument("--world_size", type=int, default=int(os.getenv("WORLD_SIZE", 1)))
args = parser.parse_args()

model_name = "facebook/opt-1.3b"
inputs = ["DeepSpeed is the"]
ckpt_path = "/data/sharded-opt-model/"

inf_config = {
    "replace_with_kernel_inject": True,
    "dtype": torch.float16,
    "replace_method": "auto",
    "enable_cuda_graph": False,
    "tensor_parallel": {"tp_size": args.world_size},
}

config = transformers.AutoConfig.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

if args.save_ckpt:
    # Saving: materialize the model so DeepSpeed has real parameters to shard,
    # and tell it where to write the tensor-parallel checkpoint.
    inf_config["save_mp_checkpoint_path"] = ckpt_path
    model = transformers.AutoModelForCausalLM.from_config(
        config, torch_dtype=torch.float16
    )
else:
    # Loading: build the model on the meta device (no real data) and let
    # DeepSpeed fill in the weights from the sharded checkpoint.
    inf_config["checkpoint"] = os.path.join(ckpt_path, "ds_inference_config.json")
    with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
        model = transformers.AutoModelForCausalLM.from_config(
            config, torch_dtype=torch.float16
        )

model = deepspeed.init_inference(model, config=inf_config)

if not args.save_ckpt:
    tokens = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True)
    for t in tokens:
        if torch.is_tensor(tokens[t]):
            tokens[t] = tokens[t].to(f"cuda:{args.local_rank}")
    greedy_output = model.generate(**tokens)
    outputs = tokenizer.batch_decode(greedy_output, skip_special_tokens=True)
    if args.local_rank == 0:
        print(outputs)
To save the checkpoint:
deepspeed --num_gpus 2 example.py --save_ckpt
Verify the sharded checkpoints were created:
venv ❯ ls /data/sharded-opt-model
ds_inference_config.json tp_00_00.pt tp_00_02.pt tp_00_04.pt tp_00_06.pt tp_01_00.pt tp_01_02.pt tp_01_04.pt tp_01_06.pt
non-tp.pt tp_00_01.pt tp_00_03.pt tp_00_05.pt tp_00_07.pt tp_01_01.pt tp_01_03.pt tp_01_05.pt tp_01_07.pt
Load the sharded checkpoint and run a query:
deepspeed --num_gpus 2 example.py
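If you want to sanity-check what was written, each tp_XX_YY.pt file can be opened directly with torch.load (an illustrative snippet, assuming each shard is a plain state_dict-like mapping; the exact layout may differ between DeepSpeed versions):

import torch

shard = torch.load("/data/sharded-opt-model/tp_00_00.pt", map_location="cpu")
print(type(shard))
if isinstance(shard, dict):
    # Print a few parameter names and their (sharded) shapes
    for name, value in list(shard.items())[:5]:
        print(name, getattr(value, "shape", type(value)))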
@mrwyattii thanks for sharing. I am able to generate these files, but I cannot load them back. I tried the BLOOM way, but that doesn't work either:
def load_model():
    tensor_parallel = int(os.getenv('WORLD_SIZE', '1'))
    model_name = "path_to_model"
    deepspeed.init_distributed("nccl")
    logging.info(f"Loading the model {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    config = AutoConfig.from_pretrained(model_name)

    # Construct model with fake meta tensors, later will be replaced during ds-inference ckpt load
    with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
        model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)
    model = model.eval()
    torch.cuda.empty_cache()

    ### Deepspeed-Inference Loading
    checkpoints_json = os.path.join(model_name, "ds_inference_config.json")
    model = deepspeed.init_inference(
        model,
        mp_size=tensor_parallel,
        base_dir=model_name,  # directory containing the sharded checkpoint files
        dtype=torch.float16,
        checkpoint=checkpoints_json,
        replace_with_kernel_inject=True)
    torch.cuda.empty_cache()
    gc.collect()
    deepspeed.runtime.utils.see_memory_usage("post-ds-inference-init", force=True)

    model = model.module
    return model, tokenizer
This is the standard way of loading back a BLOOM model, but it is not working for OPT.
@lanking520 I've updated the code in my previous comment to include loading the model with deepspeed.OnDevice - Can you give it a try?
Also check out the example we have in the DeepSpeedExamples repo: https://github.com/microsoft/DeepSpeedExamples/blob/meta-inference/inference/huggingface/text-generation/inference-test.py
#2547 has some information about how to run that script.
When I used the example.py above, the model was saved correctly, but there was a problem loading it back.
I'm using deepspeed==0.7.7.
I think this problem is related to this line:
inf_config["checkpoint"] = os.path.join(ckpt_path, "ds_inference_config.json")
In ckpt_path:
ls /hdd/cache/sharded-galactica-125m/  (Galactica is also an OPT model)
ds_inference_config.json tp_00_01.pt tp_00_04.pt tp_00_07.pt tp_01_02.pt tp_01_05.pt
non-tp.pt tp_00_02.pt tp_00_05.pt tp_01_00.pt tp_01_03.pt tp_01_06.pt
tp_00_00.pt tp_00_03.pt tp_00_06.pt tp_01_01.pt tp_01_04.pt tp_01_07.pt
Traceback (most recent call last):
File "saving_sharded_model.py", line 49, in <module>
model = deepspeed.init_inference(model, config=inf_config)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 311, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 89, in __init__
self._load_checkpoint(config.checkpoint)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 411, in _load_checkpoint
load_path, checkpoint, quantize_config = sd_loader.load(self._config.tensor_parallel.tp_size,
AttributeError: 'dict' object has no attribute 'load'
(+) The OPT model could be saved and loaded without errors when the config below was given and used. However, the generated result seems to differ from the intended output.
inf_config = {
    "replace_with_kernel_inject": True,
    #"injection_policy": {OPTDecoderLayer: ('self_attn.out_proj', '.fc2')},
    "dtype": torch.float16,
    "replace_method": "auto",
    #"enable_cuda_graph": False,
    #"tensor_parallel": {"tp_size": args.world_size},
    "tensor_parallel": DeepSpeedTPConfig(enabled=True, tp_size=args.world_size),
    #"mp_size": args.world_size,
}
In: ["Deepspeed is the"]
Out: ['DeepSpeed is thePromegaPromegaPromega Mtb other other other other Mtb Mtbfortunately Mtborbed adventorbedorbed']
For the model to load correctly, injection_policy must be given explicitly; loading does not seem to work well with the current method (replace_method='auto', replace_with_kernel_inject=True).
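For reference, here is a rough sketch of what passing injection_policy directly could look like for OPT, based on the commented-out line in the config above (unverified against this DeepSpeed version; OPTDecoderLayer comes from transformers, and facebook/opt-1.3b is reused from the earlier example):

import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.models.opt.modeling_opt import OPTDecoderLayer

world_size = int(os.getenv("WORLD_SIZE", "1"))
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype=torch.float16)

# Explicit injection policy instead of replace_method="auto" + kernel injection:
# the tuple names the attention and MLP output projections, which tells DeepSpeed
# where to place the all-reduce when splitting the layer across TP ranks.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    injection_policy={OPTDecoderLayer: ("self_attn.out_proj", ".fc2")},
)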
@slrsnpdla I'm able to reproduce the bad output you are seeing. This appears to only happen for some models. I've extended the unit tests for sharded checkpoints to include a correctness test in #2643. I'm seeing failures for gpt-neo and gpt-j models as well.
@RezaYazdaniAminabadi
Also reproducible on my end; I tested OPT, GPT-Neo, and GPT-J, and they all appear broken.
Hi @lanking520,
I am working on resolving this issue. I will let you know once I have the solution tested completely. Thanks, Reza
Hi @RezaYazdaniAminabadi or @jeffra do we have any weekly/monthly community meeting for DeepSpeed? I would like to attend if there is one.
Hi @lanking520,
I have verified several model architectures with this PR, using this test suite. Everything works fine on my side. Could you please try it on your end and see if the issue is resolved? Here is how I ran the different models:
deepspeed --num_gpus N inference-test.py --ds_inference --use_kernel --name `HF-model-name` --use_meta_tensor --checkpoint_path ~/.cache/huggingface/hub/...
Regarding your last question, I don't think there is any meeting currently set up, but I think this is a great idea. I'll let @jeffra or @tjruwase chime in here; we might be able to set something up.
Thanks, Reza
Will start testing this week. Thanks @RezaYazdaniAminabadi.
Hi @mrwyattii,
Thank you for this example! I have two questions regarding your code.
The line model = transformers.AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16) runs once per process, i.e. --num_gpus times, when you are saving presharded checkpoints. For example, GPT-NeoX 20B takes about 40GB of RAM, so if you run this script with deepspeed --num_gpus 4 example.py --save_ckpt, you end up using about 4 * 40GB of RAM.
My first question is: is there a way to load the model into RAM only once instead of four times, to save RAM, while still saving presharded checkpoints?
My second question is (might be a stupid one): when you run deepspeed --num_gpus 4 example.py for inference, does the model get split into 4 pieces in terms of GPU memory usage? For example, GPT-NeoX 20B needs about 40GB; does each GPU only use about 10GB if I have 4 of them?
1. Is there a way to load the model in RAM only once instead of four times to save RAM and still save presharded checkpoints?

I don't think this is currently possible. You can load the model with meta tensors and provide the pytorch_model.bin as the checkpoint file in init_inference when creating the sharded checkpoint (see examples here), however I just did a quick test and this doesn't seem to lower the maximum memory usage. @RezaYazdaniAminabadi can you confirm this is the case?

2. When you run deepspeed --num_gpus 4 example.py for inference, does the model get split into 4 pieces in terms of GPU memory usage?

The model will be split across GPUs as you have described (but if I recall correctly, some parts of the model will be present on all GPUs). If you are seeing that this is not the case when you check memory usage via nvidia-smi, try clearing the cache after the call to init_inference and check again.
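For reference, a minimal way to check what is actually allocated on each GPU from inside the script after init_inference (illustrative; args.local_rank is the same argument used in the example above):

import torch

torch.cuda.empty_cache()  # release cached blocks so the number reflects live tensors
allocated_gb = torch.cuda.memory_allocated() / 2**30
print(f"rank {args.local_rank}: {allocated_gb:.2f} GiB allocated")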
When we took your PR and tested it with your test suite, we got the following error for both OPT 1.3B and GPT-J 6B:
Saving tp-sharded checkpoints
Loading 0 checkpoint shards: 0it [00:02, ?it/s]
Traceback (most recent call last):
File "inference-test.py", line 51, in <module>
pipe.model = deepspeed.init_inference(pipe.model,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 311, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 129, in __init__
self.module.to(device)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 927, in to
return self._apply(convert)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 925, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
@RezaYazdaniAminabadi @lekurile did you have any luck avoiding the above error? ^^
Hi @lanking520, I think the error you're seeing may come from the flag low_cpu_mem_usage=True or device_map="auto" when you call the from_pretrained() method. Either of these flags loads the model with meta weights, which are fake data with no actual values.
@Wenhan-Tan this is why the meta tensor is there:
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)
This is a must-have step for DeepSpeed checkpoint loading: you need a placeholder in place of the full model. I think this still holds with @RezaYazdaniAminabadi's commit; we need to pass the model skeleton in so that DS can equip it with the full weights. For some reason, the checkpoint weights were not picked up by DeepSpeed.
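As a small, generic PyTorch illustration (not DeepSpeed-specific) of why the placeholder alone is not enough: meta tensors only carry shape and dtype, so the moment something tries to materialize them without real weights, you get exactly the error seen above.

import torch

t = torch.empty(4, 4, device="meta")
print(t.shape, t.dtype)  # metadata is available
try:
    t.to("cpu")          # materializing needs real data, which a meta tensor has none of
except NotImplementedError as e:
    print(e)             # "Cannot copy out of meta tensor; no data!"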
@lanking520 You're right! I also ran the script, and the checkpoint weight loading worked on my machine. It makes me wonder whether the weight saving succeeded on @sindhuvahinis's machine.
@Wenhan-Tan @RezaYazdaniAminabadi I verified it again on a larger instance with 8 GPUs.
Did you test the GPT-J 6B model? I was able to generate checkpoints, but loading the generated checkpoints back throws the error below:
NotImplementedError: Cannot copy out of meta tensor; no data!
Traceback (most recent call last):
File "inference-test.py", line 57, in <module>
pipe.model = deepspeed.init_inference(pipe.model,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 311, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 129, in __init__
self.module.to(device)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 927, in to
return self._apply(convert)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 925, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
To reproduce, I am using the DeepSpeed test suite:
deepspeed --num_nodes 1 \
--num_gpus 4 \
inference-test.py \
--use_kernel \
--ds_inference \
--name EleutherAI/gpt-j-6B \
--checkpoint_path /tmp/ws/gpt-j-6B/ \
--save_mp_checkpoint_path /tmp/ws/sharded-gptj-6b/
Hi @sindhuvahinis, I didn't try GPT-J, but it did work with GPT-NeoX for me.