[Question] How to preshard a model for tensor parallelism
Currently we are trying to run inference with a pretrained BLOOM model. However, loading takes very long because DeepSpeed shards the model at runtime. Since there is a pre-sharded version of BLOOM:
microsoft/bloom-deepspeed-inference-fp16
would it be possible to share the script that produced it, or any guidance on how to preshard a model to speed up loading?
Hi @lanking520, for loading this presharded version:

import os
import torch
import deepspeed
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "microsoft/bloom-deepspeed-inference-fp16"
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

# Download the presharded checkpoint files and point at the DS inference config
repo_root = snapshot_download(model_name)
checkpoints_json = os.path.join(repo_root, "ds_inference_config.json")

# Build the model skeleton on the meta device; the real weights are loaded
# from the sharded checkpoint inside init_inference
config = AutoConfig.from_pretrained(model_name)
with deepspeed.OnDevice(dtype=torch.half, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.half)

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    base_dir=repo_root,
    dtype=torch.half,
    checkpoint=checkpoints_json,
    replace_method="auto",
    replace_with_kernel_inject=True,
)
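This snippet reads LOCAL_RANK and WORLD_SIZE from the environment, which the deepspeed launcher sets for each process. Assuming the code is saved as bloom_inference.py (a placeholder filename), it could be launched with, for example:

deepspeed --num_gpus 8 bloom_inference.py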
@RezaYazdaniAminabadi would be able to give more details about creating presharded versions for other models.
@mrwyattii thanks for the reply. I know how to load the model, but I am wondering how we can pre-shard a model ourselves.
This also applies to other models like OPT. Loading them on CPU and sharding them takes a very long time. Maybe it could be done once: save the sharded model to disk, then skip loading it on CPU the next time. I would appreciate any instructions on how to save a sharded model with DeepSpeed.
@lanking520
Here is how I did it (if I correctly understand the issue):
say, 5-GB shards:
python -c 'from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained("model_path/model_name"); \
model.save_pretrained("/path/to/model_name", max_shard_size="5GB")'
This step will require about 2x the model size in CPU memory, and then a bit more.
Then use the resulting model like so:
python -c 'from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained("/path/to/model_name")'
Nope, this only shards the model based on its size. It doesn't determine which part of the model goes to which GPU in DeepSpeed. Tensor parallelism means vertical sharding, where the shards are defined by the TP layout as well as size. In fact, if you look at the INT8 BLOOM model, each GPU has 4 vertical shards (TP4), distributed across 8 GPUs. This cannot be done without using DeepSpeed itself.
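To make the distinction concrete, here is a minimal PyTorch sketch (purely illustrative, not DeepSpeed's actual partitioning code) of what vertical, tensor-parallel sharding of a single linear layer's weight looks like, versus size-based checkpoint sharding, which just groups whole tensors into files:

import torch

# One linear layer's weight from the full checkpoint
full_weight = torch.randn(4096, 4096)

tp_size = 4
# Vertical (tensor-parallel) split: each TP rank owns a slice of the output
# dimension and computes only its part of the matmul at inference time.
tp_shards = torch.chunk(full_weight, tp_size, dim=0)
for rank, shard in enumerate(tp_shards):
    print(f"TP rank {rank}: weight slice of shape {tuple(shard.shape)}")

# By contrast, save_pretrained(..., max_shard_size="5GB") only splits the
# checkpoint into ~5GB files of whole tensors; every GPU still ends up
# needing the complete weight.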
@RezaYazdaniAminabadi I found your PR here really helpful: https://github.com/microsoft/DeepSpeed/pull/2132 This is the key, but it is limited to the BLOOM model. Are other models also supported this way?
Hi @lanking520,
Thanks for your interest in this part. I am working on bringing this feature to the rest of the models. I will let you know once I create a PR for that. Best, Reza
@lanking520 can you try this again? #2547 should address your issue
nice, will test them today
@jeffra @RezaYazdaniAminabadi do you happen to have a code sample for an OPT model?
@lanking520 Here is a small code sample for saving a sharded OPT model:
import os
import argparse

import torch
import transformers
import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument("--save_ckpt", action="store_true")
parser.add_argument("--local_rank", type=int, default=0)
parser.add_argument("--world_size", type=int, default=int(os.getenv("WORLD_SIZE", 1)))
args = parser.parse_args()

model_name = "facebook/opt-1.3b"
inputs = ["DeepSpeed is the"]
ckpt_path = "/data/sharded-opt-model/"

inf_config = {
    "replace_with_kernel_inject": True,
    "dtype": torch.float16,
    "replace_method": "auto",
    "enable_cuda_graph": False,
    "tensor_parallel": {"tp_size": args.world_size},
}

config = transformers.AutoConfig.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

if args.save_ckpt:
    # Saving: materialize the model so DeepSpeed has real parameters to shard,
    # and tell it where to write the tensor-parallel checkpoint.
    inf_config["save_mp_checkpoint_path"] = ckpt_path
    model = transformers.AutoModelForCausalLM.from_config(
        config, torch_dtype=torch.float16
    )
else:
    # Loading: build the model on the meta device (no real data) and let
    # DeepSpeed fill in the weights from the sharded checkpoint.
    inf_config["checkpoint"] = os.path.join(ckpt_path, "ds_inference_config.json")
    with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
        model = transformers.AutoModelForCausalLM.from_config(
            config, torch_dtype=torch.float16
        )

model = deepspeed.init_inference(model, config=inf_config)

if not args.save_ckpt:
    tokens = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True)
    for t in tokens:
        if torch.is_tensor(tokens[t]):
            tokens[t] = tokens[t].to(f"cuda:{args.local_rank}")
    greedy_output = model.generate(**tokens)
    outputs = tokenizer.batch_decode(greedy_output, skip_special_tokens=True)
    if args.local_rank == 0:
        print(outputs)
To save the checkpoint:
deepspeed --num_gpus 2 example.py --save_ckpt
Verify the sharded checkpoints were created:
venv ❯ ls /data/sharded-opt-model
ds_inference_config.json tp_00_00.pt tp_00_02.pt tp_00_04.pt tp_00_06.pt tp_01_00.pt tp_01_02.pt tp_01_04.pt tp_01_06.pt
non-tp.pt tp_00_01.pt tp_00_03.pt tp_00_05.pt tp_00_07.pt tp_01_01.pt tp_01_03.pt tp_01_05.pt tp_01_07.pt
Load the sharded checkpoint and run a query:
deepspeed --num_gpus 2 example.py
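If you want to sanity-check what was written, each tp_XX_YY.pt file can be opened directly with torch.load (an illustrative snippet, assuming each shard is a plain state_dict-like mapping; the exact layout may differ between DeepSpeed versions):

import torch

shard = torch.load("/data/sharded-opt-model/tp_00_00.pt", map_location="cpu")
print(type(shard))
if isinstance(shard, dict):
    # Print a few parameter names and their (sharded) shapes
    for name, value in list(shard.items())[:5]:
        print(name, getattr(value, "shape", type(value)))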
@mrwyattii thanks for sharing. I am able to generate these files, but I cannot load them back. I tried the BLOOM way, but that doesn't work either:
def load_model():
    tensor_parallel = int(os.getenv('WORLD_SIZE', '1'))
    model_name = "path_to_model"
    deepspeed.init_distributed("nccl")
    logging.info(f"Loading the model {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    config = AutoConfig.from_pretrained(model_name)

    # Construct model with fake meta tensors, later will be replaced during ds-inference ckpt load
    with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
        model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)
    model = model.eval()
    torch.cuda.empty_cache()

    ### Deepspeed-Inference Loading
    checkpoints_json = os.path.join(model_name, "ds_inference_config.json")
    model = deepspeed.init_inference(
        model,
        mp_size=tensor_parallel,
        base_dir=model_name,  # directory containing the sharded checkpoint files
        dtype=torch.float16,
        checkpoint=checkpoints_json,
        replace_with_kernel_inject=True)
    torch.cuda.empty_cache()
    gc.collect()
    deepspeed.runtime.utils.see_memory_usage("post-ds-inference-init", force=True)

    model = model.module
    return model, tokenizer
This is the standard way of loading back a BLOOM model, but it is not working for OPT.
@lanking520 I've updated the code in my previous comment to include loading the model with deepspeed.OnDevice - Can you give it a try?
Also check out the example we have in the DeepSpeedExamples repo: https://github.com/microsoft/DeepSpeedExamples/blob/meta-inference/inference/huggingface/text-generation/inference-test.py
#2547 has some information about how to run that script.
When I used the example.py above, the model was saved correctly, but there was a problem loading it back.
I'm using deepspeed==0.7.7.
I think this problem is related to this line:
inf_config["checkpoint"] = os.path.join(ckpt_path, "ds_inference_config.json")
In ckpt_path:
ls /hdd/cache/sharded-galactica-125m/  (Galactica is also an OPT model)
ds_inference_config.json tp_00_01.pt tp_00_04.pt tp_00_07.pt tp_01_02.pt tp_01_05.pt
non-tp.pt tp_00_02.pt tp_00_05.pt tp_01_00.pt tp_01_03.pt tp_01_06.pt
tp_00_00.pt tp_00_03.pt tp_00_06.pt tp_01_01.pt tp_01_04.pt tp_01_07.pt
Traceback (most recent call last):
File "saving_sharded_model.py", line 49, in <module>
model = deepspeed.init_inference(model, config=inf_config)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 311, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 89, in __init__
self._load_checkpoint(config.checkpoint)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 411, in _load_checkpoint
load_path, checkpoint, quantize_config = sd_loader.load(self._config.tensor_parallel.tp_size,
AttributeError: 'dict' object has no attribute 'load'
(+) The OPT model could be saved and loaded without errors when the config below was given and used. However, the generated result seems to differ from the intended output.
inf_config = {
    "replace_with_kernel_inject": True,
    #"injection_policy": {OPTDecoderLayer: ('self_attn.out_proj', '.fc2')},
    "dtype": torch.float16,
    "replace_method": "auto",
    #"enable_cuda_graph": False,
    #"tensor_parallel": {"tp_size": args.world_size},
    "tensor_parallel": DeepSpeedTPConfig(enabled=True, tp_size=args.world_size),
    #"mp_size": args.world_size,
}
In: ["Deepspeed is the"]
Out: ['DeepSpeed is thePromegaPromegaPromega Mtb other other other other Mtb Mtbfortunately Mtborbed adventorbedorbed']
For the model to load correctly, injection_policy must be given explicitly; loading does not seem to work well with the current method (replace_method='auto', replace_with_kernel_inject=True).
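For reference, here is a rough sketch of what passing injection_policy directly could look like for OPT, based on the commented-out line in the config above (unverified against this DeepSpeed version; OPTDecoderLayer comes from transformers, and facebook/opt-1.3b is reused from the earlier example):

import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.models.opt.modeling_opt import OPTDecoderLayer

world_size = int(os.getenv("WORLD_SIZE", "1"))
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype=torch.float16)

# Explicit injection policy instead of replace_method="auto" + kernel injection:
# the tuple names the attention and MLP output projections, which tells DeepSpeed
# where to place the all-reduce when splitting the layer across TP ranks.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    injection_policy={OPTDecoderLayer: ("self_attn.out_proj", ".fc2")},
)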
@slrsnpdla I'm able to reproduce the bad output you are seeing. This appears to only happen for some models. I've extended the unit tests for sharded checkpoints to include a correctness test in #2643. I'm seeing failures for gpt-neo and gpt-j models as well.
@RezaYazdaniAminabadi
Also reproducible on my end; I tested OPT, GPT-Neo, and GPT-J, and they all appear broken.
Hi @lanking520,
I am working on resolving this issue. I will let you know once I have the solution tested completely. Thanks, Reza
Hi @RezaYazdaniAminabadi or @jeffra do we have any weekly/monthly community meeting for DeepSpeed? I would like to attend if there is one.
Hi @lanking520,
I have verified several model architectures with this PR, using this test suite. Everything works fine on my side. Could you please try it on your end and see if the issue is resolved? Here is how I ran the different models:
deepspeed --num_gpus N inference-test.py --ds_inference --use_kernel --name `HF-model-name` --use_meta_tensor --checkpoint_path ~/.cache/huggingface/hub/...
Regarding your last question, I don't think there is any meeting currently set up, but I think this is a great idea. I'll let @jeffra or @tjruwase chime in here; we might be able to set something up.
Thanks, Reza
Will start testing this week. Thanks @RezaYazdaniAminabadi.
Hi @mrwyattii,
Thank you for this example! I have two questions regarding your code.
The line model = transformers.AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16) runs once per process, i.e. --num_gpus times, when you are saving presharded checkpoints. For example, GPT-NeoX 20B takes about 40GB of RAM, so if you run this script with deepspeed --num_gpus 4 example.py --save_ckpt, you end up using about 4 * 40GB of RAM.
My first question is: is there a way to load the model into RAM only once instead of four times, to save RAM, while still saving presharded checkpoints?
My second question is (might be a stupid one): when you run deepspeed --num_gpus 4 example.py for inference, does the model get split into 4 pieces in terms of GPU memory usage? For example, GPT-NeoX 20B needs about 40GB; does each GPU only use about 10GB if I have 4 of them?
1. Is there a way to load the model in RAM only once instead of four times to save RAM and still save presharded checkpoints?

I don't think this is currently possible. You can load the model with meta tensors and provide the pytorch_model.bin as the checkpoint file in init_inference when creating the sharded checkpoint (see examples here), however I just did a quick test and this doesn't seem to lower the maximum memory usage. @RezaYazdaniAminabadi can you confirm this is the case?

2. When you run deepspeed --num_gpus 4 example.py for inference, does the model get split into 4 pieces in terms of GPU memory usage?

The model will be split across GPUs as you have described (but if I recall correctly, some parts of the model will be present on all GPUs). If you are seeing that this is not the case when you check memory usage via nvidia-smi, try clearing the cache after the call to init_inference and check again.
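For reference, a minimal way to check what is actually allocated on each GPU from inside the script after init_inference (illustrative; args.local_rank is the same argument used in the example above):

import torch

torch.cuda.empty_cache()  # release cached blocks so the number reflects live tensors
allocated_gb = torch.cuda.memory_allocated() / 2**30
print(f"rank {args.local_rank}: {allocated_gb:.2f} GiB allocated")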
When we took your PR and tested it with your test suite, we got the following error for both OPT 1.3B and GPT-J 6B:
Saving tp-sharded checkpoints
Loading 0 checkpoint shards: 0it [00:02, ?it/s]
Traceback (most recent call last):
File "inference-test.py", line 51, in <module>
pipe.model = deepspeed.init_inference(pipe.model,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 311, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 129, in __init__
self.module.to(device)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 927, in to
return self._apply(convert)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 925, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
@RezaYazdaniAminabadi @lekurile did you have any luck avoiding the above error? ^^
Hi @lanking520, I think the error you're seeing may come from the flag low_cpu_mem_usage=True or device_map="auto" when you call the from_pretrained() method. Either of these flags loads the model with meta weights, which are fake data with no actual values.
@Wenhan-Tan this is why the meta tensor is there:
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)
This is a must-have step for DeepSpeed checkpoint loading: you need a placeholder in place of the full model. I think this still holds with @RezaYazdaniAminabadi's commit; we need to pass the model skeleton in so that DS can equip it with the full weights. For some reason, the checkpoint weights were not picked up by DeepSpeed.
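As a small, generic PyTorch illustration (not DeepSpeed-specific) of why the placeholder alone is not enough: meta tensors only carry shape and dtype, so the moment something tries to materialize them without real weights, you get exactly the error seen above.

import torch

t = torch.empty(4, 4, device="meta")
print(t.shape, t.dtype)  # metadata is available
try:
    t.to("cpu")          # materializing needs real data, which a meta tensor has none of
except NotImplementedError as e:
    print(e)             # "Cannot copy out of meta tensor; no data!"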
@lanking520 You're right! I also ran the script, and the checkpoint weight loading worked on my machine. It makes me wonder whether the weight saving succeeded on @sindhuvahinis's machine.
@Wenhan-Tan @RezaYazdaniAminabadi I verified it again on a larger instance with 8 GPUs.
Did you test the GPT-J 6B model? I was able to generate checkpoints, but loading the generated checkpoints back throws the error below:
NotImplementedError: Cannot copy out of meta tensor; no data!
Traceback (most recent call last):
File "inference-test.py", line 57, in <module>
pipe.model = deepspeed.init_inference(pipe.model,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 311, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 129, in __init__
self.module.to(device)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 927, in to
return self._apply(convert)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 925, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
To reproduce, I am using the DeepSpeed test suite:
deepspeed --num_nodes 1 \
--num_gpus 4 \
inference-test.py \
--use_kernel \
--ds_inference \
--name EleutherAI/gpt-j-6B \
--checkpoint_path /tmp/ws/gpt-j-6B/ \
--save_mp_checkpoint_path /tmp/ws/sharded-gptj-6b/
Hi @sindhuvahinis, I didn't try GPT-J, but it did work with GPT-NeoX for me.