[BUG] Inference of a large model using NVMe offload: AssertionError: More elements 524709888 than buffer size 100,000,000
Describe the bug
I'm trying to run inference with a 54-billion-parameter model (facebook/nllb-moe-54b) using NVMe offload on my laptop, which has an RTX 3060 with 6GB of GPU memory. But I get an error: AssertionError: More elements 524709888 than buffer size 100,000,000
The full error message is:
(deepspeed) mark@lsl-pc:~/Research/accelerate/examples$ deepspeed --num_gpus 1 nllb_ZeRO_inference.py
[2023-05-10 17:29:06,119] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-10 17:29:06,127] [INFO] [runner.py:541:main] cmd = /home/mark/anaconda3/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None nllb_ZeRO_inference.py
[2023-05-10 17:29:07,275] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-10 17:29:07,275] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-10 17:29:07,275] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-10 17:29:07,275] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-10 17:29:07,275] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-10 17:29:08,390] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-05-10 17:29:08,611] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter stage3_gather_fp16_weights_on_model_save is deprecated use gather_16bit_weights_on_model_save instead
[2023-05-10 17:29:11,725] [INFO] [utils.py:30:print_object] AsyncPartitionedParameterSwapper:
[2023-05-10 17:29:11,725] [INFO] [utils.py:34:print_object] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-05-10 17:29:11,725] [INFO] [utils.py:34:print_object] aio_handle ................... <class 'async_io.aio_handle'>
[2023-05-10 17:29:11,725] [INFO] [utils.py:34:print_object] aligned_bytes ................ 1024
[2023-05-10 17:29:11,725] [INFO] [utils.py:34:print_object] aligned_elements_per_buffer .. 100000256
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] available_buffer_ids ......... [0, 1, 2, 3, 4]
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] available_numel .............. 0
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] available_params ............. set()
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] dtype ........................ torch.float16
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] elements_per_buffer .......... 100,000,000
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] id_to_path ................... {}
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] inflight_numel ............... 0
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] inflight_params .............. []
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] inflight_swap_in_buffers ..... []
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] invalid_buffer ............... 1.0
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] min_aio_bytes ................ 1048576
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] numel_alignment .............. 512
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] param_buffer_count ........... 5
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] param_id_to_buffer_id ........ {}
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] param_id_to_numel ............ {}
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] param_id_to_swap_buffer ...... {}
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] partitioned_swap_buffer ...... None
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] partitioned_swap_pool ........ None
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] pending_reads ................ 0
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] pending_writes ............... 0
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] reserved_buffer_ids .......... []
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] swap_config .................. device='nvme' nvme_path=PosixPath('/home/mark/Research/nvme_offload_path') buffer_count=5 buffer_size=100,000,000 max_in_cpu=1,000,000,000 pin_memory=True
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] swap_element_size ............ 2
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] swap_folder .................. /home/mark/Research/nvme_offload_path/zero_stage_3/float16params/rank0
[2023-05-10 17:29:11,726] [INFO] [utils.py:34:print_object] swap_out_params .............. []
[2023-05-10 17:29:11,778] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 0.52B parameters
Traceback (most recent call last):
File "/home/mark/Research/accelerate/examples/nllb_ZeRO_inference.py", line 187, in <module>
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)#, low_cpu_mem_usage=True)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
return model_class.from_pretrained(
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2629, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 382, in wrapper
f(module, *args, **kwargs)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/transformers/models/nllb_moe/modeling_nllb_moe.py", line 1658, in __init__
self.model = NllbMoeModel(config)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 382, in wrapper
f(module, *args, **kwargs)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/transformers/models/nllb_moe/modeling_nllb_moe.py", line 1517, in __init__
self.shared = nn.Embedding(vocab_size, config.d_model, padding_idx)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 389, in wrapper
self._post_init_method(module)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 822, in _post_init_method
param.partition()
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 948, in partition
self._partition(param_list, has_been_updated=has_been_updated)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1086, in _partition
self._partition_param(param, has_been_updated=has_been_updated)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1136, in _partition_param
buffer = self.param_swapper.get_buffer(param, partition_size)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 343, in get_buffer
assert numel < self.elements_per_buffer, f"More elements {numel} than buffer size {self.elements_per_buffer}"
AssertionError: More elements 524709888 than buffer size 100,000,000
[2023-05-10 17:29:13,282] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17152
[2023-05-10 17:29:13,282] [ERROR] [launch.py:434:sigkill_handler] ['/home/mark/anaconda3/envs/deepspeed/bin/python', '-u', 'nllb_ZeRO_inference.py', '--local_rank=0'] exits with return code = 1
The ds_config is:
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/home/mark/Research/nvme_offload_path",
            "buffer_count": 6,
            "buffer_size": 6e8,
            "max_in_cpu": 1e9
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.1 * model_hidden_size * model_hidden_size,
        "stage3_max_live_parameters": 1e8,
        "stage3_max_reuse_distance": 1e8,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "aio": {
        "block_size": 262144,
        "queue_depth": 32,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
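From the traceback, the assertion fires while partitioning the shared embedding (self.shared = nn.Embedding(vocab_size, config.d_model, padding_idx)), so I assume offload_param.buffer_size has to be at least the number of elements in that single weight. A rough sketch of how I estimate the minimum (my own assumption about what the assert in get_buffer checks, not something from the DeepSpeed docs):

from transformers import AutoConfig

# My assumption: each partitioned parameter has to fit into one swap buffer,
# so offload_param.buffer_size must exceed the numel of the largest weight.
config = AutoConfig.from_pretrained("facebook/nllb-moe-54b")

# The shared embedding from the traceback is vocab_size x d_model elements.
embedding_numel = config.vocab_size * config.d_model
print(f"shared embedding numel: {embedding_numel:,}")  # should match the 524709888 in the error

# Add some headroom; if that number is right, a buffer_size around 6e8 elements would cover it.
min_buffer_size = int(embedding_numel * 1.1)
print(f"minimum buffer_size I think is needed: {min_buffer_size:,}")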
Then I tried changing "buffer_size" under "offload_param" in ds_config from 1e8 to 6e8, but then I got a 'CUDA out of memory' error:
[2023-05-10 17:55:31,469] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
File "/home/mark/Research/accelerate/examples/nllb_ZeRO_inference.py", line 187, in <module>
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)#, low_cpu_mem_usage=True)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
return model_class.from_pretrained(
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2624, in from_pretrained
init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 762, in __init__
self.param_swapper = AsyncPartitionedParameterSwapper(_ds_config, self.dtype)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 45, in __init__
self._configure_aio(ds_config)
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 107, in _configure_aio
self.buffers = get_accelerator().pin_memory(
File "/home/mark/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/accelerator/cuda_accelerator.py", line 217, in pin_memory
return tensor.pin_memory()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2023-05-10 17:55:35,370] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 19108
[2023-05-10 17:55:35,371] [ERROR] [launch.py:434:sigkill_handler] ['/home/mark/anaconda3/envs/deepspeed/bin/python', '-u', 'nllb_ZeRO_inference.py', '--local_rank=0'] exits with return code = 1
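If I understand partitioned_param_swapper.py correctly, this second failure happens while pinning the swap buffers in host memory, so my guess is that the larger buffers simply need more pinned RAM than I have free. Rough arithmetic (my own assumption: buffer_count buffers of buffer_size fp16 elements, 2 bytes each):

# My assumption: the swapper pins buffer_count buffers of buffer_size elements
# each in host memory, at 2 bytes per fp16 element.
buffer_count = 6
element_bytes = 2  # torch.float16

for buffer_size in (1e8, 6e8):
    pinned_gib = buffer_count * buffer_size * element_bytes / 1024**3
    print(f"buffer_size={buffer_size:.0e}: ~{pinned_gib:.1f} GiB pinned host memory")
# buffer_size=1e+08: ~1.1 GiB
# buffer_size=6e+08: ~6.7 GiB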
If I had a GPU with more memory it might work, but how can I run NVMe offload on this 6GB GPU?
To Reproduce
Steps to reproduce the behavior: my script is modified from https://github.com/huggingface/transformers/issues/16616. I run it with 'deepspeed --num_gpus 1 nllb_ZeRO_inference.py'. Here is the script:
#!/usr/bin/env python
# from: https://github.com/huggingface/transformers/issues/16616
# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it. or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py
from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os
import torch
os.environ["TOKENIZERS_PARALLELISM"] = "False" # To avoid warnings about parallelism in tokenizers
# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()
# model_name = "bigscience/T0"
# model_name = "bigscience/T0_3B"
model_name = "facebook/nllb-moe-54b"
config = AutoConfig.from_pretrained(model_name)
model_hidden_size = config.d_model
# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size
# ds_config notes
#
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
#
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section if you don't
# - want CPU offload
#
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
# - which params should remain on gpus - the larger the value the smaller the offload size
#
# For indepth info on Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed
# XXX: modified this script to use nvme offload so need to explain the new configs, but the key is
# to change the path to `nvme_path`
# keeping the same format as json for consistency, except it uses lower case for True/False
# fmt: off
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/home/mark/Research/nvme_offload_path",
            "buffer_count": 6,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.1 * model_hidden_size * model_hidden_size,
        "stage3_max_live_parameters": 1e8,
        "stage3_max_reuse_distance": 1e8,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "aio": {
        "block_size": 262144,
        "queue_depth": 32,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on
# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
# now a model can be loaded.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)#, low_cpu_mem_usage=True)
# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval() # inference
# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "what do you think of president Obama?"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(text_in, return_tensors="pt", padding=True).to(device=local_rank)
#from transformers.deepspeed import is_deepspeed_zero3_enabled
#print(f"Deepspeed 3 is enabled: {is_deepspeed_zero3_enabled()}")
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n in={text_in}\n out={text_out}")
Screenshots
System info (please complete the following information):
- OS: Ubuntu 22.04
- Memory: 16GB RAM and 10GB swap
- GPU count and types: 1x RTX 3060 Laptop GPU with 6GB memory
- (if applicable) what DeepSpeed-MII version are you using
- (if applicable) Hugging Face Transformers/Accelerate/etc. versions
- Python version: 3.10