DeepSpeed
[BUG] OOM error during setup, but the OS GPU memory monitor does not show an OOM condition
Describe the bug
When attempting to run this I get an OOM:
accelerate launch train_dreambooth.py --pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4 --instance_data_dir=./inputs --output_dir=./outputs --instance_prompt="a photo of sks dog" --resolution=512 --train_batch_size=1 --gradient_accumulation_steps=1 --learning_rate=5e-6 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=400
Loading extension module utils...
Time to load utils op: 19.92158341407776 seconds
Rank: 0 partition count [1] and sizes[(859520964, False)]
[2023-01-07 20:45:30,471] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2023-01-07 20:45:30,472] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB Max_MA 1.66 GB CA 3.27 GB Max_CA 3 GB
[2023-01-07 20:45:30,472] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 45.9 GB, percent = 73.2%
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/*****/diffusers/examples/dreambooth/train_dreambooth.py:828 in <module> │
│ │
│ 825 │
│ 826 if __name__ == "__main__": │
│ 827 │ args = parse_args() │
│ ❱ 828 │ main(args) │
│ 829 │
│ │
│ /home/*****/diffusers/examples/dreambooth/train_dreambooth.py:657 in main │
│ │
│ 654 │ │ │ unet, text_encoder, optimizer, train_dataloader, lr_scheduler │
│ 655 │ │ ) │
│ 656 │ else: │
│ ❱ 657 │ │ unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare( │
│ 658 │ │ │ unet, optimizer, train_dataloader, lr_scheduler │
│ 659 │ │ ) │
│ 660 │
│ │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/accelerate/accelerator.py:872 in │
│ prepare │
│ │
│ 869 │ │ │ old_named_params = self._get_named_parameters(*args) │
│ 870 │ │ │
│ 871 │ │ if self.distributed_type == DistributedType.DEEPSPEED: │
│ ❱ 872 │ │ │ result = self._prepare_deepspeed(*args) │
│ 873 │ │ elif self.distributed_type == DistributedType.MEGATRON_LM: │
│ 874 │ │ │ result = self._prepare_megatron_lm(*args) │
│ 875 │ │ else: │
│ │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/accelerate/accelerator.py:1093 │
│ in _prepare_deepspeed │
│ │
│ 1090 │ │ │ │ │ │ if type(scheduler).__name__ in deepspeed.runtime.lr_schedules.VA │
│ 1091 │ │ │ │ │ │ │ kwargs["lr_scheduler"] = scheduler │
│ 1092 │ │ │ │
│ ❱ 1093 │ │ │ engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs) │
│ 1094 │ │ │ if optimizer is not None: │
│ 1095 │ │ │ │ optimizer = DeepSpeedOptimizerWrapper(optimizer) │
│ 1096 │ │ │ if scheduler is not None: │
│ │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed/__init__.py:125 in │
│ initialize │
│ │
│ 122 │ assert model is not None, "deepspeed.initialize requires a model" │
│ 123 │ │
│ 124 │ if not isinstance(model, PipelineModule): │
│ ❱ 125 │ │ engine = DeepSpeedEngine(args=args, │
│ 126 │ │ │ │ │ │ │ │ model=model, │
│ 127 │ │ │ │ │ │ │ │ optimizer=optimizer, │
│ 128 │ │ │ │ │ │ │ │ model_parameters=model_parameters, │
│ │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed/runtime/engine.py:330 │
│ in __init__ │
│ │
│ 327 │ │ │ model_parameters = self.module.parameters() │
│ 328 │ │ │
│ 329 │ │ if has_optimizer: │
│ ❱ 330 │ │ │ self._configure_optimizer(optimizer, model_parameters) │
│ 331 │ │ │ self._configure_lr_scheduler(lr_scheduler) │
│ 332 │ │ │ self._report_progress(0) │
│ 333 │ │ elif self.zero_optimization(): │
│ │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1210 │
│ in _configure_optimizer │
│ │
│ 1207 │ │ optimizer_wrapper = self._do_optimizer_sanity_check(basic_optimizer) │
│ 1208 │ │ │
│ 1209 │ │ if optimizer_wrapper == ZERO_OPTIMIZATION: │
│ ❱ 1210 │ │ │ self.optimizer = self._configure_zero_optimizer(basic_optimizer) │
│ 1211 │ │ elif optimizer_wrapper == AMP: │
│ 1212 │ │ │ amp_params = self.amp_params() │
│ 1213 │ │ │ log_dist(f"Initializing AMP with these params: {amp_params}", ranks=[0]) │
│ │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1455 │
│ in _configure_zero_optimizer │
│ │
│ 1452 │ │ │ │ │ │ "Pipeline parallelism does not support overlapped communication, │
│ 1453 │ │ │ │ │ ) │
│ 1454 │ │ │ │ │ overlap_comm = False │
│ ❱ 1455 │ │ │ optimizer = DeepSpeedZeroOptimizer( │
│ 1456 │ │ │ │ optimizer, │
│ 1457 │ │ │ │ self.param_names, │
│ 1458 │ │ │ │ timers=timers, │
│ │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_a │
│ nd_2.py:521 in __init__ │
│ │
│ 518 │ │ │ self.dynamic_loss_scale = True │
│ 519 │ │ │
│ 520 │ │ see_memory_usage("Before initializing optimizer states", force=True) │
│ ❱ 521 │ │ self.initialize_optimizer_states() │
│ 522 │ │ see_memory_usage("After initializing optimizer states", force=True) │
│ 523 │ │ │
│ 524 │ │ if dist.get_rank() == 0: │
│ │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_a │
│ nd_2.py:644 in initialize_optimizer_states │
│ │
│ 641 │ │ │ │ dtype=self.single_partition_of_fp32_groups[i].dtype, │
│ 642 │ │ │ │ device=self.device) │
│ 643 │ │ │ self.single_partition_of_fp32_groups[ │
│ ❱ 644 │ │ │ │ i].grad = single_grad_partition.pin_memory( │
│ 645 │ │ │ │ ) if self.cpu_offload else single_grad_partition │
│ 646 │ │ │
│ 647 │ │ self.optimizer.step() │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: out of memory
Expected behavior
I'd expect the training app to run, and for the GPU memory to be properly managed.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/torch']
torch version .................... 1.13.0
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed install path ........... ['/home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.7.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6
Screenshots
System info (please complete the following information):
- OS: Windows 10 WSL2 5.15.79.1-microsoft-standard-WSL2
- GPU: 1 x RTX 3060
- Python version: 3.10.8
Launcher context: WSL2
I have the same issue. I can train T5-large on a single 3090 Ti with batch_size=4 without OOM. However, when I try to use DeepSpeed to speed up training (still on the single 3090 Ti), it reports OOM even with batch_size=1, yet the memory monitor does not show an OOM either.
I followed the docs from HuggingFace. I'm also using WSL2 on Windows.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/torch']
torch version .................... 1.10.0
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.6.0+9f7126fc, 9f7126fc, HEAD
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.3, hip 0.0
python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.9.15 (main, Nov 24 2022, 14:31:59) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 Ti
Nvidia driver version: 527.41
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==1.10.0
[pip3] torchaudio==0.10.0
[pip3] torchvision==0.11.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.3.1 h9edb442_10 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py39h7e14d7c_0 conda-forge
[conda] mkl_fft 1.3.1 py39h0c7bc48_1 conda-forge
[conda] mkl_random 1.2.2 py39hde0f152_0 conda-forge
[conda] numpy 1.23.5 py39h14f4228_0
[conda] numpy-base 1.23.5 py39h31eccc5_0
[conda] pytorch 1.10.0 py3.9_cuda11.3_cudnn8.2.0_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchaudio 0.10.0 py39_cu113 pytorch
[conda] torchvision 0.11.0 py39_cu113 pytorch
stage2.config
{
"bfloat16": {
"enabled": "auto"
},
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"steps_per_print": 1e5
}
Error Log
[INFO|modeling_utils.py:2065] 2023-01-08 13:39:15,351 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at google/t5-large-lm-adapt.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 783M total params, 32M largest layer params.
per CPU | per GPU | Options
19.69GB | 0.12GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
19.69GB | 0.12GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
17.50GB | 1.58GB | offload_param=none, offload_optimizer=cpu , zero_init=1
17.50GB | 1.58GB | offload_param=none, offload_optimizer=cpu , zero_init=0
0.18GB | 13.25GB | offload_param=none, offload_optimizer=none, zero_init=1
4.38GB | 13.25GB | offload_param=none, offload_optimizer=none, zero_init=0
[INFO|trainer.py:453] 2023-01-08 13:39:15,832 >> Using amp half precision backend
[2023-01-08 13:39:15,834] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.0+9f7126fc, git-hash=9f7126fc, git-branch=HEAD
[2023-01-08 13:39:18,120] [INFO] [engine.py:277:__init__] DeepSpeed Flops Profiler Enabled: False
Using /home/zihan/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/zihan/.cache/torch_extensions/py39_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8313558101654053 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000010, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2023-01-08 13:39:21,928] [INFO] [engine.py:1064:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-01-08 13:39:21,946] [INFO] [engine.py:1072:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-01-08 13:39:21,946] [INFO] [utils.py:48:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-01-08 13:39:21,946] [INFO] [logging.py:69:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2023-01-08 13:39:21,947] [INFO] [stage_1_and_2.py:125:__init__] Reduce bucket size 200000000.0
[2023-01-08 13:39:21,947] [INFO] [stage_1_and_2.py:126:__init__] Allgather bucket size 200000000.0
[2023-01-08 13:39:21,947] [INFO] [stage_1_and_2.py:127:__init__] CPU Offload: True
[2023-01-08 13:39:21,947] [INFO] [stage_1_and_2.py:128:__init__] Round robin gradient partitioning: False
Using /home/zihan/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Emitting ninja build file /home/zihan/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.12202024459838867 seconds
Rank: 0 partition count [1] and sizes[(783092736, False)]
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/THC/THCCachingHostAllocator.cpp line=280 error=2 : out of memory
Traceback (most recent call last):
File "/home/zihan/project/continual_learning/my_tk_instruct/Tk-Instruct/src/run_s2s.py", line 623, in <module>
main()
File "/home/zihan/project/continual_learning/my_tk_instruct/Tk-Instruct/src/run_s2s.py", line 527, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/transformers/trainer.py", line 1255, in train
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/transformers/deepspeed.py", line 432, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/deepspeed/__init__.py", line 119, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 293, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1088, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1309, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 424, in __init__
self.temp_grad_buffer_for_cpu_offload = torch.zeros(
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/THC/THCCachingHostAllocator.cpp:280
[2023-01-08 13:39:24,964] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 5043
Did you solve the problem? I'm facing a similar problem here...
@lavaaa7, can you please try with the latest DeepSpeed? I notice v0.6.0 in the log. Thanks!
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:462 in __init__ │
│ │
│ 459 │ │ │ #print(torch.cuda.memory.memory_summary()) │
│ 460 │ │ │ print(torch.cuda.memory_allocated()/1024/1024, torch.cuda.max_memory_allocat │
│ 461 │ │ │ print(get_accelerator(), self.device, largest_param_numel/1024/1024, self.dt │
│ ❱ 462 │ │ │ self.temp_grad_buffer_for_cpu_offload = get_accelerator().pin_memory( │
│ 463 │ │ │ │ torch.zeros(largest_param_numel, │
│ 464 │ │ │ │ │ │ │ device=self.device, │
│ 465 │ │ │ │ │ │ │ dtype=self.dtype)) │
<deepspeed.accelerator.cuda_accelerator.CUDA_Accelerator object at 0x7f388086ef70> device: cpu largest_param_numel:38597376 type: torch.float16
I hit the same error when running this code. Help please.
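For reference, a minimal standalone sketch (not from the thread; the tensor size and dtype are taken from the printout above) of the allocation that is failing here: ZeRO stage 2 with CPU offload zero-fills an fp16 buffer on the CPU and then page-locks it with pin_memory(), and on WSL2 that pinning step can raise a CUDA out-of-memory error even when plenty of host RAM appears free.

import torch

# Hypothetical repro of the temp_grad_buffer_for_cpu_offload allocation above;
# the size is the largest_param_numel printed in the log, not DeepSpeed's code.
largest_param_numel = 38_597_376
buf = torch.zeros(largest_param_numel, device="cpu", dtype=torch.float16)
buf = buf.pin_memory()  # page-locks roughly 74 MB of host memory
print(buf.is_pinned(), buf.numel() * buf.element_size() / 2**20, "MiB pinned")

If this snippet alone reproduces the error inside WSL2, the problem is the pinned-memory allocation itself rather than anything specific to DeepSpeed.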
@lpty, can you please share a stack trace?
Based on the original stack trace, it seems the OOM is in CPU memory. The screenshot below shows 73% CPU memory usage very early on.
Can you try not using offload by removing "offload_optimizer" from ds_config?
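For example (a sketch only, assuming the rest of stage2.config stays unchanged), the zero_optimization section without optimizer offload would look like:

"zero_optimization": {
  "stage": 2,
  "allgather_partitions": true,
  "allgather_bucket_size": 2e8,
  "overlap_comm": true,
  "reduce_scatter": true,
  "reduce_bucket_size": 2e8,
  "contiguous_gradients": true
}

With offload removed, optimizer states stay on the GPU, so this trades the pinned CPU memory allocation for extra GPU memory usage.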
@tjruwase
CPU/GPU memory is enough for this job; I am running the gpt-neo-125M model on an NVIDIA 3090 (24 GB) and CPU memory is 64 GB.
System
Windows 10 with an NVIDIA 3090 (24 GB) and 64 GB of CPU memory.
docker image
REPOSITORY TAG IMAGE ID CREATED SIZE
deepspeed/deepspeed v072_torch112_cu117 b1d3268ea315 6 months ago 14.7GB
docker run --gpus all --name deepspeed -p 10022:22 --ipc=host --shm-size=16g --ulimit memlock=-1 --privileged -v E:\:/mnt/e -it b1d3268ea315 /bin/bash
free
total used free shared buff/cache available
Mem: 50Gi 792Mi 46Gi 1.0Mi 3.1Gi 48Gi
Swap: 13Gi 0B 13Gi
ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.12.0a0+8a1a93a
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.7
python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.12.0a0+8a1a93a
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.23.1
Libc version: glibc-2.31
Python version: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.7.64
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 516.94
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] pytorch-quantization==2.1.2
[pip3] torch==1.12.0a0+8a1a93a
[pip3] torch-tensorrt==1.1.0a0
[pip3] torchtext==0.13.0a0
[pip3] torchtyping==0.1.4
[pip3] torchvision==0.13.0a0
[conda] mkl 2019.5 281 conda-forge
[conda] mkl-include 2019.5 281 conda-forge
[conda] numpy 1.22.3 py38h99721a1_2 conda-forge
[conda] pytorch-quantization 2.1.2 pypi_0 pypi
[conda] torch 1.12.0a0+8a1a93a pypi_0 pypi
[conda] torch-tensorrt 1.1.0a0 pypi_0 pypi
[conda] torchtext 0.13.0a0 pypi_0 pypi
[conda] torchtyping 0.1.4 pypi_0 pypi
[conda] torchvision 0.13.0a0 pypi_0 pypi
Track log
/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
[2023-03-09 02:23:09,449] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-09 02:23:09,494] [INFO] [runner.py:548:main] cmd = /opt/conda/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train_gptj_summarize.py
/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
[2023-03-09 02:23:10,864] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.12.10+cuda11.6
[2023-03-09 02:23:10,864] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-03-09 02:23:10,864] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-03-09 02:23:10,864] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-03-09 02:23:10,864] [INFO] [launch.py:162:main] dist_world_size=1
[2023-03-09 02:23:10,864] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1402: UserWarning: positional arguments and argument "destination" are deprecated. nn.Module.state_dict will not accept them in the future. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
Found cached dataset parquet (/root/.cache/huggingface/datasets/CarperAI___parquet/CarperAI--openai_summarize_tldr-536d9955f5e6f921/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Found cached dataset parquet (/root/.cache/huggingface/datasets/CarperAI___parquet/CarperAI--openai_summarize_tldr-536d9955f5e6f921/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
[2023-03-09 02:23:33,945] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using True half precision backend
[2023-03-09 02:23:33,996] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed info: version=0.8.1, git-hash=unknown, git-branch=unknown
[2023-03-09 02:23:35,259] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/opt/conda/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.64772057533264 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000010, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1
[2023-03-09 02:23:54,821] [INFO] [logging.py:75:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-03-09 02:23:54,825] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-03-09 02:23:54,825] [INFO] [utils.py:53:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-03-09 02:23:54,825] [INFO] [logging.py:75:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-03-09 02:23:54,825] [INFO] [stage_1_and_2.py:144:__init__] Reduce bucket size 500,000,000
[2023-03-09 02:23:54,825] [INFO] [stage_1_and_2.py:145:__init__] Allgather bucket size 500000000
[2023-03-09 02:23:54,826] [INFO] [stage_1_and_2.py:146:__init__] CPU Offload: True
[2023-03-09 02:23:54,826] [INFO] [stage_1_and_2.py:147:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.08072209358215332 seconds
Rank: 0 partition count [1] and sizes[(125198592, False)]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/e/trlx/examples/summarize_rlhf/sft/train_gptj_summarize.py:112 in <module> │
│ │
│ 109 │ │ data_collator=default_data_collator, │
│ 110 │ │ preprocess_logits_for_metrics=preprocess_logits_for_metrics, │
│ 111 │ ) │
│ ❱ 112 │ trainer.train() │
│ 113 │ trainer.save_model(output_dir) │
│ 114 │
│ │
│ /opt/conda/lib/python3.8/site-packages/transformers/trainer.py:1543 in train │
│ │
│ 1540 │ │ inner_training_loop = find_executable_batch_size( │
│ 1541 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1542 │ │ ) │
│ ❱ 1543 │ │ return inner_training_loop( │
│ 1544 │ │ │ args=args, │
│ 1545 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1546 │ │ │ trial=trial, │
│ │
│ /opt/conda/lib/python3.8/site-packages/transformers/trainer.py:1612 in _inner_training_loop │
│ │
│ 1609 │ │ │ or self.fsdp is not None │
│ 1610 │ │ ) │
│ 1611 │ │ if args.deepspeed: │
│ ❱ 1612 │ │ │ deepspeed_engine, optimizer, lr_scheduler = deepspeed_init( │
│ 1613 │ │ │ │ self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_c │
│ 1614 │ │ │ ) │
│ 1615 │ │ │ self.model = deepspeed_engine.module │
│ │
│ /opt/conda/lib/python3.8/site-packages/transformers/deepspeed.py:344 in deepspeed_init │
│ │
│ 341 │ │ lr_scheduler=lr_scheduler, │
│ 342 │ ) │
│ 343 │ │
│ ❱ 344 │ deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs) │
│ 345 │ │
│ 346 │ if resume_from_checkpoint is not None: │
│ 347 │
│ │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py:125 in initialize │
│ │
│ 122 │ assert model is not None, "deepspeed.initialize requires a model" │
│ 123 │ │
│ 124 │ if not isinstance(model, PipelineModule): │
│ ❱ 125 │ │ engine = DeepSpeedEngine(args=args, │
│ 126 │ │ │ │ │ │ │ │ model=model, │
│ 127 │ │ │ │ │ │ │ │ optimizer=optimizer, │
│ 128 │ │ │ │ │ │ │ │ model_parameters=model_parameters, │
│ │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:336 in __init__ │
│ │
│ 333 │ │ │ model_parameters = self.module.parameters() │
│ 334 │ │ │
│ 335 │ │ if has_optimizer: │
│ ❱ 336 │ │ │ self._configure_optimizer(optimizer, model_parameters) │
│ 337 │ │ │ self._configure_lr_scheduler(lr_scheduler) │
│ 338 │ │ │ self._report_progress(0) │
│ 339 │ │ elif self.zero_optimization(): │
│ │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:1292 in _configure_optimizer │
│ │
│ 1289 │ │ optimizer_wrapper = self._do_optimizer_sanity_check(basic_optimizer) │
│ 1290 │ │ │
│ 1291 │ │ if optimizer_wrapper == ZERO_OPTIMIZATION: │
│ ❱ 1292 │ │ │ self.optimizer = self._configure_zero_optimizer(basic_optimizer) │
│ 1293 │ │ elif optimizer_wrapper == AMP: │
│ 1294 │ │ │ amp_params = self.amp_params() │
│ 1295 │ │ │ log_dist(f"Initializing AMP with these params: {amp_params}", ranks=[0]) │
│ │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:1542 in │
│ _configure_zero_optimizer │
│ │
│ 1539 │ │ │ │ │ │ "Pipeline parallelism does not support overlapped communication, │
│ 1540 │ │ │ │ │ ) │
│ 1541 │ │ │ │ │ overlap_comm = False │
│ ❱ 1542 │ │ │ optimizer = DeepSpeedZeroOptimizer( │
│ 1543 │ │ │ │ optimizer, │
│ 1544 │ │ │ │ self.param_names, │
│ 1545 │ │ │ │ timers=timers, │
│ │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:451 in __init__ │
│ │
│ 448 │ │ │ self.norm_for_param_grads = {} │
│ 449 │ │ │ self.local_overflow = False │
│ 450 │ │ │ self.grad_position = {} │
│ ❱ 451 │ │ │ self.temp_grad_buffer_for_cpu_offload = get_accelerator().pin_memory( │
│ 452 │ │ │ │ torch.zeros(largest_param_numel, │
│ 453 │ │ │ │ │ │ │ device=self.device, │
│ 454 │ │ │ │ │ │ │ dtype=self.dtype)) │
│ │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/accelerator/cuda_accelerator.py:214 in │
│ pin_memory │
│ │
│ 211 │ │ return torch.cuda.LongTensor │
│ 212 │ │
│ 213 │ def pin_memory(self, tensor): │
│ ❱ 214 │ │ return tensor.pin_memory() │
│ 215 │ │
│ 216 │ def on_accelerator(self, tensor): │
│ 217 │ │ device_str = str(tensor.device) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[2023-03-09 02:23:56,918] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 735
[2023-03-09 02:23:56,918] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', 'train_gptj_summarize.py', '--local_rank=0'] exits with return code = 1
ds config
{
"train_batch_size": 1,
"fp16": {
"enabled": true,
"min_loss_scale": 1,
"opt_level": "O2"
},
"zero_optimization": {
"stage": 2,
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"contiguous_gradients": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 1e-05,
"betas": [
0.9,
0.95
],
"eps": 1e-08,
"torch_adam": false
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 1e-05,
"warmup_num_steps": "auto"
}
}
}
Is CPU offload not supported in WSL2? I can run the same code successfully on another machine with a V100 GPU.
@lpty, I see you and @tjruwase are discussing the errors in https://github.com/microsoft/DeepSpeed/issues/2977, and it seems the OOM issue only shows up on WSL2. Please confirm: if it is only on WSL2 and is being discussed in #2977, I'll close this issue. --thanks, Bing
Closing due to no response from the user; will reopen if needed.