
[BUG] OOM error during setup, but the OS GPU memory monitor does not show an OOM condition

Open vmajor opened this issue 2 years ago • 6 comments

Describe the bug
When attempting to run this, I get an OOM:

accelerate launch train_dreambooth.py --pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4 --instance_data_dir=./inputs --output_dir=./outputs --instance_prompt="a photo of sks dog" --resolution=512 --train_batch_size=1 --gradient_accumulation_steps=1 --learning_rate=5e-6 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=400

Loading extension module utils...
Time to load utils op: 19.92158341407776 seconds
Rank: 0 partition count [1] and sizes[(859520964, False)]
[2023-01-07 20:45:30,471] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2023-01-07 20:45:30,472] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB         Max_MA 1.66 GB         CA 3.27 GB         Max_CA 3 GB
[2023-01-07 20:45:30,472] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 45.9 GB, percent = 73.2%
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/*****/diffusers/examples/dreambooth/train_dreambooth.py:828 in <module>                   │
│                                                                                                  │
│   825                                                                                            │
│   826 if __name__ == "__main__":                                                                 │
│   827 │   args = parse_args()                                                                    │
│ ❱ 828 │   main(args)                                                                             │
│   829                                                                                            │
│                                                                                                  │
│ /home/*****/diffusers/examples/dreambooth/train_dreambooth.py:657 in main                       │
│                                                                                                  │
│   654 │   │   │   unet, text_encoder, optimizer, train_dataloader, lr_scheduler                  │
│   655 │   │   )                                                                                  │
│   656 │   else:                                                                                  │
│ ❱ 657 │   │   unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(             │
│   658 │   │   │   unet, optimizer, train_dataloader, lr_scheduler                                │
│   659 │   │   )                                                                                  │
│   660                                                                                            │
│                                                                                                  │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/accelerate/accelerator.py:872 in │
│ prepare                                                                                          │
│                                                                                                  │
│    869 │   │   │   old_named_params = self._get_named_parameters(*args)                          │
│    870 │   │                                                                                     │
│    871 │   │   if self.distributed_type == DistributedType.DEEPSPEED:                            │
│ ❱  872 │   │   │   result = self._prepare_deepspeed(*args)                                       │
│    873 │   │   elif self.distributed_type == DistributedType.MEGATRON_LM:                        │
│    874 │   │   │   result = self._prepare_megatron_lm(*args)                                     │
│    875 │   │   else:                                                                             │
│                                                                                                  │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/accelerate/accelerator.py:1093   │
│ in _prepare_deepspeed                                                                            │
│                                                                                                  │
│   1090 │   │   │   │   │   │   if type(scheduler).__name__ in deepspeed.runtime.lr_schedules.VA  │
│   1091 │   │   │   │   │   │   │   kwargs["lr_scheduler"] = scheduler                            │
│   1092 │   │   │                                                                                 │
│ ❱ 1093 │   │   │   engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)           │
│   1094 │   │   │   if optimizer is not None:                                                     │
│   1095 │   │   │   │   optimizer = DeepSpeedOptimizerWrapper(optimizer)                          │
│   1096 │   │   │   if scheduler is not None:                                                     │
│                                                                                                  │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed/__init__.py:125 in     │
│ initialize                                                                                       │
│                                                                                                  │
│   122 │   assert model is not None, "deepspeed.initialize requires a model"                      │
│   123 │                                                                                          │
│   124 │   if not isinstance(model, PipelineModule):                                              │
│ ❱ 125 │   │   engine = DeepSpeedEngine(args=args,                                                │
│   126 │   │   │   │   │   │   │   │    model=model,                                              │
│   127 │   │   │   │   │   │   │   │    optimizer=optimizer,                                      │
│   128 │   │   │   │   │   │   │   │    model_parameters=model_parameters,                        │
│                                                                                                  │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed/runtime/engine.py:330  │
│ in __init__                                                                                      │
│                                                                                                  │
│    327 │   │   │   model_parameters = self.module.parameters()                                   │
│    328 │   │                                                                                     │
│    329 │   │   if has_optimizer:                                                                 │
│ ❱  330 │   │   │   self._configure_optimizer(optimizer, model_parameters)                        │
│    331 │   │   │   self._configure_lr_scheduler(lr_scheduler)                                    │
│    332 │   │   │   self._report_progress(0)                                                      │
│    333 │   │   elif self.zero_optimization():                                                    │
│                                                                                                  │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1210 │
│ in _configure_optimizer                                                                          │
│                                                                                                  │
│   1207 │   │   optimizer_wrapper = self._do_optimizer_sanity_check(basic_optimizer)              │
│   1208 │   │                                                                                     │
│   1209 │   │   if optimizer_wrapper == ZERO_OPTIMIZATION:                                        │
│ ❱ 1210 │   │   │   self.optimizer = self._configure_zero_optimizer(basic_optimizer)              │
│   1211 │   │   elif optimizer_wrapper == AMP:                                                    │
│   1212 │   │   │   amp_params = self.amp_params()                                                │
│   1213 │   │   │   log_dist(f"Initializing AMP with these params: {amp_params}", ranks=[0])      │
│                                                                                                  │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1455 │
│ in _configure_zero_optimizer                                                                     │
│                                                                                                  │
│   1452 │   │   │   │   │   │   "Pipeline parallelism does not support overlapped communication,  │
│   1453 │   │   │   │   │   )                                                                     │
│   1454 │   │   │   │   │   overlap_comm = False                                                  │
│ ❱ 1455 │   │   │   optimizer = DeepSpeedZeroOptimizer(                                           │
│   1456 │   │   │   │   optimizer,                                                                │
│   1457 │   │   │   │   self.param_names,                                                         │
│   1458 │   │   │   │   timers=timers,                                                            │
│                                                                                                  │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_a │
│ nd_2.py:521 in __init__                                                                          │
│                                                                                                  │
│    518 │   │   │   self.dynamic_loss_scale = True                                                │
│    519 │   │                                                                                     │
│    520 │   │   see_memory_usage("Before initializing optimizer states", force=True)              │
│ ❱  521 │   │   self.initialize_optimizer_states()                                                │
│    522 │   │   see_memory_usage("After initializing optimizer states", force=True)               │
│    523 │   │                                                                                     │
│    524 │   │   if dist.get_rank() == 0:                                                          │
│                                                                                                  │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_a │
│ nd_2.py:644 in initialize_optimizer_states                                                       │
│                                                                                                  │
│    641 │   │   │   │   dtype=self.single_partition_of_fp32_groups[i].dtype,                      │
│    642 │   │   │   │   device=self.device)                                                       │
│    643 │   │   │   self.single_partition_of_fp32_groups[                                         │
│ ❱  644 │   │   │   │   i].grad = single_grad_partition.pin_memory(                               │
│    645 │   │   │   │   ) if self.cpu_offload else single_grad_partition                          │
│    646 │   │                                                                                     │
│    647 │   │   self.optimizer.step()                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

RuntimeError: CUDA error: out of memory

Expected behavior
I'd expect the training script to run, with GPU memory properly managed.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/torch']
torch version .................... 1.13.0
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed install path ........... ['/home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.7.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6

Screenshots
(screenshot attached)

System info (please complete the following information):

  • OS: Windows 10 WSL2 5.15.79.1-microsoft-standard-WSL2
  • GPU: 1 x RTX 3060
  • Python version: 3.10.8

Launcher context: WSL2

vmajor avatar Jan 07 '23 12:01 vmajor

I have the same issue. I can train T5-large on a single 3090 Ti with batch_size=4 without OOM. However, when I tried to use DeepSpeed to speed up training (still on the single 3090 Ti), it reported OOM even with batch_size=1, and the memory monitor does not show an OOM condition either.

I followed the HuggingFace docs. I'm also using WSL2 on Windows.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/torch']
torch version .................... 1.10.0
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.6.0+9f7126fc, 9f7126fc, HEAD
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.3, hip 0.0

python -m torch.utils.collect_env

Collecting environment information...
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.9.15 (main, Nov 24 2022, 14:31:59)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 Ti
Nvidia driver version: 527.41
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==1.10.0
[pip3] torchaudio==0.10.0
[pip3] torchvision==0.11.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.3.1              h9edb442_10    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640
[conda] mkl-service               2.4.0            py39h7e14d7c_0    conda-forge
[conda] mkl_fft                   1.3.1            py39h0c7bc48_1    conda-forge
[conda] mkl_random                1.2.2            py39hde0f152_0    conda-forge
[conda] numpy                     1.23.5           py39h14f4228_0
[conda] numpy-base                1.23.5           py39h31eccc5_0
[conda] pytorch                   1.10.0          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                0.10.0               py39_cu113    pytorch
[conda] torchvision               0.11.0               py39_cu113    pytorch

stage2.config

{
    "bfloat16": {
        "enabled": "auto"
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 1e5
}

Error Log

[INFO|modeling_utils.py:2065] 2023-01-08 13:39:15,351 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at google/t5-large-lm-adapt.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 783M total params, 32M largest layer params.
  per CPU  |  per GPU |   Options
   19.69GB |   0.12GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
   19.69GB |   0.12GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
   17.50GB |   1.58GB | offload_param=none, offload_optimizer=cpu , zero_init=1
   17.50GB |   1.58GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    0.18GB |  13.25GB | offload_param=none, offload_optimizer=none, zero_init=1
    4.38GB |  13.25GB | offload_param=none, offload_optimizer=none, zero_init=0
[INFO|trainer.py:453] 2023-01-08 13:39:15,832 >> Using amp half precision backend
[2023-01-08 13:39:15,834] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.0+9f7126fc, git-hash=9f7126fc, git-branch=HEAD
[2023-01-08 13:39:18,120] [INFO] [engine.py:277:__init__] DeepSpeed Flops Profiler Enabled: False
Using /home/zihan/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/zihan/.cache/torch_extensions/py39_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8313558101654053 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000010, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2023-01-08 13:39:21,928] [INFO] [engine.py:1064:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-01-08 13:39:21,946] [INFO] [engine.py:1072:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-01-08 13:39:21,946] [INFO] [utils.py:48:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-01-08 13:39:21,946] [INFO] [logging.py:69:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2023-01-08 13:39:21,947] [INFO] [stage_1_and_2.py:125:__init__] Reduce bucket size 200000000.0
[2023-01-08 13:39:21,947] [INFO] [stage_1_and_2.py:126:__init__] Allgather bucket size 200000000.0
[2023-01-08 13:39:21,947] [INFO] [stage_1_and_2.py:127:__init__] CPU Offload: True
[2023-01-08 13:39:21,947] [INFO] [stage_1_and_2.py:128:__init__] Round robin gradient partitioning: False
Using /home/zihan/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Emitting ninja build file /home/zihan/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.12202024459838867 seconds
Rank: 0 partition count [1] and sizes[(783092736, False)]
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/THC/THCCachingHostAllocator.cpp line=280 error=2 : out of memory
Traceback (most recent call last):
  File "/home/zihan/project/continual_learning/my_tk_instruct/Tk-Instruct/src/run_s2s.py", line 623, in <module>
    main()
  File "/home/zihan/project/continual_learning/my_tk_instruct/Tk-Instruct/src/run_s2s.py", line 527, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/transformers/trainer.py", line 1255, in train
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/transformers/deepspeed.py", line 432, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 293, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1088, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1309, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/home/zihan/miniconda3/envs/tk_instruct/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 424, in __init__
    self.temp_grad_buffer_for_cpu_offload = torch.zeros(
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/THC/THCCachingHostAllocator.cpp:280
[2023-01-08 13:39:24,964] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 5043

ZhangzihanGit avatar Jan 08 '23 02:01 ZhangzihanGit

(Quoting ZhangzihanGit's comment above in full.)

Did you solve the problem? I'm facing a similar problem here...

xinj7 avatar Feb 28 '23 10:02 xinj7

@lavaaa7, can you please try with the latest DeepSpeed? I notice v0.6.0 in the log. Thanks!
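
If it helps, a quick sanity check (a sketch, assuming an upgrade via pip install -U deepspeed) to confirm which version the launching environment actually picks up:

import deepspeed
print(deepspeed.__version__)  # should report something newer than 0.6.0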

tjruwase avatar Feb 28 '23 19:02 tjruwase

│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:462 in __init__
│
│    459 │   │   │   #print(torch.cuda.memory.memory_summary())
│    460 │   │   │   print(torch.cuda.memory_allocated()/1024/1024, torch.cuda.max_memory_allocat
│    461 │   │   │   print(get_accelerator(), self.device, largest_param_numel/1024/1024, self.dt
│ ❱  462 │   │   │   self.temp_grad_buffer_for_cpu_offload = get_accelerator().pin_memory(
│    463 │   │   │   │   torch.zeros(largest_param_numel,
│    464 │   │   │   │   │   │   │   device=self.device,
│    465 │   │   │   │   │   │   │   dtype=self.dtype))

<deepspeed.accelerator.cuda_accelerator.CUDA_Accelerator object at 0x7f388086ef70> device: cpu largest_param_numel:38597376 type: torch.float16

Same error when running this code. Help please.
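
For reference, a minimal sketch that mirrors the failing allocation, to check whether pinning a host buffer of this size works at all on this machine; the element count and dtype are taken from the printout above, and the snippet is only a diagnostic, not a fix:

import torch

# Mirror the allocation DeepSpeed makes in stage_1_and_2.py: a CPU tensor of
# largest_param_numel fp16 elements, page-locked via pin_memory().
buf = torch.zeros(38597376, dtype=torch.float16, device="cpu")
pinned = buf.pin_memory()  # raises "CUDA error: out of memory" if the driver
                           # cannot page-lock this much host memory
print("pinned:", pinned.is_pinned())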

lpty avatar Mar 07 '23 11:03 lpty

@lpty, can you please share a stack trace?

Based on the original stack trace, it seems the OOM is in CPU memory. The screenshot below shows 73% CPU memory usage very early on.

Can you try not using offload by removing "offload_optimizer" from ds_config?
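
For reference, removing the optimizer offload from the stage2.config above would make the zero_optimization block look roughly like this (a sketch; everything else unchanged):

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    }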

tjruwase avatar Mar 07 '23 17:03 tjruwase

@tjruwase CPU/GPU memory should be enough for this job. I run the model gpt-neo-125M on an NVIDIA 3090 (24 GB) with 64 GB of CPU memory, on a Windows 10 system. Docker image:

REPOSITORY            TAG                   IMAGE ID       CREATED         SIZE
deepspeed/deepspeed   v072_torch112_cu117   b1d3268ea315   6 months ago    14.7GB

docker run --gpus all --name deepspeed -p 10022:22 --ipc=host --shm-size=16g --ulimit memlock=-1 --privileged -v E:\:/mnt/e -it b1d3268ea315 /bin/bash

free

              total        used        free      shared  buff/cache   available
Mem:           50Gi       792Mi        46Gi       1.0Mi       3.1Gi        48Gi
Swap:          13Gi          0B        13Gi

ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.12.0a0+8a1a93a
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.7

python -m torch.utils.collect_env

Collecting environment information...
PyTorch version: 1.12.0a0+8a1a93a
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.23.1
Libc version: glibc-2.31

Python version: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)  [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.7.64
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 516.94
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] pytorch-quantization==2.1.2
[pip3] torch==1.12.0a0+8a1a93a
[pip3] torch-tensorrt==1.1.0a0
[pip3] torchtext==0.13.0a0
[pip3] torchtyping==0.1.4
[pip3] torchvision==0.13.0a0
[conda] mkl                       2019.5                      281    conda-forge
[conda] mkl-include               2019.5                      281    conda-forge
[conda] numpy                     1.22.3           py38h99721a1_2    conda-forge
[conda] pytorch-quantization      2.1.2                    pypi_0    pypi
[conda] torch                     1.12.0a0+8a1a93a          pypi_0    pypi
[conda] torch-tensorrt            1.1.0a0                  pypi_0    pypi
[conda] torchtext                 0.13.0a0                 pypi_0    pypi
[conda] torchtyping               0.1.4                    pypi_0    pypi
[conda] torchvision               0.13.0a0                 pypi_0    pypi

Track log

/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
[2023-03-09 02:23:09,449] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-09 02:23:09,494] [INFO] [runner.py:548:main] cmd = /opt/conda/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train_gptj_summarize.py
/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
[2023-03-09 02:23:10,864] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.12.10+cuda11.6
[2023-03-09 02:23:10,864] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-03-09 02:23:10,864] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-03-09 02:23:10,864] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-03-09 02:23:10,864] [INFO] [launch.py:162:main] dist_world_size=1
[2023-03-09 02:23:10,864] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1402: UserWarning: positional arguments and argument "destination" are deprecated. nn.Module.state_dict will not accept them in the future. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
Found cached dataset parquet (/root/.cache/huggingface/datasets/CarperAI___parquet/CarperAI--openai_summarize_tldr-536d9955f5e6f921/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Found cached dataset parquet (/root/.cache/huggingface/datasets/CarperAI___parquet/CarperAI--openai_summarize_tldr-536d9955f5e6f921/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
[2023-03-09 02:23:33,945] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using True half precision backend
[2023-03-09 02:23:33,996] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed info: version=0.8.1, git-hash=unknown, git-branch=unknown
[2023-03-09 02:23:35,259] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/opt/conda/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.64772057533264 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000010, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1
[2023-03-09 02:23:54,821] [INFO] [logging.py:75:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-03-09 02:23:54,825] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-03-09 02:23:54,825] [INFO] [utils.py:53:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-03-09 02:23:54,825] [INFO] [logging.py:75:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-03-09 02:23:54,825] [INFO] [stage_1_and_2.py:144:__init__] Reduce bucket size 500,000,000
[2023-03-09 02:23:54,825] [INFO] [stage_1_and_2.py:145:__init__] Allgather bucket size 500000000
[2023-03-09 02:23:54,826] [INFO] [stage_1_and_2.py:146:__init__] CPU Offload: True
[2023-03-09 02:23:54,826] [INFO] [stage_1_and_2.py:147:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.08072209358215332 seconds
Rank: 0 partition count [1] and sizes[(125198592, False)]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/e/trlx/examples/summarize_rlhf/sft/train_gptj_summarize.py:112 in <module>                  │
│                                                                                                  │
│   109 │   │   data_collator=default_data_collator,                                               │
│   110 │   │   preprocess_logits_for_metrics=preprocess_logits_for_metrics,                       │
│   111 │   )                                                                                      │
│ ❱ 112 │   trainer.train()                                                                        │
│   113 │   trainer.save_model(output_dir)                                                         │
│   114                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/transformers/trainer.py:1543 in train                     │
│                                                                                                  │
│   1540 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1541 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1542 │   │   )                                                                                 │
│ ❱ 1543 │   │   return inner_training_loop(                                                       │
│   1544 │   │   │   args=args,                                                                    │
│   1545 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1546 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/transformers/trainer.py:1612 in _inner_training_loop      │
│                                                                                                  │
│   1609 │   │   │   or self.fsdp is not None                                                      │
│   1610 │   │   )                                                                                 │
│   1611 │   │   if args.deepspeed:                                                                │
│ ❱ 1612 │   │   │   deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(                   │
│   1613 │   │   │   │   self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_c  │
│   1614 │   │   │   )                                                                             │
│   1615 │   │   │   self.model = deepspeed_engine.module                                          │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/transformers/deepspeed.py:344 in deepspeed_init           │
│                                                                                                  │
│   341 │   │   lr_scheduler=lr_scheduler,                                                         │
│   342 │   )                                                                                      │
│   343 │                                                                                          │
│ ❱ 344 │   deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)          │
│   345 │                                                                                          │
│   346 │   if resume_from_checkpoint is not None:                                                 │
│   347                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py:125 in initialize                   │
│                                                                                                  │
│   122 │   assert model is not None, "deepspeed.initialize requires a model"                      │
│   123 │                                                                                          │
│   124 │   if not isinstance(model, PipelineModule):                                              │
│ ❱ 125 │   │   engine = DeepSpeedEngine(args=args,                                                │
│   126 │   │   │   │   │   │   │   │    model=model,                                              │
│   127 │   │   │   │   │   │   │   │    optimizer=optimizer,                                      │
│   128 │   │   │   │   │   │   │   │    model_parameters=model_parameters,                        │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:336 in __init__               │
│                                                                                                  │
│    333 │   │   │   model_parameters = self.module.parameters()                                   │
│    334 │   │                                                                                     │
│    335 │   │   if has_optimizer:                                                                 │
│ ❱  336 │   │   │   self._configure_optimizer(optimizer, model_parameters)                        │
│    337 │   │   │   self._configure_lr_scheduler(lr_scheduler)                                    │
│    338 │   │   │   self._report_progress(0)                                                      │
│    339 │   │   elif self.zero_optimization():                                                    │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:1292 in _configure_optimizer  │
│                                                                                                  │
│   1289 │   │   optimizer_wrapper = self._do_optimizer_sanity_check(basic_optimizer)              │
│   1290 │   │                                                                                     │
│   1291 │   │   if optimizer_wrapper == ZERO_OPTIMIZATION:                                        │
│ ❱ 1292 │   │   │   self.optimizer = self._configure_zero_optimizer(basic_optimizer)              │
│   1293 │   │   elif optimizer_wrapper == AMP:                                                    │
│   1294 │   │   │   amp_params = self.amp_params()                                                │
│   1295 │   │   │   log_dist(f"Initializing AMP with these params: {amp_params}", ranks=[0])      │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:1542 in                       │
│ _configure_zero_optimizer                                                                        │
│                                                                                                  │
│   1539 │   │   │   │   │   │   "Pipeline parallelism does not support overlapped communication,  │
│   1540 │   │   │   │   │   )                                                                     │
│   1541 │   │   │   │   │   overlap_comm = False                                                  │
│ ❱ 1542 │   │   │   optimizer = DeepSpeedZeroOptimizer(                                           │
│   1543 │   │   │   │   optimizer,                                                                │
│   1544 │   │   │   │   self.param_names,                                                         │
│   1545 │   │   │   │   timers=timers,                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:451 in __init__   │
│                                                                                                  │
│    448 │   │   │   self.norm_for_param_grads = {}                                                │
│    449 │   │   │   self.local_overflow = False                                                   │
│    450 │   │   │   self.grad_position = {}                                                       │
│ ❱  451 │   │   │   self.temp_grad_buffer_for_cpu_offload = get_accelerator().pin_memory(         │
│    452 │   │   │   │   torch.zeros(largest_param_numel,                                          │
│    453 │   │   │   │   │   │   │   device=self.device,                                           │
│    454 │   │   │   │   │   │   │   dtype=self.dtype))                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/accelerator/cuda_accelerator.py:214 in          │
│ pin_memory                                                                                       │
│                                                                                                  │
│   211 │   │   return torch.cuda.LongTensor                                                       │
│   212 │                                                                                          │
│   213 │   def pin_memory(self, tensor):                                                          │
│ ❱ 214 │   │   return tensor.pin_memory()                                                         │
│   215 │                                                                                          │
│   216 │   def on_accelerator(self, tensor):                                                      │
│   217 │   │   device_str = str(tensor.device)                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[2023-03-09 02:23:56,918] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 735
[2023-03-09 02:23:56,918] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', 'train_gptj_summarize.py', '--local_rank=0'] exits with return code = 1

ds config

{
  "train_batch_size": 1,
  "fp16": {
    "enabled": true,
    "min_loss_scale": 1,
    "opt_level": "O2"
  },
  "zero_optimization": {
    "stage": 2,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "contiguous_gradients": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-05,
      "betas": [
        0.9,
        0.95
      ],
      "eps": 1e-08,
      "torch_adam": false
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-05,
      "warmup_num_steps": "auto"
    }
  }
}

Is CPU offload not supported in WSL2? I can run the same code successfully on another machine with a V100 GPU.

lpty avatar Mar 09 '23 02:03 lpty

@lpty, I see you and @tjruwase are discussing the errors in https://github.com/microsoft/DeepSpeed/issues/2977, and it seems the OOM issue only shows up on WSL2. Please confirm this; if it is only on WSL2 and is being discussed in #2977, I'll close this issue. --thanks, Bing

xiexbing avatar Mar 27 '23 06:03 xiexbing

Closing due to no response from the user; will reopen if needed.

xiexbing avatar Apr 06 '23 18:04 xiexbing