DeepSpeedExamples
OOM despite ZeRO stage 3
I am following the HelloDeepSpeed example, yet I still hit a CUDA OOM even after moving all the way to ZeRO stage 3 with the configuration below.
deepspeed train_bert_ds.py --checkpoint_dir . --num_layers 24 --h_dim 4096
I added the offload options to support a larger model, but it still goes OOM on my 16 GB V100.
ds_config = {
    "train_micro_batch_size_per_gpu": batch_size,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-4
        }
    },
    "fp16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        }
    }
}
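Beyond the offload settings, these are the ZeRO-3 partitioning knobs I was planning to shrink next, based on the defaults that show up in the log below. The values are only an illustrative, untested sketch (assuming I have the JSON key names right), not something I have verified to avoid the OOM:

# Untested sketch: tighten the ZeRO-3 buckets and live-parameter limits.
# Values are illustrative; the defaults are the ones printed in the log below.
ds_config["zero_optimization"].update({
    "reduce_bucket_size": 5e7,                  # default 5e8
    "stage3_prefetch_bucket_size": 5e6,         # default 5e7
    "stage3_param_persistence_threshold": 1e4,  # default 1e5
    "stage3_max_live_parameters": 1e8,          # default 1e9
    "stage3_max_reuse_distance": 1e8,           # default 1e9
})
ds_config["train_micro_batch_size_per_gpu"] = 2  # down from the 8 shown in the log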
Is anything missing here? Please let me know if there is documentation on how to configure DeepSpeed when the model is too large to be loaded onto a single GPU.
Thank you.
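One thing I am not sure about is whether the model has to be constructed inside deepspeed.zero.Init() so that ZeRO-3 partitions parameters as they are created, instead of materializing the full 24-layer / 4096-dim model on every GPU first. A rough, untested sketch of what I mean (create_model is just a stand-in for however train_bert_ds.py builds the model, and I have not checked which extra arguments zero.Init accepts in 0.6.5):

import deepspeed

# Untested sketch: partition parameters at construction time under ZeRO-3.
# create_model() is a placeholder for the model construction in train_bert_ds.py.
with deepspeed.zero.Init():  # can also take the DS config / a remote device; exact args unverified here
    model = create_model(num_layers=24, h_dim=4096)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)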
Version info
python --version
Python 3.9.12
python -c "import torch; print(torch.__version__)"
1.12.0+cu116
python -c "import deepspeed; print(deepspeed.__version__)"
0.6.5
Log output
[2022-07-13 21:03:49,576] [WARNING] [runner.py:159:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-07-13 21:03:49,576] [INFO] [runner.py:457:main] cmd = /working/anaconda3/envs/tmp/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 train_bert_ds.py --checkpoint_dir . --num_layers 24 --h_dim 4096
[2022-07-13 21:03:50,730] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2022-07-13 21:03:50,731] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=4, node_rank=0
[2022-07-13 21:03:50,731] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2022-07-13 21:03:50,731] [INFO] [launch.py:123:main] dist_world_size=4
[2022-07-13 21:03:50,731] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2022-07-13 21:04:39,659] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
[2022-07-13 21:04:39,905] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
[2022-07-13 21:04:39,909] [INFO] [distributed.py:48:init_distributed] Initializing torch distributed with backend: nccl
[2022-07-13 21:04:40,010] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
[2022-07-13 21:04:40,289] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
[2022-07-13 21:04:46,103] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
Using /home/user/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /home/user/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /home/user/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /home/user/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Creating extension directory /home/user/.cache/torch_extensions/py39_cu116/cpu_adam...
Creating extension directory /home/user/.cache/torch_extensions/py39_cu116/cpu_adam...
Creating extension directory /home/user/.cache/torch_extensions/py39_cu116/cpu_adam...
Creating extension directory /home/user/.cache/torch_extensions/py39_cu116/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/user/.cache/torch_extensions/py39_cu116/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/working/anaconda3/envs/tmp/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /working/anaconda3/envs/tmp/lib/python3.9/site-packages/torch/include -isystem /working/anaconda3/envs/tmp/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /working/anaconda3/envs/tmp/lib/python3.9/site-packages/torch/include/TH -isystem /working/anaconda3/envs/tmp/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /working/anaconda3/envs/tmp/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /working/anaconda3/envs/tmp/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/working/anaconda3/envs/tmp/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /working/anaconda3/envs/tmp/lib/python3.9/site-packages/torch/include -isystem /working/anaconda3/envs/tmp/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /working/anaconda3/envs/tmp/lib/python3.9/site-packages/torch/include/TH -isystem /working/anaconda3/envs/tmp/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /working/anaconda3/envs/tmp/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -c /working/anaconda3/envs/tmp/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/working/anaconda3/envs/tmp/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 63.22985577583313 seconds
Time to load cpu_adam op: 63.2216637134552 seconds
Time to load cpu_adam op: 63.241652727127075 seconds
Time to load cpu_adam op: 63.24095416069031 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000100, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000100, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000100, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000100, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2022-07-13 21:05:51,130] [INFO] [engine.py:1100:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
Using /home/user/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /home/user/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /home/user/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
[2022-07-13 21:05:51,164] [INFO] [engine.py:1108:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2022-07-13 21:05:51,164] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2022-07-13 21:05:51,164] [INFO] [logging.py:69:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2022-07-13 21:05:51,164] [INFO] [engine.py:1410:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2022-07-13 21:05:51,174] [INFO] [stage3.py:275:__init__] Reduce bucket size 500000000
[2022-07-13 21:05:51,174] [INFO] [stage3.py:276:__init__] Prefetch bucket size 50000000
Using /home/user/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Emitting ninja build file /home/user/.cache/torch_extensions/py39_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.9306187629699707 seconds
Loading extension module utils...
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.9696803092956543 seconds
Time to load utils op: 0.9813756942749023 seconds
Time to load utils op: 1.0037789344787598 seconds
[2022-07-13 21:05:59,485] [INFO] [stage3.py:567:_setup_for_real_optimizer] optimizer state initialized
Using /home/user/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /home/user/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/user/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.03724932670593262 seconds
Time to load utils op: 0.037149906158447266 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.037335872650146484 seconds
[2022-07-13 21:06:00,737] [INFO] [utils.py:828:see_memory_usage] After initializing ZeRO optimizer
[2022-07-13 21:06:00,738] [INFO] [utils.py:829:see_memory_usage] MA 0.93 GB Max_MA 3.64 GB CA 5.35 GB Max_CA 5 GB
[2022-07-13 21:06:00,738] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 56.33 GB, percent = 23.5%
[2022-07-13 21:06:00,738] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam
[2022-07-13 21:06:00,738] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2022-07-13 21:06:00,738] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-07-13 21:06:00,739] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0001], mom=[(0.9, 0.999)]
[2022-07-13 21:06:00,739] [INFO] [config.py:1059:print] DeepSpeedEngine configuration:
[2022-07-13 21:06:00,740] [INFO] [config.py:1063:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2022-07-13 21:06:00,740] [INFO] [config.py:1063:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2022-07-13 21:06:00,740] [INFO] [config.py:1063:print] amp_enabled .................. False
[2022-07-13 21:06:00,740] [INFO] [config.py:1063:print] amp_params ................... False
[2022-07-13 21:06:00,740] [INFO] [config.py:1063:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": null,
"exps_dir": null,
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2022-07-13 21:06:00,740] [INFO] [config.py:1063:print] bfloat16_enabled ............. False
[2022-07-13 21:06:00,740] [INFO] [config.py:1063:print] checkpoint_tag_validation_enabled True
[2022-07-13 21:06:00,740] [INFO] [config.py:1063:print] checkpoint_tag_validation_fail False
[2022-07-13 21:06:00,740] [INFO] [config.py:1063:print] communication_data_type ...... None
[2022-07-13 21:06:00,740] [INFO] [config.py:1063:print] curriculum_enabled ........... False
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] curriculum_params ............ False
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] dataloader_drop_last ......... False
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] disable_allgather ............ False
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] dump_state ................... False
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] dynamic_loss_scale_args ...... None
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] eigenvalue_enabled ........... False
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] eigenvalue_gas_boundary_resolution 1
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] eigenvalue_layer_name ........ bert.encoder.layer
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] eigenvalue_layer_num ......... 0
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] eigenvalue_max_iter .......... 100
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] eigenvalue_stability ......... 1e-06
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] eigenvalue_tol ............... 0.01
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] eigenvalue_verbose ........... False
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] elasticity_enabled ........... False
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] fp16_enabled ................. True
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] fp16_master_weights_and_gradients False
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] fp16_mixed_quantize .......... False
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] global_rank .................. 0
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] gradient_accumulation_steps .. 1
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] gradient_clipping ............ 0.0
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] gradient_predivide_factor .... 1.0
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] initial_dynamic_scale ........ 4294967296
[2022-07-13 21:06:00,741] [INFO] [config.py:1063:print] loss_scale ................... 0
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] memory_breakdown ............. False
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] optimizer_legacy_fusion ...... False
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] optimizer_name ............... adam
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] optimizer_params ............. {'lr': 0.0001}
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] pld_enabled .................. False
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] pld_params ................... False
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] prescale_gradients ........... False
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] quantize_change_rate ......... 0.001
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] quantize_groups .............. 1
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] quantize_offset .............. 1000
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] quantize_period .............. 1000
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] quantize_rounding ............ 0
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] quantize_start_bits .......... 16
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] quantize_target_bits ......... 8
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] quantize_training_enabled .... False
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] quantize_type ................ 0
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] quantize_verbose ............. False
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] scheduler_name ............... None
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] scheduler_params ............. None
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] sparse_attention ............. None
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] sparse_gradients_enabled ..... False
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] steps_per_print .............. 10
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] tensorboard_enabled .......... False
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] tensorboard_job_name ......... DeepSpeedJobName
[2022-07-13 21:06:00,742] [INFO] [config.py:1063:print] tensorboard_output_path ......
[2022-07-13 21:06:00,743] [INFO] [config.py:1063:print] train_batch_size ............. 32
[2022-07-13 21:06:00,743] [INFO] [config.py:1063:print] train_micro_batch_size_per_gpu 8
[2022-07-13 21:06:00,743] [INFO] [config.py:1063:print] use_quantizer_kernel ......... False
[2022-07-13 21:06:00,743] [INFO] [config.py:1063:print] wall_clock_breakdown ......... False
[2022-07-13 21:06:00,743] [INFO] [config.py:1063:print] world_size ................... 4
[2022-07-13 21:06:00,743] [INFO] [config.py:1063:print] zero_allow_untested_optimizer False
[2022-07-13 21:06:00,743] [INFO] [config.py:1063:print] zero_config .................. {
"stage": 3,
"contiguous_gradients": true,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": true,
"load_from_fp32_weights": true,
"elastic_checkpoint": false,
"offload_param": {
"device": "cpu",
"nvme_path": null,
"buffer_count": 5,
"buffer_size": 1.000000e+08,
"max_in_cpu": 1.000000e+09,
"pin_memory": true
},
"offload_optimizer": {
"device": "cpu",
"nvme_path": null,
"buffer_count": 4,
"pin_memory": true,
"pipeline_read": false,
"pipeline_write": false,
"fast_init": false,
"pipeline": false
},
"sub_group_size": 1.000000e+09,
"prefetch_bucket_size": 5.000000e+07,
"param_persistence_threshold": 1.000000e+05,
"max_live_parameters": 1.000000e+09,
"max_reuse_distance": 1.000000e+09,
"gather_16bit_weights_on_model_save": false,
"ignore_unused_parameters": true,
"round_robin_gradients": false,
"legacy_stage1": false
}
[2022-07-13 21:06:00,743] [INFO] [config.py:1063:print] zero_enabled ................. True
[2022-07-13 21:06:00,743] [INFO] [config.py:1063:print] zero_optimization_stage ...... 3
[2022-07-13 21:06:00,743] [INFO] [config.py:1065:print] json = {
"train_micro_batch_size_per_gpu": 8,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0001
}
},
"fp16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
}
}
}
Using /home/user/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00040459632873535156 seconds
[2022-07-13 21:06:03,821] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
[2022-07-13 21:06:09,157] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
[2022-07-13 21:06:12,234] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[2022-07-13 21:06:14,997] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[2022-07-13 21:06:17,642] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[2022-07-13 21:06:20,375] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[2022-07-13 21:06:23,449] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[2022-07-13 21:06:26,188] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[2022-07-13 21:06:29,001] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[2022-07-13 21:06:31,773] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[2022-07-13 21:06:31,774] [INFO] [logging.py:69:log_dist] [Rank 0] step=10, skipped=10, lr=[0.0001], mom=[(0.9, 0.999)]
[2022-07-13 21:06:31,774] [INFO] [timer.py:193:stop] 0/10, SamplesPerSec=11.377975746114256, MemAllocated=0.93GB, MaxMemAllocated=11.82GB
[2022-07-13 21:06:34,666] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[2022-07-13 21:06:37,427] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[2022-07-13 21:06:40,155] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[2022-07-13 21:06:42,967] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[2022-07-13 21:06:45,659] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[2022-07-13 21:06:48,232] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[2022-07-13 21:06:50,825] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
[2022-07-13 21:06:55,240] [WARNING] [stage3.py:2391:step] 8 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2022-07-13 21:07:03,688] [WARNING] [stage3.py:2391:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2022-07-13 21:07:03,688] [INFO] [logging.py:69:log_dist] [Rank 0] step=20, skipped=17, lr=[0.0001], mom=[(0.9, 0.999)]
[2022-07-13 21:07:03,689] [INFO] [timer.py:193:stop] 0/20, SamplesPerSec=10.613364425206317, MemAllocated=0.93GB, MaxMemAllocated=11.82GB
[2022-07-13 21:07:10,470] [INFO] [stage3.py:2281:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2022-07-13 21:07:15,108] [WARNING] [stage3.py:2391:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2022-07-13 21:07:44,458] [WARNING] [stage3.py:2391:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2022-07-13 21:07:44,459] [INFO] [logging.py:69:log_dist] [Rank 0] step=30, skipped=18, lr=[0.0001], mom=[(0.9, 0.999)]
[2022-07-13 21:07:44,459] [INFO] [timer.py:193:stop] 0/30, SamplesPerSec=9.442885534716904, MemAllocated=0.93GB, MaxMemAllocated=11.82GB
[2022-07-13 21:07:49,210] [WARNING] [stage3.py:2391:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2022-07-13 21:07:58,024] [WARNING] [stage3.py:2391:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2022-07-13 21:08:26,711] [INFO] [logging.py:69:log_dist] [Rank 0] step=40, skipped=18, lr=[0.0001], mom=[(0.9, 0.999)]
[2022-07-13 21:08:26,711] [INFO] [timer.py:193:stop] 0/40, SamplesPerSec=8.876825164208146, MemAllocated=0.93GB, MaxMemAllocated=12.76GB
[2022-07-13 21:08:47,866] [WARNING] [stage3.py:2391:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2022-07-13 21:09:00,475] [WARNING] [stage3.py:2391:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2022-07-13 21:09:08,702] [INFO] [logging.py:69:log_dist] [Rank 0] step=50, skipped=18, lr=[0.0001], mom=[(0.9, 0.999)]
[2022-07-13 21:09:08,702] [INFO] [timer.py:193:stop] 0/50, SamplesPerSec=8.589520405326056, MemAllocated=0.93GB, MaxMemAllocated=12.76GB
[2022-07-13 21:09:13,651] [WARNING] [stage3.py:2391:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2022-07-13 21:09:34,759] [WARNING] [stage3.py:2391:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2022-07-13 21:09:43,405] [WARNING] [stage3.py:2391:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2022-07-13 21:09:47,868] [WARNING] [stage3.py:2391:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2022-07-13 21:09:51,932] [INFO] [logging.py:69:log_dist] [Rank 0] step=60, skipped=18, lr=[0.0001], mom=[(0.9, 0.999)]
[2022-07-13 21:09:51,932] [INFO] [timer.py:193:stop] 0/60, SamplesPerSec=8.36398552618659, MemAllocated=0.93GB, MaxMemAllocated=12.76GB
[2022-07-13 21:10:08,189] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 50086
[2022-07-13 21:10:08,189] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 50087
[2022-07-13 21:10:08,190] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 50088
[2022-07-13 21:10:08,190] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 50089
[2022-07-13 21:10:08,190] [ERROR] [launch.py:184:sigkill_handler] ['/working/anaconda3/envs/tmp/bin/python', '-u', 'train_bert_ds.py', '--local_rank=3', '--checkpoint_dir', '.', '--num_layers', '24', '--h_dim', '4096'] exits with return code = 1
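P.S. The repeated allocator cache-flush warnings above suggest calling torch.cuda.empty_cache() inside the training loop. This is roughly where I would try it, assuming a loop along the lines of the HelloDeepSpeed example (untested, and data_iterator / the forward call are just placeholders for what the script actually does):

import torch

for step, batch in enumerate(data_iterator):  # data_iterator: placeholder for the example's dataloader
    loss = model_engine(**batch)              # however the example computes the loss
    model_engine.backward(loss)
    model_engine.step()
    torch.cuda.empty_cache()                  # per the warning: flush caches at the same point on every rank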