
[BUG] DeepSpeed ZeRO-2 CPU offloading: process killed with return code -9

Open JiyouShin opened this issue 2 years ago • 7 comments

Hi,

I am using DeepSpeed ZeRO-2 with CPU offloading to fine-tune an LLM. I keep getting an error like the following, with no detailed error description:

[2023-10-26 17:54:44,801] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2454240
[2023-10-26 17:54:48,155] [ERROR] [launch.py:321:sigkill_handler] ['/data_new/sjy98/polyglot-ko/data-parallel/deepspeed-venv/bin/python3', '-u', 'deepspeed-trainer.py', '--local_rank=3', '--deepspeed', 'deepspeed_config_2.json'] exits with return code = -9

I found out that an error message like this is usually caused by memory issues. The strange part is that when I reduce my training data size from 54,000 to 30,000 samples, training works fine; as soon as I increase the training data size again, the error comes back.

It is a little hard to believe that the size of the training data causes a memory issue, but is that possible when using DeepSpeed ZeRO-2 CPU offload?

Below is my deepspeed_config:

{
    "fp16": {
        "enabled": false
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": true
    },
    "communication_data_type": "fp32",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
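
As a side note, "cpu_offload": true inside zero_optimization is the legacy ZeRO-2 spelling; current DeepSpeed versions express the same thing through an offload_optimizer section. Below is a minimal sketch of the equivalent block; the smaller bucket sizes and "pin_memory": false are assumptions to experiment with, not a verified fix (pinned host memory cannot be swapped out, so avoiding it can give the host a bit more breathing room):

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 5e7,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e7,
        "contiguous_gradients": true,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": false
        }
    }

Note that neither change touches the main consumer of host RAM when offloading is enabled: the fp32 optimizer states that ZeRO-Offload keeps on the CPU. Those states share host memory with the dataset and any tokenization cache, so a larger training set can be exactly what tips the host over its limit, at which point the kernel OOM killer sends SIGKILL and the launcher reports return code = -9. Watching free host memory (for example with free -h or htop) while the job ramps up is usually enough to confirm this.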

Also, I am using 4 A100 80GB GPUs for parallel training. Any suggestions or thoughts would be very helpful. Thanks!

JiyouShin avatar Oct 26 '23 09:10 JiyouShin

Hi @JiyouShin, I got the same error. Have you found a solution?

ChocoWu avatar Dec 21 '23 07:12 ChocoWu

In my program, the bug seems to be triggered by insufficient CPU memory when saving the checkpoint.

ChocoWu avatar Dec 21 '23 08:12 ChocoWu

Any update on this issue? I got the same error.

[2024-01-22 16:50:44,316] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 26.4 GB, percent = 84.6%
[2024-01-22 16:50:49,553] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1080504

I am fine-tuning GPT-3 6.7B with a single GPU of RTX 3090 24G memory.

This is my config file:

{ "train_batch_size" : CONFIG_BATCH_SIZE, "train_micro_batch_size_per_gpu": CONFIG_MBSIZE, "steps_per_print": LOG_INTERVAL, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "nvme", "nvme_path": "nvme", "pin_memory": true, "ratio": 0.3, "buffer_count": 4, "fast_init": false }, "offload_param": { "device": "nvme", "nvme_path": "nvme", "pin_memory": true, "buffer_count": 5, "buffer_size": 1e9, "max_in_cpu": 1e9 },

"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": 0,
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e8,
"stage3_max_reuse_distance": 0,
"stage3_gather_16bit_weights_on_model_save": true

}, "gradient_clipping": 1.0, "prescale_gradients":false,

"fp16": { "enabled": CONFIG_FP16_ENABLED, "loss_scale": 0, "loss_scale_window": 500, "hysteresis": 2, "min_loss_scale": 1, "initial_scale_power": 11 },

"bf16": { "enabled": CONFIG_BF16_ENABLED }, "wall_clock_breakdown" : false }


This is the log trace:

[2024-01-22 16:50:41,962] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.7+870ae041, git-hash=870ae041, git-branch=master
[2024-01-22 16:50:42,005] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-01-22 16:50:42,006] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-01-22 16:50:42,006] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-01-22 16:50:42,012] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2024-01-22 16:50:42,012] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2024-01-22 16:50:42,012] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-01-22 16:50:42,012] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2024-01-22 16:50:42,049] [INFO] [utils.py:791:see_memory_usage] Stage 3 initialize beginning
[2024-01-22 16:50:42,049] [INFO] [utils.py:792:see_memory_usage] MA 3.78 GB Max_MA 4.01 GB CA 4.04 GB Max_CA 4 GB
[2024-01-22 16:50:42,049] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 26.39 GB, percent = 84.6%
[2024-01-22 16:50:42,050] [INFO] [stage3.py:128:init] Reduce bucket size 500,000,000
[2024-01-22 16:50:42,050] [INFO] [stage3.py:129:init] Prefetch bucket size 0
[2024-01-22 16:50:42,085] [INFO] [utils.py:791:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-01-22 16:50:42,086] [INFO] [utils.py:792:see_memory_usage] MA 3.78 GB Max_MA 3.78 GB CA 4.04 GB Max_CA 4 GB
[2024-01-22 16:50:42,086] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 26.39 GB, percent = 84.6%
[2024-01-22 16:50:42,765] [INFO] [utils.py:30:print_object] AsyncPartitionedParameterSwapper:
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] aio_handle ................... <class 'async_io.aio_handle'>
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] aligned_bytes ................ 1024
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] aligned_elements_per_buffer .. 1000000000
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] available_buffer_ids ......... [0, 1, 2, 3, 4]
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] available_numel .............. 0
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] available_params ............. set()
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] dtype ........................ torch.bfloat16
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] elements_per_buffer .......... 1000000000
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] id_to_path ................... {}
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] inflight_numel ............... 0
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] inflight_params .............. []
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] inflight_swap_in_buffers ..... []
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] invalid_buffer ............... 1.0
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] min_aio_bytes ................ 1048576
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] numel_alignment .............. 512
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] param_buffer_count ........... 5
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] param_id_to_buffer_id ........ {}
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] param_id_to_numel ............ {}
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] param_id_to_swap_buffer ...... {}
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] partitioned_swap_buffer ...... None
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] partitioned_swap_pool ........ None
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] pending_reads ................ 0
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] pending_writes ............... 0
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] reserved_buffer_ids .......... []
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] swap_config .................. device='nvme' nvme_path=PosixPath('nvme') buffer_count=5 buffer_size=1000000000 max_in_cpu=1000000000 pin_memory=True
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] swap_element_size ............ 2
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] swap_folder .................. nvme/zero_stage_3/bfloat16params/rank0
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] swap_out_params .............. []
Parameter Offload: Total persistent parameters: 803840 in 194 params
[2024-01-22 16:50:44,239] [INFO] [utils.py:791:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-01-22 16:50:44,239] [INFO] [utils.py:792:see_memory_usage] MA 0.0 GB Max_MA 3.78 GB CA 4.04 GB Max_CA 4 GB
[2024-01-22 16:50:44,239] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 26.4 GB, percent = 84.6%
Using /home/tflow/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/tflow/.cache/torch_extensions/py310_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.037041664123535156 seconds
[2024-01-22 16:50:44,315] [INFO] [utils.py:791:see_memory_usage] Before creating fp16 partitions
[2024-01-22 16:50:44,315] [INFO] [utils.py:792:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 4.04 GB Max_CA 4 GB
[2024-01-22 16:50:44,316] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 26.4 GB, percent = 84.6%
[2024-01-22 16:50:49,553] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1080504
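
One thing that stands out in this trace is that CPU virtual memory is already at 26.4 GB (84.6%) before the fp16 partitions are even created, and the config pins large host buffers for NVMe swapping: the parameter swapper alone reserves 5 buffers of 1e9 bf16 elements (see aligned_elements_per_buffer, available_buffer_ids and swap_element_size above), roughly 10 GB of pinned RAM. Below is a rough sketch of a more conservative offload section for a host with around 32 GB of RAM; the reduced buffer sizes and pin_memory: false are assumptions to experiment with, not verified settings, and buffer_size still has to hold the largest single parameter (for a 6.7B model that is on the order of 2e8 elements):

    "offload_optimizer": {
        "device": "nvme",
        "nvme_path": "nvme",
        "pin_memory": false,
        "ratio": 0.3,
        "buffer_count": 4,
        "fast_init": false
    },
    "offload_param": {
        "device": "nvme",
        "nvme_path": "nvme",
        "pin_memory": false,
        "buffer_count": 4,
        "buffer_size": 2.5e8,
        "max_in_cpu": 2.5e8
    }

Smaller, unpinned buffers trade some swap throughput for host-memory headroom. If the process is still killed, the remaining suspects are the DeepSpeedCPUAdam optimizer-state buffers and whatever else is already occupying the 26 GB shown above.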

Any help would be greatly appreciated. Thanks!

0781532 avatar Jan 22 '24 09:01 0781532

In my program, the bug seems to be triggered by insufficient CPU memory when saving the checkpoint.

Thanks! That was the cause in my program as well.

liuchengyuan123 avatar Jul 16 '24 03:07 liuchengyuan123

Facing the same problem: when using DeepSpeed ZeRO-2 + offload, the process is easily killed while saving a shard of the model. Please add at least a warning about insufficient CPU memory. It took me quite a few days to find the workaround of removing the offload setting, roughly as sketched below.
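
For reference, a minimal sketch of what that workaround amounts to when applied to the config from the original post: keep ZeRO stage 2 but drop the offload entry entirely (the bucket sizes are carried over unchanged and are not part of the fix):

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    }

This of course only helps when the optimizer states fit back into GPU memory; if they do not, the alternative is to keep offloading and free up host RAM instead (hold less of the dataset in memory, use fewer dataloader workers, or add system RAM or swap) so that the checkpoint save has enough headroom.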

janenie avatar Jul 25 '24 12:07 janenie