
[BUG] DeepSpeed ZeRO-2 CPU offloading: process killed with return code -9

Open JiyouShin opened this issue 2 years ago • 7 comments

Hi,

I am using DeepSpeed ZeRO-2 with CPU offloading to fine-tune an LLM. I keep getting an error like the following, with no detailed error description:

[2023-10-26 17:54:44,801] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2454240
[2023-10-26 17:54:48,155] [ERROR] [launch.py:321:sigkill_handler] ['/data_new/sjy98/polyglot-ko/data-parallel/deepspeed-venv/bin/python3', '-u', 'deepspeed-trainer.py', '--local_rank=3', '--deepspeed', 'deepspeed_config_2.json'] exits with return code = -9

I found out that an error message like this is usually caused by memory issues. The strange part is that when I reduce my training data size from 54,000 to 30,000 samples, training works fine; as soon as I increase the training data size again, the error comes back.

It is a little hard to believe that the size of the training data causes a memory issue, but is that possible when using DeepSpeed ZeRO-2 CPU offload?

Below is my deepspeed_config:

{
    "fp16": {
        "enabled": false
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": true
    },
    "communication_data_type": "fp32",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
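
As a side note, "cpu_offload": true inside zero_optimization is the legacy ZeRO-2 spelling; current DeepSpeed versions express the same thing through an offload_optimizer section. Below is a minimal sketch of the equivalent block; the smaller bucket sizes and "pin_memory": false are assumptions to experiment with, not a verified fix (pinned host memory cannot be swapped out, so avoiding it can give the host a bit more breathing room):

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 5e7,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e7,
        "contiguous_gradients": true,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": false
        }
    }

Note that neither change touches the main consumer of host RAM when offloading is enabled: the fp32 optimizer states that ZeRO-Offload keeps on the CPU. Those states share host memory with the dataset and any tokenization cache, so a larger training set can be exactly what tips the host over its limit, at which point the kernel OOM killer sends SIGKILL and the launcher reports return code = -9. Watching free host memory (for example with free -h or htop) while the job ramps up is usually enough to confirm this.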

Also, I am using 4 A100 80GB GPUs for parallel training. Any suggestions or thoughts would be very helpful. Thanks!

JiyouShin avatar Oct 26 '23 09:10 JiyouShin

Hi @JiyouShin, I got the same error. Have you found a solution?

ChocoWu avatar Dec 21 '23 07:12 ChocoWu

In my program, the bug seems to be triggered by insufficient CPU memory when saving the checkpoint.

ChocoWu avatar Dec 21 '23 08:12 ChocoWu

Any update on this issue? I got the same error.

[2024-01-22 16:50:44,316] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 26.4 GB, percent = 84.6%
[2024-01-22 16:50:49,553] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1080504

I am fine-tuning GPT-3 6.7B with a single GPU of RTX 3090 24G memory.

This is my config file:

{ "train_batch_size" : CONFIG_BATCH_SIZE, "train_micro_batch_size_per_gpu": CONFIG_MBSIZE, "steps_per_print": LOG_INTERVAL, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "nvme", "nvme_path": "nvme", "pin_memory": true, "ratio": 0.3, "buffer_count": 4, "fast_init": false }, "offload_param": { "device": "nvme", "nvme_path": "nvme", "pin_memory": true, "buffer_count": 5, "buffer_size": 1e9, "max_in_cpu": 1e9 },

"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": 0,
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e8,
"stage3_max_reuse_distance": 0,
"stage3_gather_16bit_weights_on_model_save": true

}, "gradient_clipping": 1.0, "prescale_gradients":false,

"fp16": { "enabled": CONFIG_FP16_ENABLED, "loss_scale": 0, "loss_scale_window": 500, "hysteresis": 2, "min_loss_scale": 1, "initial_scale_power": 11 },

"bf16": { "enabled": CONFIG_BF16_ENABLED }, "wall_clock_breakdown" : false }


This is the log trace:

[2024-01-22 16:50:41,962] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.7+870ae041, git-hash=870ae041, git-branch=master
[2024-01-22 16:50:42,005] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-01-22 16:50:42,006] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-01-22 16:50:42,006] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-01-22 16:50:42,012] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2024-01-22 16:50:42,012] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2024-01-22 16:50:42,012] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-01-22 16:50:42,012] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2024-01-22 16:50:42,049] [INFO] [utils.py:791:see_memory_usage] Stage 3 initialize beginning
[2024-01-22 16:50:42,049] [INFO] [utils.py:792:see_memory_usage] MA 3.78 GB Max_MA 4.01 GB CA 4.04 GB Max_CA 4 GB
[2024-01-22 16:50:42,049] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 26.39 GB, percent = 84.6%
[2024-01-22 16:50:42,050] [INFO] [stage3.py:128:init] Reduce bucket size 500,000,000
[2024-01-22 16:50:42,050] [INFO] [stage3.py:129:init] Prefetch bucket size 0
[2024-01-22 16:50:42,085] [INFO] [utils.py:791:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-01-22 16:50:42,086] [INFO] [utils.py:792:see_memory_usage] MA 3.78 GB Max_MA 3.78 GB CA 4.04 GB Max_CA 4 GB
[2024-01-22 16:50:42,086] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 26.39 GB, percent = 84.6%
[2024-01-22 16:50:42,765] [INFO] [utils.py:30:print_object] AsyncPartitionedParameterSwapper:
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] aio_handle ................... <class 'async_io.aio_handle'>
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] aligned_bytes ................ 1024
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] aligned_elements_per_buffer .. 1000000000
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] available_buffer_ids ......... [0, 1, 2, 3, 4]
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] available_numel .............. 0
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] available_params ............. set()
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] dtype ........................ torch.bfloat16
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] elements_per_buffer .......... 1000000000
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] id_to_path ................... {}
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] inflight_numel ............... 0
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] inflight_params .............. []
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] inflight_swap_in_buffers ..... []
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] invalid_buffer ............... 1.0
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] min_aio_bytes ................ 1048576
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] numel_alignment .............. 512
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] param_buffer_count ........... 5
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] param_id_to_buffer_id ........ {}
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] param_id_to_numel ............ {}
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] param_id_to_swap_buffer ...... {}
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] partitioned_swap_buffer ...... None
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] partitioned_swap_pool ........ None
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] pending_reads ................ 0
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] pending_writes ............... 0
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] reserved_buffer_ids .......... []
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] swap_config .................. device='nvme' nvme_path=PosixPath('nvme') buffer_count=5 buffer_size=1000000000 max_in_cpu=1000000000 pin_memory=True
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] swap_element_size ............ 2
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] swap_folder .................. nvme/zero_stage_3/bfloat16params/rank0
[2024-01-22 16:50:42,765] [INFO] [utils.py:34:print_object] swap_out_params .............. []
Parameter Offload: Total persistent parameters: 803840 in 194 params
[2024-01-22 16:50:44,239] [INFO] [utils.py:791:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-01-22 16:50:44,239] [INFO] [utils.py:792:see_memory_usage] MA 0.0 GB Max_MA 3.78 GB CA 4.04 GB Max_CA 4 GB
[2024-01-22 16:50:44,239] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 26.4 GB, percent = 84.6%
Using /home/tflow/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/tflow/.cache/torch_extensions/py310_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.037041664123535156 seconds
[2024-01-22 16:50:44,315] [INFO] [utils.py:791:see_memory_usage] Before creating fp16 partitions
[2024-01-22 16:50:44,315] [INFO] [utils.py:792:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 4.04 GB Max_CA 4 GB
[2024-01-22 16:50:44,316] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 26.4 GB, percent = 84.6%
[2024-01-22 16:50:49,553] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1080504
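
One thing that stands out in this trace is that CPU virtual memory is already at 26.4 GB (84.6%) before the fp16 partitions are even created, and the config pins large host buffers for NVMe swapping: the parameter swapper alone reserves 5 buffers of 1e9 bf16 elements (see aligned_elements_per_buffer, available_buffer_ids and swap_element_size above), roughly 10 GB of pinned RAM. Below is a rough sketch of a more conservative offload section for a host with around 32 GB of RAM; the reduced buffer sizes and pin_memory: false are assumptions to experiment with, not verified settings, and buffer_size still has to hold the largest single parameter (for a 6.7B model that is on the order of 2e8 elements):

    "offload_optimizer": {
        "device": "nvme",
        "nvme_path": "nvme",
        "pin_memory": false,
        "ratio": 0.3,
        "buffer_count": 4,
        "fast_init": false
    },
    "offload_param": {
        "device": "nvme",
        "nvme_path": "nvme",
        "pin_memory": false,
        "buffer_count": 4,
        "buffer_size": 2.5e8,
        "max_in_cpu": 2.5e8
    }

Smaller, unpinned buffers trade some swap throughput for host-memory headroom. If the process is still killed, the remaining suspects are the DeepSpeedCPUAdam optimizer-state buffers and whatever else is already occupying the 26 GB shown above.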

Any help would be greatly appreciated. Thanks!

0781532 avatar Jan 22 '24 09:01 0781532

In my program, the bug seems to be triggered by insufficient CPU memory when saving the checkpoint.

Thanks! That was the cause in my program as well.

liuchengyuan123 avatar Jul 16 '24 03:07 liuchengyuan123

Facing the same problem: when using DeepSpeed ZeRO-2 + offload, the process is easily killed while saving a shard of the model. Please add at least a warning about insufficient CPU memory. It took me quite a few days to find the workaround of removing the offload setting, roughly as sketched below.
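
For reference, a minimal sketch of what that workaround amounts to when applied to the config from the original post: keep ZeRO stage 2 but drop the offload entry entirely (the bucket sizes are carried over unchanged and are not part of the fix):

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    }

This of course only helps when the optimizer states fit back into GPU memory; if they do not, the alternative is to keep offloading and free up host RAM instead (hold less of the dataset in memory, use fewer dataloader workers, or add system RAM or swap) so that the checkpoint save has enough headroom.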

janenie avatar Jul 25 '24 12:07 janenie