DeepSpeed
[BUG] torch.compile doesn't work with stage 2 on 32 GPUs
Describe the bug
I'm working on a stable diffusion model. When I use torch.compile together with ZeRO stage 2 on 32 GPUs (4 machines with 8 A100s each), training hangs at the first step. The same setup runs fine on 16 GPUs (2 machines with 8 A100s each, or 4 machines with 4 A100s each).
To Reproduce
Steps to reproduce the behavior (a minimal end-to-end sketch follows the list):
- Wrap the model with torch.compile, then initialize DeepSpeed:

  ```python
  model = torch.compile(model)
  model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                       model=model,
                                                       model_parameters=model.parameters())
  ```

- Run on 32 GPUs; the first training step hangs
- Get the trace (shown below)
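For context, here is a minimal self-contained sketch of the wiring described above. The toy model, the AdamW optimizer (assumed from the lr/betas printed in the log below), and the config filename are assumptions standing in for the actual training script:

```python
# Minimal sketch of the reported setup (assumptions, not the real script):
# a toy model stands in for the stable diffusion model.
import torch
import deepspeed

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1024, 4096),
            torch.nn.GELU(),
            torch.nn.Linear(4096, 1024),
        )

    def forward(self, x):
        return self.net(x)

# Compile BEFORE deepspeed.initialize, as in the report.
model = torch.compile(ToyModel())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed optimizer

# "ds_config.json" is assumed to hold the ZeRO stage-2 / bf16 config shown below.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="ds_config.json",
)

for step in range(3):
    x = torch.randn(32, 1024, device=model_engine.device, dtype=torch.bfloat16)
    loss = model_engine(x).float().pow(2).mean()
    model_engine.backward(loss)  # the hang is reported inside this call on 32 GPUs
    model_engine.step()
```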
```
Thread 4939 (idle): "MainThread"
backward (torch/autograd/__init__.py:199)
backward (torch/_tensor.py:488)
backward (deepspeed/runtime/fp16/loss_scaler.py:51)
backward (deepspeed/runtime/zero/stage_1_and_2.py:2013)
backward (deepspeed/runtime/engine.py:1860)
wrapped_fn (deepspeed/utils/nvtx.py:11)
backward ()
backward ()
main ()
<module> ()
Thread 5031 (idle): "Thread-1"
wait (threading.py:316)
wait (threading.py:574)
run (threading.py:1264)
_bootstrap_inner (threading.py:954)
_bootstrap (threading.py:912)
Thread 5185 (idle): "Thread-2"
select (selectors.py:416)
wait (multiprocessing/connection.py:936)
_wait_for_updates (multiprocessing/pool.py:499)
_handle_workers (multiprocessing/pool.py:519)
run (threading.py:892)
_bootstrap_inner (threading.py:954)
_bootstrap (threading.py:912)
Thread 5186 (idle): "Thread-3"
_handle_tasks (multiprocessing/pool.py:528)
run (threading.py:892)
_bootstrap_inner (threading.py:954)
_bootstrap (threading.py:912)
Thread 5187 (idle): "Thread-4"
_recv (multiprocessing/connection.py:384)
_recv_bytes (multiprocessing/connection.py:419)
recv (multiprocessing/connection.py:255)
_handle_results (multiprocessing/pool.py:576)
run (threading.py:892)
_bootstrap_inner (threading.py:954)
_bootstrap (threading.py:912)
Thread 6112 (idle): "Thread-5"
select (selectors.py:416)
wait (multiprocessing/connection.py:936)
wait_result_broken_or_wakeup (concurrent/futures/process.py:377)
run (concurrent/futures/process.py:317)
_bootstrap_inner (threading.py:954)
_bootstrap (threading.py:912)
Thread 10675 (idle): "Thread-6"
wait (threading.py:316)
wait (threading.py:574)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:954)
_bootstrap (threading.py:912)
Thread 11457 (idle): "Thread-7"
select (selectors.py:416)
wait (multiprocessing/connection.py:936)
_poll (multiprocessing/connection.py:429)
poll (multiprocessing/connection.py:262)
get (multiprocessing/queues.py:113)
do_one_step (torch/utils/data/_utils/pin_memory.py:29)
_pin_memory_loop (torch/utils/data/_utils/pin_memory.py:52)
run (threading.py:892)
_bootstrap_inner (threading.py:954)
_bootstrap (threading.py:912)
Thread 11702 (idle): "QueueFeederThread"
wait (threading.py:312)
_feed (multiprocessing/queues.py:233)
run (threading.py:892)
_bootstrap_inner (threading.py:954)
_bootstrap (threading.py:912)
Thread 11712 (idle): "QueueFeederThread"
wait (threading.py:312)
_feed (multiprocessing/queues.py:233)
run (threading.py:892)
_bootstrap_inner (threading.py:954)
_bootstrap (threading.py:912)
Thread 11721 (idle): "QueueFeederThread"
wait (threading.py:312)
_feed (multiprocessing/queues.py:233)
run (threading.py:892)
_bootstrap_inner (threading.py:954)
_bootstrap (threading.py:912)
Thread 11724 (idle): "QueueFeederThread"
wait (threading.py:312)
_feed (multiprocessing/queues.py:233)
run (threading.py:892)
_bootstrap_inner (threading.py:954)
_bootstrap (threading.py:912)
Thread 20612 (active+gil): "Dummy-8"
launcher (<string>:6)
run (torch/_inductor/triton_ops/autotune.py:188)
call (cifq5uqmulfdutry2swxfi3d2oau7toq2rdcvbdkkin5a73mq5sd.py:420)
run (torch/_inductor/compile_fx.py:224)
_fn (torch/_dynamo/eval_frame.py:209)
call_func_with_args (torch/_functorch/aot_autograd.py:1021)
call_compiled_backward (torch/_functorch/aot_autograd.py:1882)
backward (torch/_functorch/aot_autograd.py:1906)
apply (torch/autograd/function.py:275)
Thread 20613 (idle)
Thread 20614 (idle)
Thread 20615 (idle)
Thread 20616 (idle)
Thread 20617 (idle)
Thread 20618 (idle)
Thread 20619 (idle)
Thread 20645 (idle): "QueueFeederThread"
wait (threading.py:312)
_feed (multiprocessing/queues.py:233)
run (threading.py:892)
_bootstrap_inner (threading.py:954)
_bootstrap (threading.py:912)
```
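As an aside, here is one hedged way to capture a per-thread dump like the one above from inside the process (an assumption, not part of the original report; the dump's format suggests it was actually taken with an external sampler such as `py-spy dump --pid <pid>`):

```python
# Hedged debugging aid (not from the original report): register a signal
# handler so a hung rank dumps every thread's Python stack on demand.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)

# While the job hangs, run `kill -USR1 <pid>` against the stuck rank to print
# all thread stacks to stderr, making cross-rank comparison of the hang easier.
```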
Expected behavior
The training should not hang.
ds_report output
(Note: the log below is the DeepSpeedEngine configuration print from startup rather than literal ds_report output.)
```
[2023-03-11 09:09:11,935] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = NewCls
[2023-03-11 09:09:11,936] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-03-11 09:09:11,936] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-03-11 09:09:11,936] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0001], mom=[(0.9, 0.999)]
[2023-03-11 09:09:11,937] [INFO] [config.py:1020:print] DeepSpeedEngine configuration:
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] amp_enabled .................. False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] amp_params ................... False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] bfloat16_enabled ............. True
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] checkpoint_parallel_write_pipeline False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] checkpoint_tag_validation_enabled True
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] checkpoint_tag_validation_fail False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f93a44ed5b0>
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] communication_data_type ...... None
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] curriculum_enabled ........... False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] curriculum_params ............ False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] dataloader_drop_last ......... False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] disable_allgather ............ False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] dump_state ................... False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] dynamic_loss_scale_args ...... None
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] eigenvalue_enabled ........... False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] eigenvalue_gas_boundary_resolution 1
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] eigenvalue_layer_num ......... 0
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] eigenvalue_max_iter .......... 100
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] eigenvalue_stability ......... 1e-06
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] eigenvalue_tol ............... 0.01
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] eigenvalue_verbose ........... False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] elasticity_enabled ........... False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] fp16_auto_cast ............... None
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] fp16_enabled ................. False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] fp16_master_weights_and_gradients False
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] global_rank .................. 0
[2023-03-11 09:09:11,938] [INFO] [config.py:1024:print] grad_accum_dtype ............. None
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] gradient_accumulation_steps .. 1
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] gradient_clipping ............ 1
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] gradient_predivide_factor .... 1.0
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] initial_dynamic_scale ........ 1
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] load_universal_checkpoint .... False
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] loss_scale ................... 1.0
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] memory_breakdown ............. False
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] monitor_config ............... <deepspeed.monitor.config.DeepSpeedMonitorConfig object at 0x7f93a44ed3d0>
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] optimizer_legacy_fusion ...... False
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] optimizer_name ............... None
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] optimizer_params ............. None
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] pld_enabled .................. False
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] pld_params ................... False
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] prescale_gradients ........... False
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] scheduler_name ............... None
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] scheduler_params ............. None
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] sparse_attention ............. None
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] sparse_gradients_enabled ..... False
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] steps_per_print .............. 50
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] train_batch_size ............. 1024
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] train_micro_batch_size_per_gpu 32
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] use_node_local_storage ....... False
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] wall_clock_breakdown ......... False
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] world_size ................... 32
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] zero_allow_untested_optimizer True
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] zero_enabled ................. True
[2023-03-11 09:09:11,939] [INFO] [config.py:1024:print] zero_optimization_stage ...... 2
[2023-03-11 09:09:11,939] [INFO] [config.py:1009:print_user_config] json = {
"train_micro_batch_size_per_gpu": 32,
"prescale_gradients": false,
"zero_allow_untested_optimizer": true,
"bf16": {
"enabled": true
},
"fp16": {
"enabled": false
},
"wall_clock_breakdown": false,
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"reduce_scatter": true,
"overlap_comm": false,
"contiguous_gradients": true,
"offload_optimizer": {
"device": "none"
}
},
"steps_per_print": 50,
"train_batch_size": 1.024000e+03,
"gradient_clipping": 1,
"gradient_accumulation_steps": 1
}
```
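For reference, the batch-size fields in the config above are mutually consistent with the 32-GPU run (a sanity check, not part of the original report):

```python
# train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size
micro_batch_per_gpu = 32
gradient_accumulation_steps = 1
world_size = 32  # 4 machines x 8 A100s
assert micro_batch_per_gpu * gradient_accumulation_steps * world_size == 1024
```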
System info (please complete the following information):
- OS: Debian GNU/Linux 11
- Machines: 4 nodes with 8 A100s each
- Interconnect: 100 Gbps RDMA
- Python version: 3.9.2
Additional context
It's quite odd that both 2 machines × 8 GPUs and 4 machines × 4 GPUs work, but 4 machines × 8 GPUs does not. Also, bfloat16_enabled prints True, yet the trace shows deepspeed/runtime/fp16/loss_scaler.py:51.
Any information is appreciated. Thanks!
@lleizuo, could you please provide additional details (e.g., model and training hyperparameters) to help reproduce this issue?
@lleizuo Hello, have you solved this problem?
Hi @noob-ctrl, do you have a repro?
Closing, please re-open with a repro if needed.