
[BUG] TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object

Open · xiaoxiaodecheng opened this issue 8 months ago · 1 comment

I am using DeepSpeed to accelerate YOLOv5 training and have added the corresponding initialization step to the training file (`train.py`). Below is the code I added to the training file, followed by my DeepSpeed configuration file.

```python
model, optimizer, _, _ = deepspeed.initialize(
    args=opt,
    model=model,
    optimizer=optimizer,
    model_parameters=model.parameters(),
    config=opt.deepspeed_config_file,
)
```

ds_config.json:

```json
{
    "train_batch_size": 16,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 0
    },
    "zero_allow_untested_optimizer": true
}
```
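The report does not show how the config path reaches `opt`; a minimal sketch of one common way to wire it up (only `--deepspeed_config_file` is taken from the snippet above, everything else here is an assumption) would be:

```python
# Hypothetical argument wiring for the snippet above; only
# --deepspeed_config_file is taken from the report, the rest is assumed.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--deepspeed_config_file", type=str, default="ds_config.json",
                    help="path to the DeepSpeed JSON config shown above")
parser.add_argument("--local_rank", type=int, default=-1,
                    help="set by the deepspeed launcher for each process")
opt = parser.parse_args()
```

The script would then be launched with something like `deepspeed train.py --deepspeed_config_file ds_config.json`.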

However, I encountered the following error during execution and am not sure how to resolve it:

```
ep_module_on_host=False replace_with_kernel_inject=False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print]   timers_config ................ enabled=True synchronized=True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print]   train_batch_size ............. 16
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print]   train_micro_batch_size_per_gpu  16
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print]   use_data_before_expert_parallel_  False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print]   use_node_local_storage ....... False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print]   wall_clock_breakdown ......... False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print]   weight_quantization_config ... None
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print]   world_size ................... 1
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print]   zero_allow_untested_optimizer  True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print]   zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print]   zero_enabled ................. False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print]   zero_force_ds_cpu_optimizer .. True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print]   zero_optimization_stage ...... 0
[2025-04-24 20:06:55,076] [INFO] [config.py:993:print_user_config]   json = { "train_batch_size": 16, "gradient_accumulation_steps": 1, "zero_optimization": { "stage": 0 }, "zero_allow_untested_optimizer": true }
Traceback (most recent call last):
  File "/home/admslc/code/yolov5/train.py", line 986, in <module>
    main(opt)
  File "/home/admslc/code/yolov5/train.py", line 854, in main
    train(opt.hyp, opt, device, callbacks)
  File "/home/admslc/code/yolov5/train.py", line 260, in train
    ema = ModelEMA(model) if RANK in {-1, 0} else None
  File "/home/admslc/code/yolov5/utils/torch_utils.py", line 412, in __init__
    self.ema = deepcopy(de_parallel(model)).eval()  # FP32 EMA
  File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
  File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 161, in deepcopy
    rv = reductor(4)
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
[2025-04-24 20:06:54,411] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 149527
```
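The failure mode itself can be shown in isolation: `deepcopy` falls back to pickling any attribute it does not know how to copy, and a `ProcessGroup` handle is not picklable. A minimal sketch (single-process `gloo` group, purely illustrative and not the YOLOv5 code; behavior matches the traceback in this environment) looks like this:

```python
import copy
import os

import torch.distributed as dist

# Single-process process group for illustration only.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

class Holder:
    """Stands in for any object that keeps a process-group handle as a member."""
    def __init__(self):
        self.pg = dist.new_group(ranks=[0])  # torch._C._distributed_c10d.ProcessGroup

# deepcopy falls back to __reduce_ex__ (pickling) for the ProcessGroup and raises:
# TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
copy.deepcopy(Holder())
```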

xiaoxiaodecheng · Apr 24 '25 12:04

The object being deep-copied here (`de_parallel(model)` in `ModelEMA.__init__`) includes a `torch.distributed` process group as a member, and `deepcopy` cannot pickle that object. You could either avoid the `deepcopy`, or exclude that member from being deep-copied.
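One way the first option might look in YOLOv5's `train.py`, assuming that after `deepspeed.initialize` the local `model` is a DeepSpeed engine exposing the raw `nn.Module` as `model.module` (an assumption, not a tested patch): build the EMA from the unwrapped module so `deepcopy` never touches the process-group handles.

```python
# Hedged workaround sketch for train.py (untested; assumes `model` is the
# engine returned by deepspeed.initialize and exposes the raw nn.Module
# as `model.module`).
raw_model = model.module if hasattr(model, "module") else model  # unwrap the engine
ema = ModelEMA(raw_model) if RANK in {-1, 0} else None
```

The second option would mean defining a custom `__deepcopy__` on the wrapper, or removing the process-group entries from its state before copying, so the handle is never handed to the pickler.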

inkcherry · Apr 25 '25 06:04