[BUG] TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
I am using DeepSpeed to accelerate YOLOv5 training and have added the corresponding initialization steps to the training script. Below is the code I added to the training file, followed by my DeepSpeed configuration file.
```python
model, optimizer, _, _ = deepspeed.initialize(
    args=opt,
    model=model,
    optimizer=optimizer,
    model_parameters=model.parameters(),
    config=opt.deepspeed_config_file,
)
```
ds_config.json:
```json
{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 1,
  "zero_optimization": {
    "stage": 0
  },
  "zero_allow_untested_optimizer": true
}
```
However, I encountered the following error during execution and am not sure how to resolve it:
```
ep_module_on_host=False replace_with_kernel_inject=False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] timers_config ................ enabled=True synchronized=True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] train_batch_size ............. 16
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] train_micro_batch_size_per_gpu 16
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] use_data_before_expert_parallel_ False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] use_node_local_storage ....... False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] wall_clock_breakdown ......... False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] weight_quantization_config ... None
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] world_size ................... 1
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_allow_untested_optimizer True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_enabled ................. False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_force_ds_cpu_optimizer .. True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_optimization_stage ...... 0
[2025-04-24 20:06:55,076] [INFO] [config.py:993:print_user_config] json = {
"train_batch_size": 16,
"gradient_accumulation_steps": 1,
"zero_optimization": {
"stage": 0
},
"zero_allow_untested_optimizer": true
}
Traceback (most recent call last):
File "/home/admslc/code/yolov5/train.py", line 986, in
This class (the result of `(de_parallel(model)).eval()`) includes a `torch.distributed` process group as a member, and deepcopy cannot copy that object. Maybe you can either avoid using deepcopy here or exclude this member from being deepcopied.
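For context, the deepcopy appears to come from YOLOv5's `ModelEMA`, which runs `deepcopy(de_parallel(model)).eval()` when it is constructed; after `deepspeed.initialize`, `model` is a `DeepSpeedEngine` that holds a `ProcessGroup`, so the copy fails with the pickle error above. Below is a minimal, unverified sketch of one workaround I am considering: build the EMA from the engine's underlying module (it assumes `RANK` is the rank variable `train.py` already defines).

```python
# Sketch of a possible workaround (unverified): build the EMA from the plain
# nn.Module wrapped by the DeepSpeedEngine, so deepcopy never sees the
# ProcessGroup held by the engine itself.
from utils.torch_utils import ModelEMA

# DeepSpeedEngine exposes the wrapped model as `.module`; fall back to `model`
# when DeepSpeed is not in use.
ema_source = model.module if hasattr(model, 'module') else model
ema = ModelEMA(ema_source) if RANK in {-1, 0} else None
```

Alternatively, the EMA could be created before calling `deepspeed.initialize`, so it only ever sees the unwrapped model.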