open_clip

Evaluation error RuntimeError: ProcessGroupWrapper: Monitored Barrier encountered error running collective

Open · samirchar opened this issue 1 month ago · 1 comment

I am training OpenCLIP in a distributed environment with 4 nodes and 8 GPUs per node. I set epochs to 40, and after training finishes the last epoch (epoch 39) I get a CUDA error. Note that it trains fine through the last epoch, including the evaluation after each epoch, but it fails when it enters the final evaluation. The traceback is below. Shouldn't the model be unwrapped from DDP for evaluate if eval only runs on the master? I think that while the last evaluation is running, the other ranks finish, and DDP then tries to sync and fails.

I am using PyTorch 2.7.1 with Python 3.10 and CUDA 12.6.

```
2025-10-30,14:14:29 | INFO | Start epoch 39
2025-10-30,14:14:32 | INFO | Train Epoch: 39 [ 8192/2916352 (0%)] Data (t): 2.483 Batch (t): 3.019, 2713.79/s, 84.8059/s/gpu LR: 0.000002 Logit Scale: 88.598 Contrastive_loss: 0.046415 (0.046415) Loss: 0.046415 (0.046415)
2025-10-30,14:15:42 | INFO | Train Epoch: 39 [ 827392/2916352 (28%)] Data (t): 0.085 Batch (t): 0.699, 11850.7/s, 370.335/s/gpu LR: 0.000001 Logit Scale: 88.607 Contrastive_loss: 0.059184 (0.052800) Loss: 0.059184 (0.052800)
2025-10-30,14:16:51 | INFO | Train Epoch: 39 [1646592/2916352 (56%)] Data (t): 0.085 Batch (t): 0.699, 11751.0/s, 367.218/s/gpu LR: 0.000000 Logit Scale: 88.611 Contrastive_loss: 0.048979 (0.051526) Loss: 0.048979 (0.051526)
2025-10-30,14:18:01 | INFO | Train Epoch: 39 [2465792/2916352 (85%)] Data (t): 0.083 Batch (t): 0.697, 12237.6/s, 382.423/s/gpu LR: 0.000000 Logit Scale: 88.612 Contrastive_loss: 0.049657 (0.051059) Loss: 0.049657 (0.051059)
2025-10-30,14:18:39 | INFO | Train Epoch: 39 [2916352/2916352 (100%)] Data (t): 0.082 Batch (t): 0.685, 13228.0/s, 413.375/s/gpu LR: 0.000000 Logit Scale: 88.612 Contrastive_loss: 0.046689 (0.050185) Loss: 0.046689 (0.050185)
Traceback (most recent call last):
  File "/scratch/amlt_code/bin/train.py", line 1081, in <module>
    main(sys.argv[1:])
  File "/scratch/amlt_code/bin/train.py", line 1013, in main
    evaluate(model, data, completed_epoch, args, tb_writer=writer, tokenizer=tokenizer)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/open_clip_train/train.py", line 281, in evaluate
    model_out = model(images, texts)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1633, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1529, in _pre_forward
    self._sync_buffers()
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2166, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2170, in _sync_module_buffers
    self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2192, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2107, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: ProcessGroupWrapper: Monitored Barrier encountered error running collective: CollectiveFingerPrint(SequenceNumber=327510, OpType=BROADCAST, TensorShape=[5929], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))). Error: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [172.25.1.192]:55498
```
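
To illustrate what I mean by unwrapping, here is a rough sketch (hypothetical, not the actual open_clip evaluate code): call the underlying .module on the master rank so the eval forward pass skips DDP's buffer broadcast.

```python
import torch


def master_only_forward(model, images, texts, is_master):
    """Hypothetical sketch: run the eval forward only on the master rank,
    using the unwrapped module so DDP's _sync_buffers broadcast is skipped."""
    if not is_master:
        return None
    # DistributedDataParallel exposes the wrapped model as .module;
    # calling that directly avoids the broadcast seen in the traceback above.
    eval_model = model.module if hasattr(model, "module") else model
    eval_model.eval()
    with torch.no_grad():
        return eval_model(images, texts)
```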

samirchar · Oct 30 '25 20:10

@samirchar you're probably right re: the final eval getting out of sync. There should probably be a dist barrier after eval so that the other processes wait until the primary/master has finished...
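
Something along these lines, i.e. a barrier right after the evaluate() call in main() (a sketch only, the wrapper and its argument names are assumptions, not the exact train.py code):

```python
import torch.distributed as dist


def evaluate_with_barrier(evaluate_fn, model, data, epoch, args, **kwargs):
    """Hypothetical wrapper: the primary rank does the real work inside
    evaluate_fn, then every rank waits at a barrier so none of them exits
    (closing its gloo TCP connection) while eval is still running."""
    metrics = evaluate_fn(model, data, epoch, args, **kwargs)
    if getattr(args, "distributed", False) and dist.is_initialized():
        # All ranks meet here, so the "Connection closed by peer" during
        # DDP's buffer broadcast can't happen mid-eval.
        dist.barrier()
    return metrics
```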

rwightman · Dec 02 '25 17:12