Evaluation error RuntimeError: ProcessGroupWrapper: Monitored Barrier encountered error running collective
I am training OpenCLIP in a distributed environment with 4 nodes and 8 GPUs per node. I set epochs to 40, and after training finishes the last epoch (epoch 39) I get a CUDA error. Note that it trains fine through the last epoch, including the evaluation after each epoch, but it fails when it enters the final evaluation. The traceback is below. Shouldn't the model be unwrapped from DDP for evaluation if eval only runs on the master? My guess is that while the final evaluation is running on the master, the other ranks are finishing up, and DDP tries to sync and fails (see the sketch after the traceback).
I am using PyTorch 2.7.1 with Python 3.10 and CUDA 12.6.
2025-10-30,14:14:29 | INFO | Start epoch 39
2025-10-30,14:14:32 | INFO | Train Epoch: 39 [ 8192/2916352 (0%)] Data (t): 2.483 Batch (t): 3.019, 2713.79/s, 84.8059/s/gpu LR: 0.000002 Logit Scale: 88.598 Contrastive_loss: 0.046415 (0.046415) Loss: 0.046415 (0.046415)
2025-10-30,14:15:42 | INFO | Train Epoch: 39 [ 827392/2916352 (28%)] Data (t): 0.085 Batch (t): 0.699, 11850.7/s, 370.335/s/gpu LR: 0.000001 Logit Scale: 88.607 Contrastive_loss: 0.059184 (0.052800) Loss: 0.059184 (0.052800)
2025-10-30,14:16:51 | INFO | Train Epoch: 39 [1646592/2916352 (56%)] Data (t): 0.085 Batch (t): 0.699, 11751.0/s, 367.218/s/gpu LR: 0.000000 Logit Scale: 88.611 Contrastive_loss: 0.048979 (0.051526) Loss: 0.048979 (0.051526)
2025-10-30,14:18:01 | INFO | Train Epoch: 39 [2465792/2916352 (85%)] Data (t): 0.083 Batch (t): 0.697, 12237.6/s, 382.423/s/gpu LR: 0.000000 Logit Scale: 88.612 Contrastive_loss: 0.049657 (0.051059) Loss: 0.049657 (0.051059)
2025-10-30,14:18:39 | INFO | Train Epoch: 39 [2916352/2916352 (100%)] Data (t): 0.082 Batch (t): 0.685, 13228.0/s, 413.375/s/gpu LR: 0.000000 Logit Scale: 88.612 Contrastive_loss: 0.046689 (0.050185) Loss: 0.046689 (0.050185)
Traceback (most recent call last):
File "/scratch/amlt_code/bin/train.py", line 1081, in
@samirchar you're probably right about the final eval getting out of sync. There should probably be a dist barrier after eval so that the other processes wait until the primary/master has finished...
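A minimal, self-contained sketch of that idea (not the open_clip source; `evaluate_on_master`, the `gloo`/CPU backend, and the spawn setup are just stand-ins for illustration): every rank hits a `dist.barrier()` after the master-only eval, so non-master ranks can't tear down their process group while rank 0 is still evaluating. In the real training script the barrier would go right after the `evaluate(...)` call in the epoch loop.

```python
import os
import time

import torch.distributed as dist
import torch.multiprocessing as mp


def evaluate_on_master(rank: int) -> None:
    # Stand-in for the real evaluate(): only rank 0 does the (slow) work.
    if rank != 0:
        return
    time.sleep(5)  # pretend this is the final validation / zero-shot pass
    print("rank 0: evaluation finished")


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # ... final training epoch ends here ...
    evaluate_on_master(rank)

    # Proposed fix: every rank waits here until the master is done evaluating,
    # so nobody exits or destroys its process group early.
    dist.barrier()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```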