
Incorrect accuracy calculation with DistributedDataParallel - examples/imagenet/main.py

Open numpee opened this issue 4 years ago • 6 comments

The accuracy() function here divides the number of correct samples by the batch size. Then, the top1 accuracy is updated with AverageMeter, and top1.avg is returned. This is incorrect, since the input to the top1 AverageMeter is already an accuracy and n=images.size(0). Essentially, the number of correct samples is divided by the batch size twice.

Furthermore, the validate function will return different values for each process when using DDP. The results should be combined across all GPUs.
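
For reference, one way to combine them is to sum each process's correct-prediction and sample counts with dist.all_reduce() and only then form the percentage. A minimal sketch, assuming the process group is already initialized; the helper name and arguments are illustrative and not taken from main.py:

import torch
import torch.distributed as dist

def global_top1(correct, total, device):
    # Pack the per-process counts into one tensor so a single reduction suffices.
    stats = torch.tensor([correct, total], dtype=torch.float64, device=device)
    # Sum the counts over all DDP processes; every rank ends up with the same totals.
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    return 100.0 * stats[0].item() / stats[1].item()

Each rank would count its own correct predictions during validate() and call this once at the end, so every process reports the same dataset-wide top-1 accuracy.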

numpee avatar May 03 '21 09:05 numpee

Any update on this? Does this issue still persist?

bhattg avatar Feb 03 '22 22:02 bhattg

The issue still seems to be present. You could implement your own distributed average meter, or use something like TorchMetrics, which supports synchronizing metrics across GPUs - I haven't used it myself, but it seems to be well maintained.
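
If you go the TorchMetrics route, the metric object handles the cross-process synchronization itself. A rough sketch of how the validation loop might look (model, val_loader, and the constructor arguments here are placeholders, and the exact API can differ between TorchMetrics versions):

import torchmetrics

# task/num_classes follow the newer TorchMetrics API; adjust for your setup.
top1 = torchmetrics.Accuracy(task="multiclass", num_classes=1000).cuda()

for images, target in val_loader:
    output = model(images.cuda(non_blocking=True))
    top1.update(output, target.cuda(non_blocking=True))  # accumulates local state only

acc1 = top1.compute()  # syncs state across DDP processes before computing
top1.reset()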

numpee avatar Feb 04 '22 04:02 numpee

@d4l3k who may be interested in this

msaroufim avatar Mar 09 '22 20:03 msaroufim

I am using DDP on one node that has 8 GPUs and 24 CPUs. I want to know how I can fix the following error. Also, as for the loss, gradients, accuracy, and checkpoints, how do we take care of them in a single-node, multi-GPU situation? Could you please link me to a full example? Right now, since more than one GPU accesses the same file, the error is raised.

        with open (opt.outf+namefile,'a') as file:
            s = '{}, {},{:.15f}\n'.format(
                epoch,batch_idx,loss.data.item())
            print(s)
            file.write(s)
load models                                                                                                                                                           
Training network pretrained on imagenet.                                                                                                                              
training data: 3125 batches                                                                                                                                           
load models                                                                                                                                                           
Training network pretrained on imagenet.                                                                                                                              
Train Epoch: 1 [0/50000 (0%)]   Loss: 0.047746550291777                                                                                                               
Train Epoch: 1 [0/50000 (0%)]   Loss: 0.047966860234737                                                                                                               
Train Epoch: 1 [0/50000 (0%)]   Loss: 0.047879129648209                                                                                                               
Train Epoch: 1 [0/50000 (0%)]   Loss: 0.047865282744169                                                                                                               
Traceback (most recent call last):                                                                                                                                    
  File "train.py", line 1330, in <module>                                                                                                                             
    _runnetwork(epoch,trainingdata)                                                                                                                                   
  File "train.py", line 1308, in _runnetwork                                                                                                                          
    file.write(s)                                                                                                                                                     
OSError: [Errno 5] Input/output error                                                                                                                                 
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58392 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58393 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58394 closing signal SIGTERM                                                                    
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 58395) of binary: /anaconda/envs/azureml_py38/bin/python                 
Traceback (most recent call last):                                                                                                                                    
  File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main                                                                         
    return _run_code(code, main_globals, None,                                                                                                                        
  File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 87, in _run_code                                                                                    
    exec(code, run_globals)                                                                                                                                           
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>                                                   
    main()                                                                                                                                                            
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main                                                       
    launch(args)                                                                                                                                                      
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
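
The OSError above usually comes from several processes appending to the same file at once. A common workaround, sketched here under the assumption that init_process_group() has already been called, is to log only from rank 0:

import torch.distributed as dist

# Only the rank-0 process appends to the shared log file.
if dist.get_rank() == 0:
    with open(opt.outf + namefile, 'a') as file:
        s = '{}, {},{:.15f}\n'.format(epoch, batch_idx, loss.data.item())
        print(s)
        file.write(s)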

My other question is: do I need a DistributedSampler for the DataLoader if I only have one node?
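
For reference, a DistributedSampler is normally used with DDP regardless of the node count, since it is what gives each process a disjoint shard of the data. A minimal sketch, with train_dataset, epochs, and the DataLoader arguments as placeholders:

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# The sampler shards the dataset across all DDP ranks, whether they sit on
# one node or many, so each process sees a different subset per epoch.
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(train_dataset, batch_size=16, sampler=train_sampler,
                          num_workers=4, pin_memory=True)

for epoch in range(epochs):
    train_sampler.set_epoch(epoch)  # changes the shuffling each epoch
    for images, target in train_loader:
        ...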

Here's how I run my code: $ time python -m torch.distributed.launch --nproc_per_node=4 train.py --data MYDATA --out MYOUTPUT --gpuids 0 1 2 3 --batchsize 16

And here's the part of the code I changed when converting from DataParallel to DistributedDataParallel:


parser.add_argument("--local_rank", default=0, type=int)  # filled in per process by torch.distributed.launch

# Pin this process to its GPU before creating the process group.
torch.cuda.set_device(opt.local_rank)
torch.distributed.init_process_group(backend='nccl',
                                     init_method='env://',
                                     timeout=datetime.timedelta(seconds=5400))

# Wrap the model so gradients are averaged across processes during backward.
net = torch.nn.parallel.DistributedDataParallel(net.cuda(),
        device_ids=[opt.local_rank],
        output_device=opt.local_rank)

Do I also need a distributed optimizer and distributed gradients?

monajalal avatar Dec 13 '22 21:12 monajalal

I also have a similar issue. When I load the model (with DistributedDataParallel) from checkpoint A and fine-tune it in parallel mode I get 85% accuracy, while doing the same in non-parallel mode gives 67%. Both of these experiments are on only 1 GPU; the distributed one uses 10 workers. Do you @d4l3k maybe know why this is the case and which accuracy is actually correct?

Melika-Ayoughi avatar Sep 08 '23 17:09 Melika-Ayoughi

Is there any update on this issue? Update: I have checked the current code again, and it calls dist.all_reduce() before comparing the accuracy with best_acc, so I think everything is fine now.
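
Roughly the pattern being described, as a sketch rather than the exact main.py code (top1, best_acc, and device are placeholders):

import torch
import torch.distributed as dist

acc1 = torch.tensor([top1.avg], device=device)
# Sum the per-process accuracies and divide by the world size so every rank
# compares the same averaged value against best_acc.
dist.all_reduce(acc1, op=dist.ReduceOp.SUM)
acc1 /= dist.get_world_size()

is_best = acc1.item() > best_acc
best_acc = max(acc1.item(), best_acc)

Note that this averages per-rank averages, which matches the exact dataset-wide accuracy only when each rank sees the same number of validation samples.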

tungts1101 avatar Jan 09 '24 04:01 tungts1101