Incorrect accuracy calculation with DistributedDataParallel - examples/imagenet/main.py
The accuracy() function here divides the number of correct samples by the batch size. That result is then fed to the top1 AverageMeter with n=images.size(0), and top1.avg is returned. This is incorrect: since the value passed to the top1 AverageMeter is already an accuracy rather than a raw count, and n=images.size(0), the number of correct samples is essentially divided by the batch size twice.
Furthermore, the validate function will return different values for each process when using DDP. The results should be combined across all GPUs.
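For illustration, a minimal sketch of one way to combine the results across GPUs (this is my own helper, not the code in main.py, and it assumes torch.distributed has been initialized): accumulate the raw correct and sample counts in each process and all-reduce them before computing the final accuracy.

import torch
import torch.distributed as dist

def global_accuracy(correct, total, device):
    # Sum the per-process correct-prediction and sample counts,
    # then compute a single accuracy that is identical on every rank.
    stats = torch.tensor([correct, total], dtype=torch.float64, device=device)
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    return 100.0 * stats[0].item() / stats[1].item()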
Any update on this? Is this issue still present?
The issue still seems to be present. You could implement your own distributed average meter, or use something like TorchMetrics, which supports syncing metrics across GPUs. I haven't used it myself, but it seems to be well maintained.
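For example, a rough sketch with TorchMetrics (assuming a recent torchmetrics version where Accuracy takes a task argument; by default its state is synchronized across DDP processes when compute() is called):

import torchmetrics

# Create the metric after init_process_group() so it is aware of the process group.
acc_metric = torchmetrics.Accuracy(task="multiclass", num_classes=1000).to(device)

for images, target in val_loader:
    output = model(images)            # logits of shape [batch, num_classes]
    acc_metric.update(output, target)

top1 = acc_metric.compute()           # fraction in [0, 1], aggregated over all processes
acc_metric.reset()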
@d4l3k who may be interested in this
I am using DDP on one node that has 8 GPUs and 24 CPUs. I want to know how I can fix the following error. Also, how should loss, gradients, accuracy, and checkpoints be handled in a single-node, multi-GPU setup? Could you please link me to a full example? Right now, since more than one GPU accesses the same file, the error below is raised.
with open(opt.outf + namefile, 'a') as file:
    s = '{}, {}, {:.15f}\n'.format(
        epoch, batch_idx, loss.data.item())
    print(s)
    file.write(s)
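A common way to avoid several processes writing to the same file is to log only from rank 0. A minimal sketch of that idea, reusing opt.outf and namefile from the snippet above and assuming torch.distributed is already initialized:

import torch.distributed as dist

# Only rank 0 performs the file I/O; the other ranks skip it entirely.
if not dist.is_initialized() or dist.get_rank() == 0:
    with open(opt.outf + namefile, 'a') as file:
        s = '{}, {}, {:.15f}\n'.format(epoch, batch_idx, loss.data.item())
        print(s)
        file.write(s)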
load models
Training network pretrained on imagenet.
training data: 3125 batches
load models
Training network pretrained on imagenet.
Train Epoch: 1 [0/50000 (0%)] Loss: 0.047746550291777
Train Epoch: 1 [0/50000 (0%)] Loss: 0.047966860234737
Train Epoch: 1 [0/50000 (0%)] Loss: 0.047879129648209
Train Epoch: 1 [0/50000 (0%)] Loss: 0.047865282744169
Traceback (most recent call last):
  File "train.py", line 1330, in <module>
    _runnetwork(epoch,trainingdata)
  File "train.py", line 1308, in _runnetwork
    file.write(s)
OSError: [Errno 5] Input/output error
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58392 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58393 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58394 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 58395) of binary: /anaconda/envs/azureml_py38/bin/python
Traceback (most recent call last):
  File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/anaconda/envs/azureml_py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
My other questions: do I need a DistributedSampler for the DataLoader if I only have one node?
Here's how I run my code:
$ time python -m torch.distributed.launch --nproc_per_node=4 train.py --data MYDATA --out MYOUTPUT --gpuids 0 1 2 3 --batchsize 16
And here's the part of the code I changed when moving from DataParallel to DistributedDataParallel:
parser.add_argument("--local_rank", default=0, type=int)
torch.cuda.set_device(opt.local_rank)
torch.distributed.init_process_group(backend='nccl',
                                     init_method='env://',
                                     timeout=datetime.timedelta(seconds=5400))
net = torch.nn.parallel.DistributedDataParallel(net.cuda(),
                                                device_ids=[opt.local_rank],
                                                output_device=opt.local_rank)
Do I also need a distributed optimizer and distributed gradients?
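For what it's worth, the usual single-node pattern looks roughly like the sketch below (the dataset, network, and epoch-count names are placeholders for whatever the script defines): a DistributedSampler is still needed so each process sees its own shard of the data, while a plain optimizer is enough because DistributedDataParallel already all-reduces the gradients during backward().

import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(train_dataset)          # one shard of the data per process
loader = DataLoader(train_dataset, batch_size=16, sampler=sampler,
                    num_workers=4, pin_memory=True)

optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)                         # reshuffle differently each epoch
    for images, target in loader:
        images = images.cuda(opt.local_rank, non_blocking=True)
        target = target.cuda(opt.local_rank, non_blocking=True)
        loss = torch.nn.functional.cross_entropy(net(images), target)
        optimizer.zero_grad()
        loss.backward()                              # DDP synchronizes gradients here
        optimizer.step()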
I have a similar issue. When I load the model (with DistributedDataParallel) from checkpoint A and fine-tune it in parallel mode, I get 85% accuracy, while doing the same in non-parallel mode gives 67%. Both experiments run on only 1 GPU; the distributed one uses 10 workers. @d4l3k, do you perhaps know why this is the case and which accuracy is actually correct?
Is there any update on this issue?
Update: I have checked the current code again; it now calls dist.all_reduce() before comparing with best_acc, so I think everything is fine now.