
Distributed execution on single machine?

Open tolsicsse opened this issue 2 years ago • 6 comments

Are there any instructions to follow for setting up distributed execution? Preferably for execution on a single machine with multiple GPUs. I can see that it should be possible, but I don't understand how to do it.

tolsicsse avatar Nov 29 '22 07:11 tolsicsse

Yes, it is possible. I intend to update the README with a lot of things, including this, after I merge your PR. For the time being, please refer to the following. For example, if you intend to train on 2 GPUs, the command should look like this:

python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --model fasterrcnn_resnet18 --config data_configs/voc.yaml --world-size 2 --batch-size 32 --workers 2 --epochs 135 --use-train-aug --project-name fasterrcnn_resnet18_voc_aug_135e

After python -m torch.distributed.launch --nproc_per_node=2 --use_env, the usual training command follows.
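Under the hood, each spawned process picks up its rank from environment variables (that is what --use_env is for) and joins the process group before the model is wrapped in DistributedDataParallel. A rough sketch of that setup, illustrative only and not the exact code in train.py:

```python
import os

import torch
import torch.distributed as dist

def init_distributed():
    # With --use_env, torch.distributed.launch exports these variables
    # for every process it spawns (one process per GPU).
    local_rank = int(os.environ["LOCAL_RANK"])
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return local_rank

if __name__ == "__main__":
    local_rank = init_distributed()
    # model = build_model(...)  # build Faster R-CNN as usual, then:
    # model = torch.nn.parallel.DistributedDataParallel(
    #     model.to(local_rank), device_ids=[local_rank]
    # )
```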

sovit-123 avatar Nov 29 '22 09:11 sovit-123

I got the distributed execution to work. However, I notice two things:

  • the performance is logged several times in result.csv (not a big issue, easy to fix afterwards)
  • the best mAP does not seem to be kept after resuming training, so the process appears to start from scratch and loses the best model. This might also be true when not using several GPUs

tolsicsse avatar Dec 06 '22 10:12 tolsicsse

I will check the logging issue that you mention above. Regarding the best model issue, can you elaborate on whether you are trying to resume training or using the best model weights on some other dataset without resuming? In any case, if you want to resume training, I recommend using last_model.pth as it also loads the optimizer state dictionary. The mAP value with this model when resuming training will be close to what you stopped the training with, but most probably it will not be the same. It will take 2-3 epochs to get back to the same mAP.
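For reference, resuming from last_model.pth works roughly along these lines. This is a simplified sketch, not the exact code in train.py, and the best_map entry is only an illustration of how the best value could be carried across resumes so that a lower post-resume mAP does not overwrite a better saved model:

```python
import torch

def save_checkpoint(path, model, optimizer, epoch, best_map):
    # Keep everything needed to resume: weights, optimizer state, epoch counter,
    # and (hypothetically) the best mAP seen so far.
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "best_map": best_map,
        },
        path,
    )

def load_checkpoint(path, model, optimizer):
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    start_epoch = checkpoint["epoch"] + 1
    # Fall back to 0.0 for checkpoints that do not store the best mAP.
    best_map = checkpoint.get("best_map", 0.0)
    return start_epoch, best_map
```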

sovit-123 avatar Dec 06 '22 10:12 sovit-123

[screenshot: mAP per epoch plot]

I use last_model.pth and, as you can see above, at epoch 200 I resumed training. The performance drops, and it then starts to save new best mAP values, although it was better before. Also, at epoch 200 I started training on 4 GPUs instead of 1, which might be the reason why the variance decreased.

tolsicsse avatar Dec 06 '22 11:12 tolsicsse

Also, when I resume training with last_model.pth after training on 4 GPUs, I get the error below. It seems that the class head is not saved correctly.

Traceback (most recent call last):
  File "fasterrcnn-pytorch-training-pipeline-new/train.py", line 505, in <module>
    main(args)
  File "fasterrcnn-pytorch-training-pipeline-new/train.py", line 270, in main
    old_classes = ckpt_state_dict['roi_heads.box_predictor.cls_score.weight'].shape[0]
KeyError: 'roi_heads.box_predictor.cls_score.weight'

(The same traceback is printed by each of the spawned processes.)

tolsicsse avatar Dec 06 '22 12:12 tolsicsse

Thanks for reporting this. I will surely check it out. I am not sure how the head would differ on 4 GPUs compared to a single GPU, but I will look into it.
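One possibility I will check: a state dictionary saved from a model wrapped in DistributedDataParallel has every key prefixed with module., which would explain the missing roi_heads.box_predictor.cls_score.weight key. Something along these lines could strip the prefix when loading (only a sketch of a possible fix; the model_state_dict key name is illustrative):

```python
import torch

def strip_ddp_prefix(state_dict):
    # Checkpoints saved from a DistributedDataParallel-wrapped model store keys
    # as 'module.roi_heads...'; remove the prefix so a plain model can load them.
    return {
        (key[len("module."):] if key.startswith("module.") else key): value
        for key, value in state_dict.items()
    }

checkpoint = torch.load("last_model.pth", map_location="cpu")
ckpt_state_dict = strip_ddp_prefix(checkpoint["model_state_dict"])
old_classes = ckpt_state_dict["roi_heads.box_predictor.cls_score.weight"].shape[0]
```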

I also think that I need to implement SyncBN for distributed training, which I have not done yet.
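That conversion is a one-liner in PyTorch. Roughly, what I would add before wrapping the model (a sketch, not current code):

```python
import torch

def wrap_for_distributed(model, local_rank):
    # Convert every BatchNorm layer to SyncBatchNorm so that batch statistics
    # are synchronized across GPUs, then wrap the model in DDP.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.to(local_rank)
    return torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```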

sovit-123 avatar Dec 06 '22 13:12 sovit-123