HOIM-PyTorch

Getting inferior results when using distributed training.


I was very fortunate to read your paper and found it very inspiring. While training the model I ran into a few small problems, and I hope you can kindly help me with them.

I'm trying to train the model with 3 RTX 2080Ti GPUs using:

CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node=3 scripts/train_hoim.py --debug --lr_warm_up -p ./logs/self-train-batch6_c/ --batch_size 2 --nw 5 --w_RCNN_loss_bbox 10.0 --epochs 22 --lr 0.003 --distributed

I encountered an issue with SyncBatchNorm, which raised "expected at least 3D input (got 2D input)". According to my investigation, SyncBatchNorm in PyTorch 1.2 is not compatible with BatchNorm1d layers that receive 2D input, so I substituted the BatchNorm1d in faster_rcnn_hoim.py with the following workaround:

import torch.nn as nn


class MyBatchNorm1d(nn.Module):
    """Wraps BatchNorm2d so the layer survives SyncBatchNorm conversion on 2D input."""

    def __init__(self, *args):
        super(MyBatchNorm1d, self).__init__()
        self.bn2d = nn.BatchNorm2d(*args)

    def forward(self, x):
        # (N, C) -> (N, C, 1, 1), normalize, then squeeze back to (N, C)
        x = x[..., None, None]
        x = self.bn2d(x)
        return x[..., 0, 0]
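
For what it's worth, here is a quick sanity check (my own snippet, not from the repo) confirming that the wrapper matches nn.BatchNorm1d on 2D input, since both normalize over the batch dimension:

import torch

bn1d = nn.BatchNorm1d(256)
wrapped = MyBatchNorm1d(256)
x = torch.randn(8, 256)
# BatchNorm2d on (N, C, 1, 1) computes the same batch statistics as BatchNorm1d on (N, C)
print(torch.allclose(bn1d(x), wrapped(x), atol=1e-6))  # True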

However, I ended up with inferior performance, even compared with a model trained with batch_size=2 on a single GPU.

I noticed you used a single GPU in the experiments of your paper. I'm wondering whether you encountered a similar issue, or whether I did not configure the training correctly.

Thanks!

0x4f5da2 avatar Jul 27 '20 05:07 0x4f5da2

Yes, I found the same phenomenon when using distributed training. The only difference is that I removed the BatchNorm1d layer instead of writing a customized wrapper. I still haven't figured out why. Possible reasons could be SyncBatchNorm or the buffer synchronization for the OIM loss. If you find any clue, please do not hesitate to share it here. Many thanks for your feedback 😊

dichen-cd avatar Jul 27 '20 14:07 dichen-cd

According to the documentation of DistributedSampler, I made the following changes to trainer.py:

@@ -75,6 +75,9 @@ def get_trainer(args, model, model_without_ddp, train_loader, optimizer, lr_sche
 
     @trainer.on(Events.EPOCH_STARTED)
     def _init_epoch(engine):
+        if args.distributed:
+            train_loader.batch_sampler.sampler.set_epoch(engine.state.epoch)
+            print("set epoch success")
         if engine.state.epoch == 1 and args.train.lr_warm_up:
             warmup_factor = 1. / 1000
             warmup_iters = len(train_loader) - 1

Also, I noticed that in the DDP tutorial the optimizer is initialized with the wrapped model, so I made the following changes to train_hoim.py:

@@ -73,14 +73,11 @@ def main(args):
                            )
     model.to(device)
 
-    optimizer = get_optimizer(args, model)
-    lr_scheduler = get_lr_scheduler(args, optimizer)
 
     if args.apex:
         from apex import amp
         model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
 
-    model_without_ddp = model
     if args.distributed:
         if args.apex:
             from apex.parallel import DistributedDataParallel, convert_syncbn_model
@@ -90,7 +87,11 @@ def main(args):
             model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
             model = torch.nn.parallel.DistributedDataParallel(
                 model, device_ids=[args.local_rank], find_unused_parameters=True)
-        model_without_ddp = model.module
+    
+    model_without_ddp = model
+
+    optimizer = get_optimizer(args, model)
+    lr_scheduler = get_lr_scheduler(args, optimizer)

Due to limited GPU resources in my lab recently, I was only able to train the model with two RTX 2070s on two machines connected via Ethernet, with batch_size=1 per GPU and the other configurations unchanged. The result (82.89% mAP) is reasonable compared with the previous distributed run (batch_size=2 on 3 GPUs, 62.74% mAP), but it is still not comparable to the model trained with batch_size=2 on a single GPU (86.92% mAP). By the way, due to the limited resources, I haven't figured out which of the changes accounts for the improvement.

I noticed the DDP documentation mentions that the gradients from each node are averaged. That is to say, with batch_size=1 per process the gradient is roughly halved compared with batch_size=2 on a single GPU. I wrote some toy code and confirmed that. So I doubled the learning rate and trained the model again, but this time the model did not converge properly.
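
For reference, this is roughly the kind of toy check I mean (a minimal sketch of my own, not the repo's code; it assumes a CPU gloo backend and a summed loss on the single-process side):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size, data, targets):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    torch.manual_seed(0)                      # same initial weights as the reference below
    model = DDP(nn.Linear(4, 1, bias=False))  # CPU model, no device_ids
    out = model(data[rank:rank + 1])          # each rank sees one sample (batch=1 per process)
    ((out - targets[rank:rank + 1]) ** 2).sum().backward()
    if rank == 0:
        # DDP averages gradients over the two ranks, so this is ~half the reference gradient
        print('DDP grad:', model.module.weight.grad)
    dist.destroy_process_group()


if __name__ == '__main__':
    torch.manual_seed(0)
    data, targets = torch.randn(2, 4), torch.randn(2, 1)

    # single-process reference: batch=2 with a summed loss
    torch.manual_seed(0)
    ref = nn.Linear(4, 1, bias=False)
    ((ref(data) - targets) ** 2).sum().backward()
    print('single-process grad:', ref.weight.grad)

    mp.spawn(worker, args=(2, data, targets), nprocs=2)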

Also, through some toy code, I did not find any unexpected behavior in SyncBatchNorm or in the buffer synchronization for the OIM loss. As far as I am concerned, the only remaining suspect is the zero padding applied when combining images into a batch, because with batch=1 that padding does not exist (apart from padding each image to a multiple of 16). But I still tend to think this is not the actual problem.
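
To make the padding concern concrete, here is a minimal sketch of my own (not the repo's collate code) of how zero padding grows with batch size; the helper name is just illustrative and the divisor of 16 follows the behavior described above:

import torch
import torch.nn.functional as F

def pad_to_batch(images, divisor=16):
    # pad every image with zeros up to the largest H/W in the batch,
    # rounded up to a multiple of `divisor`
    max_h = max(img.shape[-2] for img in images)
    max_w = max(img.shape[-1] for img in images)
    max_h = (max_h + divisor - 1) // divisor * divisor
    max_w = (max_w + divisor - 1) // divisor * divisor
    padded = [F.pad(img, (0, max_w - img.shape[-1], 0, max_h - img.shape[-2]))
              for img in images]
    return torch.stack(padded)

batch = pad_to_batch([torch.randn(3, 480, 640), torch.randn(3, 600, 800)])
print(batch.shape)  # torch.Size([2, 3, 608, 800]); the smaller image gets large zero borders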

If you have any further ideas on this issue, please let me know. Thanks a lot!

0x4f5da2 avatar Aug 03 '20 06:08 0x4f5da2

Many thanks for your feedback!

dichen-cd avatar Aug 06 '20 01:08 dichen-cd