pytorch-deeplab-xception reproduced result is low on ResNet backbone

reproduced result is low on ResNet backbone

Open youngwanLEE opened this issue 5 years ago • 22 comments

I tried to reproduce the ResNet backbone result (78.43%) reported in README.MD using just this command.

Of course, I prepared the SBD dataset and trained the model on VOC2012 + SBD.

However, my result (75.06%) is lower than the reported one. How can I get the reported result?

Is the reported result (78.43%) a fine-tuned result after COCO pre-training?

By the way, the TensorBoard curves look weird: while the training loss converged, the validation loss increased.

May 03 '19 01:05 youngwanLEE

I got the same curves. Did you find an explanation? Could the PyTorch version make a difference?

May 12 '19 21:05 theevann

I also obtained a lower result. MIOU 74.58%

May 14 '19 01:05 cxxgtxy

I obtained a similar low result too. mIOU 74.15%. Any solution?

May 29 '19 05:05 lightas

@lightas I contacted the owner and he said that he froze batch norm during training.
Be careful however that the --freeze-bn parameter doesn't work currently in the code !

Since I am interested in Cityscapes, I focused my time on it:

I tried with frozen batch-norm (w/ cityscapes) but it didn't help.
I obtained some correct results (74%) (w/ cityscapes) by lowering the LR at a faster rate using the step scheduler. So maybe you could try the step scheduler ?
I am currently training again (w/ cityscapes) with the poly scheduler but changing the lr_power (in file utils/lr_scheduler.py):
```
 self.lr * pow((1 - 1.0 * T / self.N), lr_power)
```
I'm trying lr_power=3 and it is giving better results in my case, by lowering the learning rate faster.
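
As a rough illustration of why a larger lr_power lowers the learning rate faster, here is a minimal sketch of the poly formula above. The function name `poly_lr`, the base learning rate of 0.01, and the 100-iteration horizon are made up for this example. The baseline power of 0.9 is also an assumption: it is the value commonly used for DeepLab-style poly schedules.

```python
def poly_lr(base_lr, cur_iter, total_iters, lr_power=0.9):
    """Poly schedule: base_lr * (1 - cur_iter / total_iters) ** lr_power."""
    return base_lr * (1.0 - cur_iter / total_iters) ** lr_power


# Illustrative values only (not taken from the repository defaults).
base_lr = 0.01
total_iters = 100

# Compare the assumed baseline power (0.9) with lr_power=3 at a few points.
for cur_iter in (0, 25, 50, 75, 99):
    baseline = poly_lr(base_lr, cur_iter, total_iters, lr_power=0.9)
    faster = poly_lr(base_lr, cur_iter, total_iters, lr_power=3)
    print(f"iter {cur_iter:3d}: power=0.9 -> {baseline:.6f}, power=3 -> {faster:.6f}")
```

Both schedules start at the same value and reach zero at the end, but with lr_power=3 the learning rate is already much smaller by the middle of training.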

Notes:

Train on multiple GPUs, since a large enough batch size is very important. (I used 4 GPUs with a batch size of 16: 4 on each GPU.)
Train with a small crop size: this allows processing more images per batch and stabilizes the training. (I used 550x550.)
Remove image-level features in the ASPP module of DeepLab. (Not a big deal though.)

I haven't retried on Pascal VOC yet, so I don't know if this makes a difference.

Also note that the new Torchvision v0.3 has deeplabv3 "built-in".

May 29 '19 09:05 theevann

@theevann Thank you so much. I will try it.

May 29 '19 10:05 lightas

@theevann Hi, would you please tell me why you said that the --freeze-bn parameter doesn't work? I didn't find out why it doesn't work.

May 30 '19 07:05 lightas

@lightas The freeze-bn parameter puts BatchNorm into eval mode at model initialization. But then in the training loop you call model.train(), which sets BatchNorm back to training mode...
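
A minimal sketch of this behaviour, assuming a plain PyTorch install. The tiny `nn.Sequential` model and the `freeze_bn` helper are hypothetical and only stand in for the repository's model and its --freeze-bn handling:

```python
import torch.nn as nn


def freeze_bn(model: nn.Module) -> None:
    """Put every BatchNorm2d layer into eval mode so running stats stop updating."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()


# Hypothetical toy model, not the DeepLab network from the repository.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
bn = model[1]

freeze_bn(model)    # freeze once, as done at model initialization
print(bn.training)  # BatchNorm is in eval mode here

model.train()       # recursively switches every submodule back to training mode
print(bn.training)  # the freeze has been undone

freeze_bn(model)    # re-apply the freeze after each model.train() call
print(bn.training)  # BatchNorm is in eval mode again
```

Note that eval mode only stops the running statistics from updating; the BatchNorm weight and bias still receive gradients unless `requires_grad` is also disabled on them.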

Jun 03 '19 08:06 theevann

> he froze batch norm during training.

So the synchronized batch norm does not work in the code? Sync BN should improve the performance rather than decrease it.

Jun 05 '19 18:06 pengwangucla

@theevann I got it. Thank you so much.

Jun 06 '19 07:06 lightas

@youngwanLEE Hi, how did you generate those curves? Thank you!

Aug 16 '19 07:08 beizhengren

    def training(self, epoch):
        train_loss = 0.0
        self.model.train()
        # model.train() puts every layer back into training mode, so switch
        # the BatchNorm layers to eval mode again to keep them frozen.
        for m in self.model.modules():
            if isinstance(m, (SynchronizedBatchNorm2d, nn.BatchNorm2d)):
                m.eval()
        tbar = tqdm(self.train_loader)


or

    def training(self, epoch):
        train_loss = 0.0
        self.model.train()

        # Re-freeze BatchNorm after model.train(), but only when --freeze-bn is set.
        if self.args.freeze_bn:
            for m in self.model.modules():
                if isinstance(m, (SynchronizedBatchNorm2d, nn.BatchNorm2d)):
                    m.eval()

        tbar = tqdm(self.train_loader)
        num_img_tr = len(self.train_loader)

Oct 15 '19 02:10 yangninghua

def make_data_loader(args, **kwargs):

    if args.dataset == 'pascal':
        train_set = pascal.VOCSegmentation(args, split='train')
        val_set = pascal.VOCSegmentation(args, split='val')
        if args.use_sbd:
            sbd_train = sbd.SBDSegmentation(args, split=['train', 'val'])
            train_set = combine_dbs.CombineDBs([train_set, sbd_train], excluded=[val_set])

        num_class = train_set.NUM_CLASSES
        # `you.nums` is a placeholder, not a real variable: substitute your
        # own number of classes here.
        num_class = (you.nums+1)

Oct 15 '19 04:10 yangninghua

@youngwanLEE, what are your training parameters? The max mIoU I get is 0.61 and I'm confused. Thank you.

Oct 17 '19 06:10 wtsitp

@theevann Thanks a lot for sharing the useful modifications. I find that the DeepLabv3+ paper doesn't give the mIOU for the ResNet101 backbone, only for Xception. And for Cityscapes, the official repo's model zoo only includes the Xception and MobileNetv2 backbones. So how do you know the right mIOU range for DeepLabv3+ with ResNet101 as the backbone? Also, could you share some more details about your Cityscapes experiments? Did you use the additional coarse data? And did you set the output stride to 8 or 16?

Thanks a lot!

Oct 20 '19 06:10 licj15

Same problem. I got 74.**% training on VOC2012 + SBD with ResNet; I cannot reach 78%. It seems that this code base is not good for reproduction.

Oct 28 '19 02:10 GuoleiSun

What is your training config? The accuracy I get when I train is 61%.


Oct 28 '19 03:10 wtsitp

I use "train_voc.py". Actually, I think my problem is exactly the same as that of the author of this issue. But the issue does not seem to be solved.

Oct 28 '19 04:10 GuoleiSun

Basically, what I want to do is reproduce "78.43%" using "train_voc.py". Of course, I prepared the SBD dataset and trained the model on VOC2012 + SBD. But I only got "74%", which is much lower than I expected.

Oct 28 '19 04:10 GuoleiSun

Hi guys, I will share some of my experiment settings.

Dataset: Cityscapes without the additional coarse data
Backbone: Resnet101
output_stride: 16
initial_learning_rate: 0.005
learning_decay: poly
training_epochs: 240
batch_size: 8
num_gpus: 4
train_size: 768*768
no sync_bn
others: the default values of train.py

Running inference at 2048*1024 on the Cityscapes val set, I got ~74% mIOU. I am not sure whether that is the right mIOU.

Oct 28 '19 18:10 licj15

Hi guys,

For those who want to reproduce results on deeplab V3, I recommend this code: https://github.com/chenxi116/DeepLabv3.pytorch The code can simply reproduce 76.8% mIOU in Pascal val (trained on VOC2012 + SBD).

Oct 29 '19 08:10 GuoleiSun

> I also obtained a lower result. MIOU 74.58%

Can I have your QQ? I get very poor performance on my own dataset, trained on 8 GPUs. Thank you very much.

Nov 22 '19 12:11 XUYUNYUN666

> Hi guys, I will share some of my experiment settings.
>
> Dataset: Cityscapes without the additional coarse data Backbone: Resnet101 output_stride: 16 initial_learning_rate: 0.005 learning_decay: poly training_epochs: 240 batch_size: 8 num_gpus: 4 train_size: 768*768 no sync_bn others: the default values of train.py
>
> Running inference at 2048*1024 on the Cityscapes val set, I got ~74% mIOU. I am not sure whether that is the right mIOU.

Can I have your QQ number? I really want to get your help. I trained on my own dataset with 8 GPUs and got very poor performance.

Nov 22 '19 12:11 XUYUNYUN666
