DenseNet

Why did you use MomentumOptimizer? and dropout...

Open taki0112 opened this issue 8 years ago • 5 comments

Hello. After reading the DenseNet paper, I implemented it in TensorFlow (using MNIST data).

My questions are:

  1. In my experiments, AdamOptimizer performed better than MomentumOptimizer. Is this specific to MNIST? I have not experimented with CIFAR yet.

  2. For dropout, I apply it only to the bottleneck layers, not to the transition layers. Is this right?

  3. Is Batch Normalization applied only during training, or during both training and testing?

  4. What exactly is global average pooling, and how do I implement it in TensorFlow?

Please let me know if there were special reasons for these choices. Also, if you have time to look at my TensorFlow code, I would appreciate feedback on whether I implemented it correctly: https://github.com/taki0112/Densenet-Tensorflow

Thank you

taki0112 avatar Aug 08 '17 09:08 taki0112

Hello @taki0112

A1. As we mentioned in the paper, we directly followed ResNet's optimization settings (https://github.com/facebook/fb.resnet.torch), except that we train for 300 epochs instead of ~160. We didn't try any other optimizers.
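For reference, the step learning-rate schedule described in the paper (initial rate 0.1, divided by 10 at 50% and 75% of the total epochs) can be sketched as below. This is my reading of the paper's training setup, not the authors' actual training code:

```python
def learning_rate(epoch, total_epochs=300, base_lr=0.1):
    """Step schedule: divide the rate by 10 at 50% and 75% of training."""
    if epoch >= 0.75 * total_epochs:
        return base_lr / 100
    if epoch >= 0.5 * total_epochs:
        return base_lr / 10
    return base_lr
```

In fb.resnet.torch this schedule is used with SGD plus Nesterov momentum 0.9 and weight decay 1e-4, which is what MomentumOptimizer would correspond to in TensorFlow.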

A2. In our experiments, we applied dropout to every conv layer except the first one of the network. But I guess there should be no significant difference whether or not you apply dropout to the transition layers.

A3. This depends on what package you are using. Sorry, I'm not familiar with TensorFlow's details.
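To expand on the general behavior (independent of any particular framework): batch normalization uses the current mini-batch's statistics during training, while accumulating running averages; at test time it uses those running averages instead. A minimal NumPy sketch of these assumed semantics, not the DenseNet code:

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training, momentum=0.9, eps=1e-5):
    """Minimal 1-D batch norm sketch: batch statistics in training mode,
    running averages in test mode. Updates running stats in place."""
    if training:
        mean, var = x.mean(axis=0), x.var(axis=0)
        running_mean *= momentum; running_mean += (1 - momentum) * mean
        running_var *= momentum;  running_var += (1 - momentum) * var
    else:
        mean, var = running_mean, running_var
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```

In TF1 this mode switch is typically controlled by a boolean `training` argument (e.g. in `tf.layers.batch_normalization`), so the same graph behaves differently during training and testing.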

A4. Global average pooling means pooling each feature map down to a single number by taking its average. For example, given an 8x8 feature map, you average those 64 values to produce one number.

For TensorFlow usage questions like 3 and 4, you can probably find answers in the third-party TensorFlow implementations listed on our README page. Thanks

liuzhuang13 avatar Aug 08 '17 20:08 liuzhuang13

Thank you. I think I can do global average pooling as follows:

    def Global_Average_Pooling(x, stride=1):
        # Input is NHWC: shape[1] is height, shape[2] is width.
        height = int(x.shape[1])
        width = int(x.shape[2])
        pool_size = [height, width]
        # The stride value does not matter here, because the pooling
        # window already covers the entire feature map.
        return tf.layers.average_pooling2d(inputs=x, pool_size=pool_size, strides=stride)
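Equivalently, global average pooling is just a mean over the spatial axes; a NumPy version (an illustration, not tied to any particular framework) makes that explicit:

```python
import numpy as np

def global_average_pooling(x):
    """Average each feature map over its spatial dimensions.
    x: (batch, height, width, channels) -> (batch, channels)."""
    return x.mean(axis=(1, 2))
```

In TensorFlow, `tf.reduce_mean(x, axis=[1, 2])` computes the same thing without constructing an explicit pooling layer.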

But I have some questions.

  1. I experimented on MNIST with a 100-layer network and growth_k = 12. However, the result is worse than with 20 layers: training is very slow, and accuracy improves only marginally.

  2. Why is there no Transition Layer (4) in the paper? There are only three (Dense Block + Transition Layer) pairs, then the final dense block and the classification layer. What is the reason?

taki0112 avatar Aug 11 '17 08:08 taki0112

@taki0112

  1. Most people train networks with fewer than 5 layers and achieve very high accuracy on MNIST because it is such a simple dataset. If you train too large a network on MNIST, it may overfit the training set, and test accuracy may get worse. Thanks

  2. Because transition layers serve the purpose of downsampling. At the end, global average pooling performs the final downsampling, but we don't call it a transition layer.
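To make the downsampling pattern concrete: with the CIFAR layout described in the paper (three dense blocks, each transition layer halving the spatial size, and global average pooling collapsing the last map to 1x1), the feature-map side lengths can be sketched as follows. This is my own illustration of the structure, not the authors' code:

```python
def spatial_sizes(input_size=32, num_blocks=3):
    """Side length of the feature maps entering each dense block; each
    transition layer between blocks halves the spatial resolution, and
    global average pooling reduces the last map to 1x1."""
    return [input_size // (2 ** i) for i in range(num_blocks)]
```

So for 32x32 CIFAR input, the three dense blocks operate at 32, 16, and 8, which is why only two transition layers are needed.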

liuzhuang13 avatar Aug 11 '17 09:08 liuzhuang13

I think the author has explained this well. Regarding dropout: why didn't you use it in the ImageNet case? Is it because the dataset is large enough that it isn't needed? Dropout is often used before the fully connected layer, but you did not use it for either ImageNet or CIFAR10. Why? Thanks

John1231983 avatar Aug 11 '17 12:08 John1231983

@John1231983 Because ImageNet is big and we also use heavy data augmentation, we don't use dropout. This also follows our base code framework, fb.resnet.torch.

For CIFAR10, when we use data augmentation (C10+), we don't use dropout. When we don't use data augmentation (C10), we do use dropout. We mentioned this in the paper.

liuzhuang13 avatar Oct 18 '17 05:10 liuzhuang13