
DenseNet-121 is faster than CondenseNet-74 (C=G=4) on GTX 1080 Ti

Open · ivankreso opened this issue · 3 comments

I compared the forward-pass speed of the larger ImageNet model against DenseNet-121, and the latter actually runs faster. After benchmarking, my guess is that the CondenseConv layer is the cause of the slowdown, due to memory transfers in ShuffleLayer and torch.index_select. @ShichenLiu can you comment on this? Did you get better performance compared to DenseNet-121 in your experiments?
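
For reference, the forward pass can be timed with a sketch along these lines (the `benchmark` helper below is only illustrative, not the exact script used here); the `torch.cuda.synchronize()` calls matter because kernel launches are asynchronous:

```python
# Illustrative forward-pass timing sketch (assumes a CUDA device is available).
import time
import torch

def benchmark(model, input_size=(1, 3, 224, 224), warmup=10, iters=100):
    model = model.cuda().eval()
    x = torch.randn(*input_size).cuda()
    with torch.no_grad():
        for _ in range(warmup):          # warm up kernels / cuDNN autotuner
            model(x)
        torch.cuda.synchronize()         # flush all queued GPU work
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()         # wait for the last forward pass
    return (time.time() - start) / iters * 1000.0  # ms per forward pass
```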

— ivankreso, Nov 30 '17 09:11

Our model is mainly designed for mobile devices, on which the actual inference time correlates strongly with the theoretical complexity. However, group convolution and the index/shuffle operations are not efficiently implemented on GPUs.
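
Concretely, the two operations in question look roughly like the sketch below (illustrative, not copied verbatim from this repo). Both the `.contiguous()` copy after the transpose and the gather are extra full passes over the feature map, which is what hurts on a GPU:

```python
# Illustrative sketch of the memory-bound ops discussed here.
import torch

def channel_shuffle(x, groups):
    # reshape -> transpose -> contiguous: the .contiguous() call materializes
    # a full copy of the feature map, i.e. pure memory traffic
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

def select_inputs(x, index):
    # gathers the retained input channels; another full read/write of x
    return torch.index_select(x, dim=1, index=index)
```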

— ShichenLiu, Nov 30 '17 13:11

GPUs tend to be memory-bound rather than compute-bound, in particular for small models that require additional memory transfers, such as ShuffleNets and CondenseNets. On mobile devices, embedded systems, etc., the ratio between compute (in FLOPS) and memory bandwidth is very different: convnets tend to be compute-bound on such platforms. If you ran the same comparison on such a platform, you would find that a CondenseNet is much faster than a DenseNet (see Table 5 of the paper for actual timing results on an ARM processor).
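
To make the argument concrete, here is a rough back-of-the-envelope calculation: the GPU figures are approximate published peak specs, and the mobile figures are purely illustrative assumptions.

```python
# Rough roofline-style arithmetic; all numbers approximate / assumed,
# only meant to illustrate the compute-vs-bandwidth argument above.
gpu_peak_flops = 11.3e12      # GTX 1080 Ti, ~11.3 TFLOP/s FP32 peak
gpu_bandwidth  = 484e9        # GTX 1080 Ti, ~484 GB/s memory bandwidth

arm_peak_flops = 20e9         # hypothetical mobile ARM core, ~20 GFLOP/s
arm_bandwidth  = 10e9         # hypothetical LPDDR bandwidth, ~10 GB/s

# "Ridge point": FLOPs a kernel must do per byte moved to be compute-bound.
print(gpu_peak_flops / gpu_bandwidth)   # ~23 FLOPs/byte on the GPU
print(arm_peak_flops / arm_bandwidth)   # ~2 FLOPs/byte on the mobile CPU
```

A low-FLOP model with extra shuffle/gather traffic falls well below the GPU's ridge point (memory-bound), while the same model sits above the mobile CPU's ridge point (compute-bound), which is why the FLOP savings translate into wall-clock savings only on the latter.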

— lvdmaaten, Nov 30 '17 15:11

Thanks for the clarification. I already suspected that was the reason after I measured the time spent in the bottleneck 1x1 layer and the grouped 3x3 layer: the forward pass spends twice as much time in the 1x1 layer as in the 3x3 layer. I think there is a way to avoid the additional memory transfers on GPUs if the cuDNN implementation lets you specify a custom feature-map ordering after a grouped convolution. I don't know whether this feature is available in cuDNN, but if it is, you could remove all feature-shuffling ops. See the sketch below for a related workaround.
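
Even without cuDNN support for custom output ordering, the shuffle could in principle be folded away at inference time whenever a fixed shuffle is immediately followed by an index_select over the same tensor, by composing the two permutations offline. A minimal sketch under that simplifying assumption (the helper and the example indices are hypothetical):

```python
# Sketch: fold a fixed channel-shuffle permutation into the learned
# index_select indices of the following layer, leaving a single gather.
import torch

def shuffle_permutation(channels, groups):
    # source channel for each output position of a standard channel shuffle
    return torch.arange(channels).view(groups, channels // groups).t().reshape(-1)

channels, groups = 8, 2
x = torch.randn(1, channels, 4, 4)
index = torch.tensor([0, 3, 5, 6])            # example "learned" indices

perm = shuffle_permutation(channels, groups)
folded = perm[index]                          # compose the two gathers offline

a = torch.index_select(x[:, perm], 1, index)  # shuffle, then select
b = torch.index_select(x, 1, folded)          # single fused gather
assert torch.equal(a, b)
```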

— ivankreso, Dec 01 '17 10:12