The benchmark program built from the latest ncnn update shows a performance regression

Log or error output

Old version test results: 816x32

Convolution              Conv_0                             0.76ms    |     [816,  32,   3 *1] -> [408,  16,   2 *4]         kernel: 3 x 3     stride: 2 x 2
HardSwish                Div_0                              0.36ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]
Convolution              Conv_1                             0.31ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]         kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_0                             0.10ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]
ConvolutionDepthWise     Conv_2                             0.57ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]         kernel: 3 x 3     stride: 1 x 1
ReLU                     Relu_1                             0.09ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]
Split                    splitncnn_0                        0.01ms    |
Pooling                  GlobalAveragePool_0                0.08ms    |     [408,  16,   2 *4] -> [  2 *4]
Convolution              Conv_3                             0.04ms    |               [  2 *4] -> [  2 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_2                             0.01ms    |               [  2 *1] -> [  2 *1]
Convolution              Conv_4                             0.03ms    |               [  2 *1] -> [  8 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_0                      0.02ms    |               [  8 *1] -> [  2 *4]
BinaryOp                 Mul_1                              0.13ms    |
Convolution              Conv_5                             0.48ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]         kernel: 1 x 1     stride: 1 x 1
Convolution              Conv_6                             0.57ms    |     [408,  16,   2 *4] -> [408,  16,  10 *4]         kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_3                             0.68ms    |     [408,  16,  10 *4] -> [408,  16,  10 *4]
ConvolutionDepthWise     Conv_7                             2.53ms    |     [408,  16,  10 *4] -> [408,   8,  10 *4]         kernel: 3 x 3     stride: 1 x 2
ReLU                     Relu_4                             0.47ms    |     [408,   8,  10 *4] -> [408,   8,  10 *4]
Convolution              Conv_8                             0.79ms    |     [408,   8,  10 *4] -> [408,   8,   4 *4]         kernel: 1 x 1     stride: 1 x 1
Split                    splitncnn_1                        0.01ms    |
Convolution              Conv_9                             0.69ms    |     [408,   8,   4 *4] -> [408,   8,  12 *4]         kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_5                             0.52ms    |     [408,   8,  12 *4] -> [408,   8,  12 *4]
ConvolutionDepthWise     Conv_10                            1.01ms    |     [408,   8,  12 *4] -> [408,   8,  12 *4]         kernel: 3 x 3     stride: 1 x 1
ReLU                     Relu_6                             0.30ms    |     [408,   8,  12 *4] -> [408,   8,  12 *4]
Convolution              Conv_11                            1.04ms    |     [408,   8,  12 *4] -> [408,   8,   4 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_3                              0.24ms    |
Convolution              Conv_12                            0.78ms    |     [408,   8,   4 *4] -> [408,   8,  12 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_1                              0.33ms    |     [408,   8,  12 *4] -> [408,   8,  12 *4]
ConvolutionDepthWise     Conv_13                            2.41ms    |     [408,   8,  12 *4] -> [408,   4,  12 *4]         kernel: 5 x 5     stride: 1 x 2
HardSwish                Div_2                              0.34ms    |     [408,   4,  12 *4] -> [408,   4,  12 *4]
Split                    splitncnn_2                        0.02ms    |
Pooling                  GlobalAveragePool_1                0.13ms    |     [408,   4,  12 *4] -> [ 12 *4]
Convolution              Conv_14                            0.15ms    |               [ 12 *4] -> [ 12 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_7                             0.02ms    |               [ 12 *1] -> [  3 *4]
Convolution              Conv_15                            0.03ms    |               [  3 *4] -> [ 48 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_1                      0.21ms    |               [ 48 *1] -> [ 12 *4]
BinaryOp                 Mul_4                              0.19ms    |
Convolution              Conv_16                            0.91ms    |     [408,   4,  12 *4] -> [408,   4,   6 *4]         kernel: 1 x 1     stride: 1 x 1
Split                    splitncnn_3                        0.01ms    |
Convolution              Conv_17                            1.16ms    |     [408,   4,   6 *4] -> [408,   4,  30 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_3                              0.61ms    |     [408,   4,  30 *4] -> [408,   4,  30 *4]
ConvolutionDepthWise     Conv_18                            2.28ms    |     [408,   4,  30 *4] -> [408,   4,  30 *4]         kernel: 5 x 5     stride: 1 x 1
HardSwish                Div_4                              0.76ms    |     [408,   4,  30 *4] -> [408,   4,  30 *4]
Split                    splitncnn_4                        0.01ms    |
Pooling                  GlobalAveragePool_2                0.31ms    |     [408,   4,  30 *4] -> [ 30 *4]
Convolution              Conv_19                            0.14ms    |               [ 30 *4] -> [ 30 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_8                             0.15ms    |               [ 30 *1] -> [ 30 *1]
Convolution              Conv_20                            0.11ms    |               [ 30 *1] -> [120 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_2                      0.09ms    |               [120 *1] -> [ 30 *4]
BinaryOp                 Mul_7                              0.68ms    |
Convolution              Conv_21                            2.02ms    |     [408,   4,  30 *4] -> [408,   4,   6 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_12                             0.32ms    |
Split                    splitncnn_5                        0.01ms    |
Convolution              Conv_22                            1.17ms    |     [408,   4,   6 *4] -> [408,   4,  30 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_5                              0.80ms    |     [408,   4,  30 *4] -> [408,   4,  30 *4]
ConvolutionDepthWise     Conv_23                            2.23ms    |     [408,   4,  30 *4] -> [408,   4,  30 *4]         kernel: 5 x 5     stride: 1 x 1
HardSwish                Div_6                              0.80ms    |     [408,   4,  30 *4] -> [408,   4,  30 *4]
Split                    splitncnn_6                        0.01ms    |
Pooling                  GlobalAveragePool_3                0.29ms    |     [408,   4,  30 *4] -> [ 30 *4]
Convolution              Conv_24                            0.23ms    |               [ 30 *4] -> [ 30 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_9                             0.02ms    |               [ 30 *1] -> [ 30 *1]
Convolution              Conv_25                            0.23ms    |               [ 30 *1] -> [120 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_3                      0.02ms    |               [120 *1] -> [ 30 *4]
BinaryOp                 Mul_10                             0.60ms    |
Convolution              Conv_26                            1.59ms    |     [408,   4,  30 *4] -> [408,   4,   6 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_17                             0.32ms    |
Split                    splitncnn_7                        0.01ms    |
Convolution              Conv_27                            0.76ms    |     [408,   4,   6 *4] -> [408,   4,  16 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_7                              0.36ms    |     [408,   4,  16 *4] -> [408,   4,  16 *4]
ConvolutionDepthWise     Conv_28                            1.51ms    |     [408,   4,  16 *4] -> [408,   4,  16 *4]         kernel: 5 x 5     stride: 1 x 1
HardSwish                Div_8                              0.41ms    |     [408,   4,  16 *4] -> [408,   4,  16 *4]
Split                    splitncnn_8                        0.01ms    |
Pooling                  GlobalAveragePool_4                0.23ms    |     [408,   4,  16 *4] -> [ 16 *4]
Convolution              Conv_29                            0.06ms    |               [ 16 *4] -> [ 16 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_10                            0.02ms    |               [ 16 *1] -> [  4 *4]
Convolution              Conv_30                            0.22ms    |               [  4 *4] -> [ 64 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_4                      0.02ms    |               [ 64 *1] -> [ 16 *4]
BinaryOp                 Mul_13                             0.35ms    |
Convolution              Conv_31                            1.17ms    |     [408,   4,  16 *4] -> [408,   4,   6 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_22                             0.31ms    |
Split                    splitncnn_9                        0.01ms    |
Convolution              Conv_32                            0.92ms    |     [408,   4,   6 *4] -> [408,   4,  18 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_9                              0.43ms    |     [408,   4,  18 *4] -> [408,   4,  18 *4]
ConvolutionDepthWise     Conv_33                            1.68ms    |     [408,   4,  18 *4] -> [408,   4,  18 *4]         kernel: 5 x 5     stride: 1 x 1
HardSwish                Div_10                             0.45ms    |     [408,   4,  18 *4] -> [408,   4,  18 *4]
Split                    splitncnn_10                       0.03ms    |
Pooling                  GlobalAveragePool_5                0.30ms    |     [408,   4,  18 *4] -> [ 18 *4]
Convolution              Conv_34                            0.23ms    |               [ 18 *4] -> [ 18 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_11                            0.02ms    |               [ 18 *1] -> [ 18 *1]
Convolution              Conv_35                            0.20ms    |               [ 18 *1] -> [ 72 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_5                      0.02ms    |               [ 72 *1] -> [ 18 *4]
BinaryOp                 Mul_16                             0.46ms    |
Convolution              Conv_36                            1.21ms    |     [408,   4,  18 *4] -> [408,   4,   6 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_27                             0.32ms    |
Convolution              Conv_37                            1.15ms    |     [408,   4,   6 *4] -> [408,   4,  36 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_11                             0.83ms    |     [408,   4,  36 *4] -> [408,   4,  36 *4]
ConvolutionDepthWise     Conv_38                            3.57ms    |     [408,   4,  36 *4] -> [408,   2,  36 *4]         kernel: 5 x 5     stride: 1 x 2
HardSwish                Div_12                             0.55ms    |     [408,   2,  36 *4] -> [408,   2,  36 *4]
Split                    splitncnn_11                       0.02ms    |
Pooling                  GlobalAveragePool_6                0.30ms    |     [408,   2,  36 *4] -> [ 36 *4]
Convolution              Conv_39                            0.07ms    |               [ 36 *4] -> [ 36 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_12                            0.19ms    |               [ 36 *1] -> [  9 *4]
Convolution              Conv_40                            0.23ms    |               [  9 *4] -> [144 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_6                      0.05ms    |               [144 *1] -> [ 36 *4]
BinaryOp                 Mul_19                             0.34ms    |
Convolution              Conv_41                            1.87ms    |     [408,   2,  36 *4] -> [408,   2,  12 *4]         kernel: 1 x 1     stride: 1 x 1
Split                    splitncnn_12                       0.01ms    |
Convolution              Conv_42                            1.67ms    |     [408,   2,  12 *4] -> [408,   2,  72 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_13                             0.81ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]
ConvolutionDepthWise     Conv_43                            3.12ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]         kernel: 5 x 5     stride: 1 x 1
HardSwish                Div_14                             0.91ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]
Split                    splitncnn_13                       0.02ms    |
Pooling                  GlobalAveragePool_7                0.38ms    |     [408,   2,  72 *4] -> [ 72 *4]
Convolution              Conv_44                            0.09ms    |               [ 72 *4] -> [ 72 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_13                            0.15ms    |               [ 72 *1] -> [ 18 *4]
Convolution              Conv_45                            0.07ms    |               [ 18 *4] -> [288 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_7                      0.19ms    |               [288 *1] -> [ 72 *4]
BinaryOp                 Mul_22                             0.69ms    |
Convolution              Conv_46                            2.49ms    |     [408,   2,  72 *4] -> [408,   2,  12 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_36                             0.36ms    |
Split                    splitncnn_14                       0.01ms    |
Convolution              Conv_47                            1.69ms    |     [408,   2,  12 *4] -> [408,   2,  72 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_15                             0.84ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]
ConvolutionDepthWise     Conv_48                            3.14ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]         kernel: 5 x 5     stride: 1 x 1
HardSwish                Div_16                             0.78ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]
Split                    splitncnn_15                       0.01ms    |
Pooling                  GlobalAveragePool_8                0.37ms    |     [408,   2,  72 *4] -> [ 72 *4]
Convolution              Conv_49                            0.09ms    |               [ 72 *4] -> [ 72 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_14                            0.07ms    |               [ 72 *1] -> [ 18 *4]
Convolution              Conv_50                            0.28ms    |               [ 18 *4] -> [288 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_8                      0.06ms    |               [288 *1] -> [ 72 *4]
BinaryOp                 Mul_25                             0.76ms    |
Convolution              Conv_51                            2.35ms    |     [408,   2,  72 *4] -> [408,   2,  12 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_41                             0.34ms    |
Convolution              Conv_52                            1.80ms    |     [408,   2,  12 *4] -> [408,   2,  72 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_17                             0.94ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]
Pooling                  MaxPool_0                          0.53ms    |     [408,   2,  72 *4] -> [204,   1,  72 *4]
Squeeze                  Squeeze_0                          0.40ms    |     [204,   1,  72 *4] -> [204, 288 *1]
Permute                  Transpose_3                        0.72ms    |          [204, 288 *1] -> [288, 204 *1]
LSTM                     LSTM_0                            47.92ms    |
LSTM                     LSTM_4                            19.72ms    |
InnerProduct             MatMul_0                          47.91ms    |          [ 96, 204 *1] -> [6625,  51 *4]
MemoryData               ctc_fc_b_attr                      0.02ms    |
BinaryOp                 Add_43                             4.10ms    |
Softmax                  Softmax_0                         10.79ms    |         [6625,  51 *4] -> [6625,  51 *4]
           816x32  avg =  226.51
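
Per-layer lines in this layout can be compared mechanically between two runs instead of by eye. The following is a minimal sketch in Python, assuming every per-layer line starts with the layer type, the layer name, and a time ending in `ms`, as in the log above. The helper names `parse_layer_times` and `diff_layer_times` are hypothetical and are not part of ncnn; the sample lines are copied from the old and new logs in this report.

```python
import re

# Matches per-layer lines such as:
#   "Convolution   Conv_1   0.31ms    |  [408, 16, 2 *4] -> ..."
# Summary lines like "816x32  avg =  226.51" do not match and are skipped.
LINE_RE = re.compile(r"^\s*(\S+)\s+(\S+)\s+([0-9]+(?:\.[0-9]+)?)ms\b")


def parse_layer_times(log_text):
    """Return {layer_name: (layer_type, milliseconds)} for each per-layer line."""
    times = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            layer_type, name, ms = match.groups()
            times[name] = (layer_type, float(ms))
    return times


def diff_layer_times(old_log, new_log):
    """Return (name, type, old_ms, new_ms, delta_ms) rows, largest slowdown first.

    Only layers present in both logs are compared.
    """
    old = parse_layer_times(old_log)
    new = parse_layer_times(new_log)
    rows = []
    for name, (layer_type, old_ms) in old.items():
        if name in new:
            new_ms = new[name][1]
            rows.append((name, layer_type, old_ms, new_ms, round(new_ms - old_ms, 2)))
    rows.sort(key=lambda row: row[4], reverse=True)
    return rows


# A few lines taken from the two logs in this report.
OLD_SAMPLE = """\
Convolution              Conv_1                             0.31ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]         kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_0                             0.10ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]
LSTM                     LSTM_0                            47.92ms    |
           816x32  avg =  226.51
"""

NEW_SAMPLE = """\
Convolution              Conv_1                             0.63ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]         kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_0                             0.40ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]
LSTM                     LSTM_0                            45.92ms    |
"""

if __name__ == "__main__":
    for name, layer_type, old_ms, new_ms, delta in diff_layer_times(OLD_SAMPLE, NEW_SAMPLE):
        print(f"{name:<12} {layer_type:<22} {old_ms:>7.2f} {new_ms:>7.2f} {delta:>+7.2f}")
```

Sorting by the difference puts the layers that slowed down the most at the top, which makes it easier to see whether the regression is concentrated in a few operators or spread across many small layers. Single-run per-layer timings are noisy, so differences of a few hundredths of a millisecond should not be read as significant.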

New version test results: 816x32

Convolution              Conv_0                             0.70ms    |     [816,  32,   3 *1] -> [408,  16,   2 *4]         kernel: 3 x 3     stride: 2 x 2
HardSwish                Div_0                              0.44ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]
Convolution              Conv_1                             0.63ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]         kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_0                             0.40ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]
ConvolutionDepthWise     Conv_2                             0.70ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]         kernel: 3 x 3     stride: 1 x 1
ReLU                     Relu_1                             0.09ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]
Split                    splitncnn_0                        0.02ms    |
Pooling                  GlobalAveragePool_0                0.36ms    |     [408,  16,   2 *4] -> [  2 *4]
Convolution              Conv_3                             0.42ms    |               [  2 *4] -> [  2 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_2                             0.03ms    |               [  2 *1] -> [  2 *1]
Convolution              Conv_4                             0.03ms    |               [  2 *1] -> [  8 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_0                      0.02ms    |               [  8 *1] -> [  2 *4]
BinaryOp                 Mul_1                              0.40ms    |
Convolution              Conv_5                             0.80ms    |     [408,  16,   2 *4] -> [408,  16,   2 *4]         kernel: 1 x 1     stride: 1 x 1
Convolution              Conv_6                             0.95ms    |     [408,  16,   2 *4] -> [408,  16,  10 *4]         kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_3                             0.95ms    |     [408,  16,  10 *4] -> [408,  16,  10 *4]
ConvolutionDepthWise     Conv_7                             2.55ms    |     [408,  16,  10 *4] -> [408,   8,  10 *4]         kernel: 3 x 3     stride: 1 x 2
ReLU                     Relu_4                             0.63ms    |     [408,   8,  10 *4] -> [408,   8,  10 *4]
Convolution              Conv_8                             0.96ms    |     [408,   8,  10 *4] -> [408,   8,   4 *4]         kernel: 1 x 1     stride: 1 x 1
Split                    splitncnn_1                        0.01ms    |
Convolution              Conv_9                             0.93ms    |     [408,   8,   4 *4] -> [408,   8,  12 *4]         kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_5                             0.65ms    |     [408,   8,  12 *4] -> [408,   8,  12 *4]
ConvolutionDepthWise     Conv_10                            1.25ms    |     [408,   8,  12 *4] -> [408,   8,  12 *4]         kernel: 3 x 3     stride: 1 x 1
ReLU                     Relu_6                             0.66ms    |     [408,   8,  12 *4] -> [408,   8,  12 *4]
Convolution              Conv_11                            1.05ms    |     [408,   8,  12 *4] -> [408,   8,   4 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_3                              0.49ms    |
Convolution              Conv_12                            1.16ms    |     [408,   8,   4 *4] -> [408,   8,  12 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_1                              0.81ms    |     [408,   8,  12 *4] -> [408,   8,  12 *4]
ConvolutionDepthWise     Conv_13                            2.40ms    |     [408,   8,  12 *4] -> [408,   4,  12 *4]         kernel: 5 x 5     stride: 1 x 2
HardSwish                Div_2                              0.59ms    |     [408,   4,  12 *4] -> [408,   4,  12 *4]
Split                    splitncnn_2                        0.02ms    |
Pooling                  GlobalAveragePool_1                0.29ms    |     [408,   4,  12 *4] -> [ 12 *4]
Convolution              Conv_14                            0.05ms    |               [ 12 *4] -> [ 12 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_7                             0.26ms    |               [ 12 *1] -> [  3 *4]
Convolution              Conv_15                            0.04ms    |               [  3 *4] -> [ 48 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_1                      0.23ms    |               [ 48 *1] -> [ 12 *4]
BinaryOp                 Mul_4                              0.49ms    |
Convolution              Conv_16                            0.94ms    |     [408,   4,  12 *4] -> [408,   4,   6 *4]         kernel: 1 x 1     stride: 1 x 1
Split                    splitncnn_3                        0.01ms    |
Convolution              Conv_17                            1.25ms    |     [408,   4,   6 *4] -> [408,   4,  30 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_3                              1.02ms    |     [408,   4,  30 *4] -> [408,   4,  30 *4]
ConvolutionDepthWise     Conv_18                            2.28ms    |     [408,   4,  30 *4] -> [408,   4,  30 *4]         kernel: 5 x 5     stride: 1 x 1
HardSwish                Div_4                              0.96ms    |     [408,   4,  30 *4] -> [408,   4,  30 *4]
Split                    splitncnn_4                        0.01ms    |
Pooling                  GlobalAveragePool_2                0.38ms    |     [408,   4,  30 *4] -> [ 30 *4]
Convolution              Conv_19                            0.28ms    |               [ 30 *4] -> [ 30 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_8                             0.02ms    |               [ 30 *1] -> [ 30 *1]
Convolution              Conv_20                            0.27ms    |               [ 30 *1] -> [120 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_2                      0.02ms    |               [120 *1] -> [ 30 *4]
BinaryOp                 Mul_7                              0.87ms    |
Convolution              Conv_21                            1.65ms    |     [408,   4,  30 *4] -> [408,   4,   6 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_12                             0.52ms    |
Split                    splitncnn_5                        0.01ms    |
Convolution              Conv_22                            1.21ms    |     [408,   4,   6 *4] -> [408,   4,  30 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_5                              1.00ms    |     [408,   4,  30 *4] -> [408,   4,  30 *4]
ConvolutionDepthWise     Conv_23                            2.33ms    |     [408,   4,  30 *4] -> [408,   4,  30 *4]         kernel: 5 x 5     stride: 1 x 1
HardSwish                Div_6                              0.90ms    |     [408,   4,  30 *4] -> [408,   4,  30 *4]
Split                    splitncnn_6                        0.01ms    |
Pooling                  GlobalAveragePool_3                0.42ms    |     [408,   4,  30 *4] -> [ 30 *4]
Convolution              Conv_24                            0.30ms    |               [ 30 *4] -> [ 30 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_9                             0.02ms    |               [ 30 *1] -> [ 30 *1]
Convolution              Conv_25                            0.28ms    |               [ 30 *1] -> [120 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_3                      0.24ms    |               [120 *1] -> [ 30 *4]
BinaryOp                 Mul_10                             0.84ms    |
Convolution              Conv_26                            1.64ms    |     [408,   4,  30 *4] -> [408,   4,   6 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_17                             0.53ms    |
Split                    splitncnn_7                        0.01ms    |
Convolution              Conv_27                            1.01ms    |     [408,   4,   6 *4] -> [408,   4,  16 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_7                              0.64ms    |     [408,   4,  16 *4] -> [408,   4,  16 *4]
ConvolutionDepthWise     Conv_28                            1.37ms    |     [408,   4,  16 *4] -> [408,   4,  16 *4]         kernel: 5 x 5     stride: 1 x 1
HardSwish                Div_8                              0.64ms    |     [408,   4,  16 *4] -> [408,   4,  16 *4]
Split                    splitncnn_8                        0.01ms    |
Pooling                  GlobalAveragePool_4                0.53ms    |     [408,   4,  16 *4] -> [ 16 *4]
Convolution              Conv_29                            0.26ms    |               [ 16 *4] -> [ 16 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_10                            0.23ms    |               [ 16 *1] -> [  4 *4]
Convolution              Conv_30                            0.04ms    |               [  4 *4] -> [ 64 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_4                      0.22ms    |               [ 64 *1] -> [ 16 *4]
BinaryOp                 Mul_13                             0.53ms    |
Convolution              Conv_31                            1.31ms    |     [408,   4,  16 *4] -> [408,   4,   6 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_22                             0.53ms    |
Split                    splitncnn_9                        0.01ms    |
Convolution              Conv_32                            1.02ms    |     [408,   4,   6 *4] -> [408,   4,  18 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_9                              0.68ms    |     [408,   4,  18 *4] -> [408,   4,  18 *4]
ConvolutionDepthWise     Conv_33                            1.75ms    |     [408,   4,  18 *4] -> [408,   4,  18 *4]         kernel: 5 x 5     stride: 1 x 1
HardSwish                Div_10                             0.75ms    |     [408,   4,  18 *4] -> [408,   4,  18 *4]
Split                    splitncnn_10                       0.01ms    |
Pooling                  GlobalAveragePool_5                0.30ms    |     [408,   4,  18 *4] -> [ 18 *4]
Convolution              Conv_34                            0.29ms    |               [ 18 *4] -> [ 18 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_11                            0.02ms    |               [ 18 *1] -> [ 18 *1]
Convolution              Conv_35                            0.03ms    |               [ 18 *1] -> [ 72 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_5                      0.02ms    |               [ 72 *1] -> [ 18 *4]
BinaryOp                 Mul_16                             0.58ms    |
Convolution              Conv_36                            1.41ms    |     [408,   4,  18 *4] -> [408,   4,   6 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_27                             0.39ms    |
Convolution              Conv_37                            1.39ms    |     [408,   4,   6 *4] -> [408,   4,  36 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_11                             1.06ms    |     [408,   4,  36 *4] -> [408,   4,  36 *4]
ConvolutionDepthWise     Conv_38                            3.80ms    |     [408,   4,  36 *4] -> [408,   2,  36 *4]         kernel: 5 x 5     stride: 1 x 2
HardSwish                Div_12                             0.69ms    |     [408,   2,  36 *4] -> [408,   2,  36 *4]
Split                    splitncnn_11                       0.01ms    |
Pooling                  GlobalAveragePool_6                0.57ms    |     [408,   2,  36 *4] -> [ 36 *4]
Convolution              Conv_39                            0.27ms    |               [ 36 *4] -> [ 36 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_12                            0.25ms    |               [ 36 *1] -> [  9 *4]
Convolution              Conv_40                            0.25ms    |               [  9 *4] -> [144 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_6                      0.25ms    |               [144 *1] -> [ 36 *4]
BinaryOp                 Mul_19                             0.55ms    |
Convolution              Conv_41                            1.59ms    |     [408,   2,  36 *4] -> [408,   2,  12 *4]         kernel: 1 x 1     stride: 1 x 1
Split                    splitncnn_12                       0.01ms    |
Convolution              Conv_42                            1.84ms    |     [408,   2,  12 *4] -> [408,   2,  72 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_13                             1.03ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]
ConvolutionDepthWise     Conv_43                            3.07ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]         kernel: 5 x 5     stride: 1 x 1
HardSwish                Div_14                             1.02ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]
Split                    splitncnn_13                       0.01ms    |
Pooling                  GlobalAveragePool_7                0.45ms    |     [408,   2,  72 *4] -> [ 72 *4]
Convolution              Conv_44                            0.35ms    |               [ 72 *4] -> [ 72 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_13                            0.02ms    |               [ 72 *1] -> [ 18 *4]
Convolution              Conv_45                            0.32ms    |               [ 18 *4] -> [288 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_7                      0.25ms    |               [288 *1] -> [ 72 *4]
BinaryOp                 Mul_22                             0.82ms    |
Convolution              Conv_46                            2.82ms    |     [408,   2,  72 *4] -> [408,   2,  12 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_36                             0.39ms    |
Split                    splitncnn_14                       0.01ms    |
Convolution              Conv_47                            1.96ms    |     [408,   2,  12 *4] -> [408,   2,  72 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_15                             1.07ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]
ConvolutionDepthWise     Conv_48                            3.19ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]         kernel: 5 x 5     stride: 1 x 1
HardSwish                Div_16                             1.04ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]
Split                    splitncnn_15                       0.01ms    |
Pooling                  GlobalAveragePool_8                0.72ms    |     [408,   2,  72 *4] -> [ 72 *4]
Convolution              Conv_49                            0.11ms    |               [ 72 *4] -> [ 72 *1]                   kernel: 1 x 1     stride: 1 x 1
ReLU                     Relu_14                            0.28ms    |               [ 72 *1] -> [ 18 *4]
Convolution              Conv_50                            0.07ms    |               [ 18 *4] -> [288 *1]                   kernel: 1 x 1     stride: 1 x 1
HardSigmoid              HardSigmoid_8                      0.02ms    |               [288 *1] -> [ 72 *4]
BinaryOp                 Mul_25                             0.93ms    |
Convolution              Conv_51                            2.38ms    |     [408,   2,  72 *4] -> [408,   2,  12 *4]         kernel: 1 x 1     stride: 1 x 1
BinaryOp                 Add_41                             0.39ms    |
Convolution              Conv_52                            1.97ms    |     [408,   2,  12 *4] -> [408,   2,  72 *4]         kernel: 1 x 1     stride: 1 x 1
HardSwish                Div_17                             1.09ms    |     [408,   2,  72 *4] -> [408,   2,  72 *4]
Pooling                  MaxPool_0                          0.54ms    |     [408,   2,  72 *4] -> [204,   1,  72 *4]
Squeeze                  Squeeze_0                          0.50ms    |     [204,   1,  72 *4] -> [204, 288 *1]
Permute                  Transpose_3                        0.63ms    |          [204, 288 *1] -> [288, 204 *1]
LSTM                     LSTM_0                            45.92ms    |
LSTM                     LSTM_4                            18.50ms    |
InnerProduct             MatMul_0                          71.96ms    |          [ 96, 204 *1] -> [6625,  51 *4]
MemoryData               ctc_fc_b_attr                      0.02ms    |
BinaryOp                 Add_43                             3.69ms    |
Softmax                  Softmax_0                         10.24ms    |         [6625,  51 *4] -> [6625,  51 *4]
       816x32  avg =  268.85
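
The per-layer log above is long, so it can help to rank layers by time mechanically rather than by eye. The following is a minimal sketch, assuming each layer line has the shape `<type> <name> <time>ms |` as printed above; the function names `parse_layer_times`, `slowest_layers`, and `compare_runs` are hypothetical helpers and are not part of ncnn.

```python
import re

# Matches per-layer lines such as:
#   "InnerProduct   MatMul_0   71.96ms    |   [ 96, 204 *1] -> [6625,  51 *4]"
# The summary line ("816x32  avg = ...") has no "ms |" field and is skipped.
LAYER_LINE = re.compile(r"^\s*(\S+)\s+(\S+)\s+([0-9]+(?:\.[0-9]+)?)ms\s*\|")


def parse_layer_times(log_text):
    """Return a list of (layer_type, layer_name, milliseconds) tuples."""
    rows = []
    for line in log_text.splitlines():
        match = LAYER_LINE.match(line)
        if match:
            rows.append((match.group(1), match.group(2), float(match.group(3))))
    return rows


def slowest_layers(rows, n=5):
    """Return the n rows with the largest time, slowest first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]


def compare_runs(old_rows, new_rows):
    """Pair layers by name and return (name, old_ms, new_ms, delta_ms),
    sorted so the largest slowdown comes first.
    Layers that appear in only one of the two logs are ignored."""
    old_by_name = {name: ms for _, name, ms in old_rows}
    paired = []
    for _, name, new_ms in new_rows:
        if name in old_by_name:
            old_ms = old_by_name[name]
            paired.append((name, old_ms, new_ms, new_ms - old_ms))
    return sorted(paired, key=lambda row: row[3], reverse=True)
```

Feeding the old-version log and the new-version log to `parse_layer_times` and passing both results to `compare_runs` would list the layers whose time grew the most at the top.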

Per-layer timing comparison chart (selected layers)

Build / runtime environment

Build environment: Ubuntu 18.04 / android-ndk-r19c / cmake 3.10.2
Runtime environment: Android 8.1 / Spreadtrum SC9832E / 4 x ARM Cortex-A53 @ 1400 MHz
Old ncnn commit: 6b2495cc243f2d8e829523b700f32db1f5d50f78
New ncnn commit: 6fd801b6d76af8e0ed7f5f3f3b088855f832996b

Steps to reproduce

1. Edit CMakeLists.txt

option(NCNN_BENCHMARK "print benchmark information for every layer" ON)

2. Edit benchmark/benchncnn.cpp

......
ncnn::Extractor ex = net.create_extractor();
ex.input(input_names[0], in);
ex.extract(output_names[output_names.size()-1], out);
......
benchmark("test_model", ncnn::Mat(816, 32, 3), opt);

3. Build

cd $WORK_DIR/ncnn
mkdir -p build-android-armv7
cd build-android-armv7
rm -rf *
cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI="armeabi-v7a" -DANDROID_ARM_NEON=ON -DANDROID_PLATFORM=android-16 ..
make -j8

4. Run

adb push test_model.param /data/local/tmp/
adb push benchncnn /data/local/tmp/
adb shell "chmod 0755 /data/local/tmp/benchncnn"
adb shell "cd /data/local/tmp; ./benchncnn 1 4 0 -1 1"

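5. Inspect the printed log

Other

The test model file is attached: test_model.zip. With the program built from the new version, total network inference time is longer, and the gap for layer MatMul_0 is especially large. Could you please help look into the cause?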

May 17 '22 10:05 bestpower

https://github.com/Tencent/ncnn/pull/3799 After verification, this turned out to be caused by the compiler optimizing poorly for armv7, so I rewrote the code.

May 17 '22 13:05 nihui

#3799 After verification, this turned out to be caused by the compiler optimizing poorly for armv7, so I rewrote the code.

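In my tests, the time for layer MatMul_0 has indeed come down and is now close to the old version. However, the average time over 100 runs of the benchmark program on this model is very unstable: with the same model, the same input, the same number of iterations, and the same device, the averages from repeated benchmark runs often differ by several tens of milliseconds. The Android software environment is essentially clean, with no resource-hungry software installed.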

May 18 '22 07:05 bestpower
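
When averages from repeated runs differ by tens of milliseconds, looking at more than the mean can show whether a few slow iterations are responsible. The following is a minimal sketch, assuming the per-iteration times have already been collected into a list of milliseconds; `summarize_runs` is a hypothetical helper, not part of ncnn or benchncnn, and it does not diagnose the cause of the variation.

```python
import statistics


def summarize_runs(times_ms):
    """Summarize a list of per-iteration times (milliseconds).

    The minimum and median are less sensitive to occasional slow
    iterations than the mean, so a large gap between median and mean
    points to a few outliers rather than a uniformly slower run.
    """
    if not times_ms:
        raise ValueError("times_ms must contain at least one measurement")
    ordered = sorted(times_ms)
    mean = statistics.mean(ordered)
    # Sample standard deviation needs at least two measurements.
    stdev = statistics.stdev(ordered) if len(ordered) > 1 else 0.0
    return {
        "n": len(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "mean": mean,
        "max": ordered[-1],
        "stdev": stdev,
        # Coefficient of variation: spread relative to the mean, in percent.
        "cv_percent": 100.0 * stdev / mean if mean else 0.0,
    }
```

Comparing the `median` and `min` fields across benchmark sessions, instead of only the average, is one way to check whether the instability comes from a handful of slow iterations.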

In my tests, ConvolutionDepthWise takes noticeably longer than Convolution???

Jun 24 '22 01:06 w1005444804

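This issue may have several causes and is hard to track down, and it may not be an ncnn problem at all. At the application level the difference is generally not noticeable, so let's shelve it for now.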

Jun 02 '23 08:06 bestpower
