ncnn
Benchmark program built from the latest update shows a performance regression
Log or error message
Old version test result: 816x32
Convolution Conv_0 0.76ms | [816, 32, 3 *1] -> [408, 16, 2 *4] kernel: 3 x 3 stride: 2 x 2
HardSwish Div_0 0.36ms | [408, 16, 2 *4] -> [408, 16, 2 *4]
Convolution Conv_1 0.31ms | [408, 16, 2 *4] -> [408, 16, 2 *4] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_0 0.10ms | [408, 16, 2 *4] -> [408, 16, 2 *4]
ConvolutionDepthWise Conv_2 0.57ms | [408, 16, 2 *4] -> [408, 16, 2 *4] kernel: 3 x 3 stride: 1 x 1
ReLU Relu_1 0.09ms | [408, 16, 2 *4] -> [408, 16, 2 *4]
Split splitncnn_0 0.01ms |
Pooling GlobalAveragePool_0 0.08ms | [408, 16, 2 *4] -> [ 2 *4]
Convolution Conv_3 0.04ms | [ 2 *4] -> [ 2 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_2 0.01ms | [ 2 *1] -> [ 2 *1]
Convolution Conv_4 0.03ms | [ 2 *1] -> [ 8 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_0 0.02ms | [ 8 *1] -> [ 2 *4]
BinaryOp Mul_1 0.13ms |
Convolution Conv_5 0.48ms | [408, 16, 2 *4] -> [408, 16, 2 *4] kernel: 1 x 1 stride: 1 x 1
Convolution Conv_6 0.57ms | [408, 16, 2 *4] -> [408, 16, 10 *4] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_3 0.68ms | [408, 16, 10 *4] -> [408, 16, 10 *4]
ConvolutionDepthWise Conv_7 2.53ms | [408, 16, 10 *4] -> [408, 8, 10 *4] kernel: 3 x 3 stride: 1 x 2
ReLU Relu_4 0.47ms | [408, 8, 10 *4] -> [408, 8, 10 *4]
Convolution Conv_8 0.79ms | [408, 8, 10 *4] -> [408, 8, 4 *4] kernel: 1 x 1 stride: 1 x 1
Split splitncnn_1 0.01ms |
Convolution Conv_9 0.69ms | [408, 8, 4 *4] -> [408, 8, 12 *4] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_5 0.52ms | [408, 8, 12 *4] -> [408, 8, 12 *4]
ConvolutionDepthWise Conv_10 1.01ms | [408, 8, 12 *4] -> [408, 8, 12 *4] kernel: 3 x 3 stride: 1 x 1
ReLU Relu_6 0.30ms | [408, 8, 12 *4] -> [408, 8, 12 *4]
Convolution Conv_11 1.04ms | [408, 8, 12 *4] -> [408, 8, 4 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_3 0.24ms |
Convolution Conv_12 0.78ms | [408, 8, 4 *4] -> [408, 8, 12 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_1 0.33ms | [408, 8, 12 *4] -> [408, 8, 12 *4]
ConvolutionDepthWise Conv_13 2.41ms | [408, 8, 12 *4] -> [408, 4, 12 *4] kernel: 5 x 5 stride: 1 x 2
HardSwish Div_2 0.34ms | [408, 4, 12 *4] -> [408, 4, 12 *4]
Split splitncnn_2 0.02ms |
Pooling GlobalAveragePool_1 0.13ms | [408, 4, 12 *4] -> [ 12 *4]
Convolution Conv_14 0.15ms | [ 12 *4] -> [ 12 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_7 0.02ms | [ 12 *1] -> [ 3 *4]
Convolution Conv_15 0.03ms | [ 3 *4] -> [ 48 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_1 0.21ms | [ 48 *1] -> [ 12 *4]
BinaryOp Mul_4 0.19ms |
Convolution Conv_16 0.91ms | [408, 4, 12 *4] -> [408, 4, 6 *4] kernel: 1 x 1 stride: 1 x 1
Split splitncnn_3 0.01ms |
Convolution Conv_17 1.16ms | [408, 4, 6 *4] -> [408, 4, 30 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_3 0.61ms | [408, 4, 30 *4] -> [408, 4, 30 *4]
ConvolutionDepthWise Conv_18 2.28ms | [408, 4, 30 *4] -> [408, 4, 30 *4] kernel: 5 x 5 stride: 1 x 1
HardSwish Div_4 0.76ms | [408, 4, 30 *4] -> [408, 4, 30 *4]
Split splitncnn_4 0.01ms |
Pooling GlobalAveragePool_2 0.31ms | [408, 4, 30 *4] -> [ 30 *4]
Convolution Conv_19 0.14ms | [ 30 *4] -> [ 30 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_8 0.15ms | [ 30 *1] -> [ 30 *1]
Convolution Conv_20 0.11ms | [ 30 *1] -> [120 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_2 0.09ms | [120 *1] -> [ 30 *4]
BinaryOp Mul_7 0.68ms |
Convolution Conv_21 2.02ms | [408, 4, 30 *4] -> [408, 4, 6 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_12 0.32ms |
Split splitncnn_5 0.01ms |
Convolution Conv_22 1.17ms | [408, 4, 6 *4] -> [408, 4, 30 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_5 0.80ms | [408, 4, 30 *4] -> [408, 4, 30 *4]
ConvolutionDepthWise Conv_23 2.23ms | [408, 4, 30 *4] -> [408, 4, 30 *4] kernel: 5 x 5 stride: 1 x 1
HardSwish Div_6 0.80ms | [408, 4, 30 *4] -> [408, 4, 30 *4]
Split splitncnn_6 0.01ms |
Pooling GlobalAveragePool_3 0.29ms | [408, 4, 30 *4] -> [ 30 *4]
Convolution Conv_24 0.23ms | [ 30 *4] -> [ 30 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_9 0.02ms | [ 30 *1] -> [ 30 *1]
Convolution Conv_25 0.23ms | [ 30 *1] -> [120 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_3 0.02ms | [120 *1] -> [ 30 *4]
BinaryOp Mul_10 0.60ms |
Convolution Conv_26 1.59ms | [408, 4, 30 *4] -> [408, 4, 6 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_17 0.32ms |
Split splitncnn_7 0.01ms |
Convolution Conv_27 0.76ms | [408, 4, 6 *4] -> [408, 4, 16 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_7 0.36ms | [408, 4, 16 *4] -> [408, 4, 16 *4]
ConvolutionDepthWise Conv_28 1.51ms | [408, 4, 16 *4] -> [408, 4, 16 *4] kernel: 5 x 5 stride: 1 x 1
HardSwish Div_8 0.41ms | [408, 4, 16 *4] -> [408, 4, 16 *4]
Split splitncnn_8 0.01ms |
Pooling GlobalAveragePool_4 0.23ms | [408, 4, 16 *4] -> [ 16 *4]
Convolution Conv_29 0.06ms | [ 16 *4] -> [ 16 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_10 0.02ms | [ 16 *1] -> [ 4 *4]
Convolution Conv_30 0.22ms | [ 4 *4] -> [ 64 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_4 0.02ms | [ 64 *1] -> [ 16 *4]
BinaryOp Mul_13 0.35ms |
Convolution Conv_31 1.17ms | [408, 4, 16 *4] -> [408, 4, 6 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_22 0.31ms |
Split splitncnn_9 0.01ms |
Convolution Conv_32 0.92ms | [408, 4, 6 *4] -> [408, 4, 18 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_9 0.43ms | [408, 4, 18 *4] -> [408, 4, 18 *4]
ConvolutionDepthWise Conv_33 1.68ms | [408, 4, 18 *4] -> [408, 4, 18 *4] kernel: 5 x 5 stride: 1 x 1
HardSwish Div_10 0.45ms | [408, 4, 18 *4] -> [408, 4, 18 *4]
Split splitncnn_10 0.03ms |
Pooling GlobalAveragePool_5 0.30ms | [408, 4, 18 *4] -> [ 18 *4]
Convolution Conv_34 0.23ms | [ 18 *4] -> [ 18 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_11 0.02ms | [ 18 *1] -> [ 18 *1]
Convolution Conv_35 0.20ms | [ 18 *1] -> [ 72 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_5 0.02ms | [ 72 *1] -> [ 18 *4]
BinaryOp Mul_16 0.46ms |
Convolution Conv_36 1.21ms | [408, 4, 18 *4] -> [408, 4, 6 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_27 0.32ms |
Convolution Conv_37 1.15ms | [408, 4, 6 *4] -> [408, 4, 36 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_11 0.83ms | [408, 4, 36 *4] -> [408, 4, 36 *4]
ConvolutionDepthWise Conv_38 3.57ms | [408, 4, 36 *4] -> [408, 2, 36 *4] kernel: 5 x 5 stride: 1 x 2
HardSwish Div_12 0.55ms | [408, 2, 36 *4] -> [408, 2, 36 *4]
Split splitncnn_11 0.02ms |
Pooling GlobalAveragePool_6 0.30ms | [408, 2, 36 *4] -> [ 36 *4]
Convolution Conv_39 0.07ms | [ 36 *4] -> [ 36 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_12 0.19ms | [ 36 *1] -> [ 9 *4]
Convolution Conv_40 0.23ms | [ 9 *4] -> [144 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_6 0.05ms | [144 *1] -> [ 36 *4]
BinaryOp Mul_19 0.34ms |
Convolution Conv_41 1.87ms | [408, 2, 36 *4] -> [408, 2, 12 *4] kernel: 1 x 1 stride: 1 x 1
Split splitncnn_12 0.01ms |
Convolution Conv_42 1.67ms | [408, 2, 12 *4] -> [408, 2, 72 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_13 0.81ms | [408, 2, 72 *4] -> [408, 2, 72 *4]
ConvolutionDepthWise Conv_43 3.12ms | [408, 2, 72 *4] -> [408, 2, 72 *4] kernel: 5 x 5 stride: 1 x 1
HardSwish Div_14 0.91ms | [408, 2, 72 *4] -> [408, 2, 72 *4]
Split splitncnn_13 0.02ms |
Pooling GlobalAveragePool_7 0.38ms | [408, 2, 72 *4] -> [ 72 *4]
Convolution Conv_44 0.09ms | [ 72 *4] -> [ 72 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_13 0.15ms | [ 72 *1] -> [ 18 *4]
Convolution Conv_45 0.07ms | [ 18 *4] -> [288 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_7 0.19ms | [288 *1] -> [ 72 *4]
BinaryOp Mul_22 0.69ms |
Convolution Conv_46 2.49ms | [408, 2, 72 *4] -> [408, 2, 12 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_36 0.36ms |
Split splitncnn_14 0.01ms |
Convolution Conv_47 1.69ms | [408, 2, 12 *4] -> [408, 2, 72 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_15 0.84ms | [408, 2, 72 *4] -> [408, 2, 72 *4]
ConvolutionDepthWise Conv_48 3.14ms | [408, 2, 72 *4] -> [408, 2, 72 *4] kernel: 5 x 5 stride: 1 x 1
HardSwish Div_16 0.78ms | [408, 2, 72 *4] -> [408, 2, 72 *4]
Split splitncnn_15 0.01ms |
Pooling GlobalAveragePool_8 0.37ms | [408, 2, 72 *4] -> [ 72 *4]
Convolution Conv_49 0.09ms | [ 72 *4] -> [ 72 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_14 0.07ms | [ 72 *1] -> [ 18 *4]
Convolution Conv_50 0.28ms | [ 18 *4] -> [288 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_8 0.06ms | [288 *1] -> [ 72 *4]
BinaryOp Mul_25 0.76ms |
Convolution Conv_51 2.35ms | [408, 2, 72 *4] -> [408, 2, 12 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_41 0.34ms |
Convolution Conv_52 1.80ms | [408, 2, 12 *4] -> [408, 2, 72 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_17 0.94ms | [408, 2, 72 *4] -> [408, 2, 72 *4]
Pooling MaxPool_0 0.53ms | [408, 2, 72 *4] -> [204, 1, 72 *4]
Squeeze Squeeze_0 0.40ms | [204, 1, 72 *4] -> [204, 288 *1]
Permute Transpose_3 0.72ms | [204, 288 *1] -> [288, 204 *1]
LSTM LSTM_0 47.92ms |
LSTM LSTM_4 19.72ms |
InnerProduct MatMul_0 47.91ms | [ 96, 204 *1] -> [6625, 51 *4]
MemoryData ctc_fc_b_attr 0.02ms |
BinaryOp Add_43 4.10ms |
Softmax Softmax_0 10.79ms | [6625, 51 *4] -> [6625, 51 *4]
816x32 avg = 226.51
New version test result: 816x32
Convolution Conv_0 0.70ms | [816, 32, 3 *1] -> [408, 16, 2 *4] kernel: 3 x 3 stride: 2 x 2
HardSwish Div_0 0.44ms | [408, 16, 2 *4] -> [408, 16, 2 *4]
Convolution Conv_1 0.63ms | [408, 16, 2 *4] -> [408, 16, 2 *4] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_0 0.40ms | [408, 16, 2 *4] -> [408, 16, 2 *4]
ConvolutionDepthWise Conv_2 0.70ms | [408, 16, 2 *4] -> [408, 16, 2 *4] kernel: 3 x 3 stride: 1 x 1
ReLU Relu_1 0.09ms | [408, 16, 2 *4] -> [408, 16, 2 *4]
Split splitncnn_0 0.02ms |
Pooling GlobalAveragePool_0 0.36ms | [408, 16, 2 *4] -> [ 2 *4]
Convolution Conv_3 0.42ms | [ 2 *4] -> [ 2 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_2 0.03ms | [ 2 *1] -> [ 2 *1]
Convolution Conv_4 0.03ms | [ 2 *1] -> [ 8 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_0 0.02ms | [ 8 *1] -> [ 2 *4]
BinaryOp Mul_1 0.40ms |
Convolution Conv_5 0.80ms | [408, 16, 2 *4] -> [408, 16, 2 *4] kernel: 1 x 1 stride: 1 x 1
Convolution Conv_6 0.95ms | [408, 16, 2 *4] -> [408, 16, 10 *4] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_3 0.95ms | [408, 16, 10 *4] -> [408, 16, 10 *4]
ConvolutionDepthWise Conv_7 2.55ms | [408, 16, 10 *4] -> [408, 8, 10 *4] kernel: 3 x 3 stride: 1 x 2
ReLU Relu_4 0.63ms | [408, 8, 10 *4] -> [408, 8, 10 *4]
Convolution Conv_8 0.96ms | [408, 8, 10 *4] -> [408, 8, 4 *4] kernel: 1 x 1 stride: 1 x 1
Split splitncnn_1 0.01ms |
Convolution Conv_9 0.93ms | [408, 8, 4 *4] -> [408, 8, 12 *4] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_5 0.65ms | [408, 8, 12 *4] -> [408, 8, 12 *4]
ConvolutionDepthWise Conv_10 1.25ms | [408, 8, 12 *4] -> [408, 8, 12 *4] kernel: 3 x 3 stride: 1 x 1
ReLU Relu_6 0.66ms | [408, 8, 12 *4] -> [408, 8, 12 *4]
Convolution Conv_11 1.05ms | [408, 8, 12 *4] -> [408, 8, 4 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_3 0.49ms |
Convolution Conv_12 1.16ms | [408, 8, 4 *4] -> [408, 8, 12 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_1 0.81ms | [408, 8, 12 *4] -> [408, 8, 12 *4]
ConvolutionDepthWise Conv_13 2.40ms | [408, 8, 12 *4] -> [408, 4, 12 *4] kernel: 5 x 5 stride: 1 x 2
HardSwish Div_2 0.59ms | [408, 4, 12 *4] -> [408, 4, 12 *4]
Split splitncnn_2 0.02ms |
Pooling GlobalAveragePool_1 0.29ms | [408, 4, 12 *4] -> [ 12 *4]
Convolution Conv_14 0.05ms | [ 12 *4] -> [ 12 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_7 0.26ms | [ 12 *1] -> [ 3 *4]
Convolution Conv_15 0.04ms | [ 3 *4] -> [ 48 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_1 0.23ms | [ 48 *1] -> [ 12 *4]
BinaryOp Mul_4 0.49ms |
Convolution Conv_16 0.94ms | [408, 4, 12 *4] -> [408, 4, 6 *4] kernel: 1 x 1 stride: 1 x 1
Split splitncnn_3 0.01ms |
Convolution Conv_17 1.25ms | [408, 4, 6 *4] -> [408, 4, 30 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_3 1.02ms | [408, 4, 30 *4] -> [408, 4, 30 *4]
ConvolutionDepthWise Conv_18 2.28ms | [408, 4, 30 *4] -> [408, 4, 30 *4] kernel: 5 x 5 stride: 1 x 1
HardSwish Div_4 0.96ms | [408, 4, 30 *4] -> [408, 4, 30 *4]
Split splitncnn_4 0.01ms |
Pooling GlobalAveragePool_2 0.38ms | [408, 4, 30 *4] -> [ 30 *4]
Convolution Conv_19 0.28ms | [ 30 *4] -> [ 30 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_8 0.02ms | [ 30 *1] -> [ 30 *1]
Convolution Conv_20 0.27ms | [ 30 *1] -> [120 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_2 0.02ms | [120 *1] -> [ 30 *4]
BinaryOp Mul_7 0.87ms |
Convolution Conv_21 1.65ms | [408, 4, 30 *4] -> [408, 4, 6 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_12 0.52ms |
Split splitncnn_5 0.01ms |
Convolution Conv_22 1.21ms | [408, 4, 6 *4] -> [408, 4, 30 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_5 1.00ms | [408, 4, 30 *4] -> [408, 4, 30 *4]
ConvolutionDepthWise Conv_23 2.33ms | [408, 4, 30 *4] -> [408, 4, 30 *4] kernel: 5 x 5 stride: 1 x 1
HardSwish Div_6 0.90ms | [408, 4, 30 *4] -> [408, 4, 30 *4]
Split splitncnn_6 0.01ms |
Pooling GlobalAveragePool_3 0.42ms | [408, 4, 30 *4] -> [ 30 *4]
Convolution Conv_24 0.30ms | [ 30 *4] -> [ 30 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_9 0.02ms | [ 30 *1] -> [ 30 *1]
Convolution Conv_25 0.28ms | [ 30 *1] -> [120 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_3 0.24ms | [120 *1] -> [ 30 *4]
BinaryOp Mul_10 0.84ms |
Convolution Conv_26 1.64ms | [408, 4, 30 *4] -> [408, 4, 6 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_17 0.53ms |
Split splitncnn_7 0.01ms |
Convolution Conv_27 1.01ms | [408, 4, 6 *4] -> [408, 4, 16 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_7 0.64ms | [408, 4, 16 *4] -> [408, 4, 16 *4]
ConvolutionDepthWise Conv_28 1.37ms | [408, 4, 16 *4] -> [408, 4, 16 *4] kernel: 5 x 5 stride: 1 x 1
HardSwish Div_8 0.64ms | [408, 4, 16 *4] -> [408, 4, 16 *4]
Split splitncnn_8 0.01ms |
Pooling GlobalAveragePool_4 0.53ms | [408, 4, 16 *4] -> [ 16 *4]
Convolution Conv_29 0.26ms | [ 16 *4] -> [ 16 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_10 0.23ms | [ 16 *1] -> [ 4 *4]
Convolution Conv_30 0.04ms | [ 4 *4] -> [ 64 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_4 0.22ms | [ 64 *1] -> [ 16 *4]
BinaryOp Mul_13 0.53ms |
Convolution Conv_31 1.31ms | [408, 4, 16 *4] -> [408, 4, 6 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_22 0.53ms |
Split splitncnn_9 0.01ms |
Convolution Conv_32 1.02ms | [408, 4, 6 *4] -> [408, 4, 18 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_9 0.68ms | [408, 4, 18 *4] -> [408, 4, 18 *4]
ConvolutionDepthWise Conv_33 1.75ms | [408, 4, 18 *4] -> [408, 4, 18 *4] kernel: 5 x 5 stride: 1 x 1
HardSwish Div_10 0.75ms | [408, 4, 18 *4] -> [408, 4, 18 *4]
Split splitncnn_10 0.01ms |
Pooling GlobalAveragePool_5 0.30ms | [408, 4, 18 *4] -> [ 18 *4]
Convolution Conv_34 0.29ms | [ 18 *4] -> [ 18 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_11 0.02ms | [ 18 *1] -> [ 18 *1]
Convolution Conv_35 0.03ms | [ 18 *1] -> [ 72 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_5 0.02ms | [ 72 *1] -> [ 18 *4]
BinaryOp Mul_16 0.58ms |
Convolution Conv_36 1.41ms | [408, 4, 18 *4] -> [408, 4, 6 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_27 0.39ms |
Convolution Conv_37 1.39ms | [408, 4, 6 *4] -> [408, 4, 36 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_11 1.06ms | [408, 4, 36 *4] -> [408, 4, 36 *4]
ConvolutionDepthWise Conv_38 3.80ms | [408, 4, 36 *4] -> [408, 2, 36 *4] kernel: 5 x 5 stride: 1 x 2
HardSwish Div_12 0.69ms | [408, 2, 36 *4] -> [408, 2, 36 *4]
Split splitncnn_11 0.01ms |
Pooling GlobalAveragePool_6 0.57ms | [408, 2, 36 *4] -> [ 36 *4]
Convolution Conv_39 0.27ms | [ 36 *4] -> [ 36 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_12 0.25ms | [ 36 *1] -> [ 9 *4]
Convolution Conv_40 0.25ms | [ 9 *4] -> [144 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_6 0.25ms | [144 *1] -> [ 36 *4]
BinaryOp Mul_19 0.55ms |
Convolution Conv_41 1.59ms | [408, 2, 36 *4] -> [408, 2, 12 *4] kernel: 1 x 1 stride: 1 x 1
Split splitncnn_12 0.01ms |
Convolution Conv_42 1.84ms | [408, 2, 12 *4] -> [408, 2, 72 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_13 1.03ms | [408, 2, 72 *4] -> [408, 2, 72 *4]
ConvolutionDepthWise Conv_43 3.07ms | [408, 2, 72 *4] -> [408, 2, 72 *4] kernel: 5 x 5 stride: 1 x 1
HardSwish Div_14 1.02ms | [408, 2, 72 *4] -> [408, 2, 72 *4]
Split splitncnn_13 0.01ms |
Pooling GlobalAveragePool_7 0.45ms | [408, 2, 72 *4] -> [ 72 *4]
Convolution Conv_44 0.35ms | [ 72 *4] -> [ 72 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_13 0.02ms | [ 72 *1] -> [ 18 *4]
Convolution Conv_45 0.32ms | [ 18 *4] -> [288 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_7 0.25ms | [288 *1] -> [ 72 *4]
BinaryOp Mul_22 0.82ms |
Convolution Conv_46 2.82ms | [408, 2, 72 *4] -> [408, 2, 12 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_36 0.39ms |
Split splitncnn_14 0.01ms |
Convolution Conv_47 1.96ms | [408, 2, 12 *4] -> [408, 2, 72 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_15 1.07ms | [408, 2, 72 *4] -> [408, 2, 72 *4]
ConvolutionDepthWise Conv_48 3.19ms | [408, 2, 72 *4] -> [408, 2, 72 *4] kernel: 5 x 5 stride: 1 x 1
HardSwish Div_16 1.04ms | [408, 2, 72 *4] -> [408, 2, 72 *4]
Split splitncnn_15 0.01ms |
Pooling GlobalAveragePool_8 0.72ms | [408, 2, 72 *4] -> [ 72 *4]
Convolution Conv_49 0.11ms | [ 72 *4] -> [ 72 *1] kernel: 1 x 1 stride: 1 x 1
ReLU Relu_14 0.28ms | [ 72 *1] -> [ 18 *4]
Convolution Conv_50 0.07ms | [ 18 *4] -> [288 *1] kernel: 1 x 1 stride: 1 x 1
HardSigmoid HardSigmoid_8 0.02ms | [288 *1] -> [ 72 *4]
BinaryOp Mul_25 0.93ms |
Convolution Conv_51 2.38ms | [408, 2, 72 *4] -> [408, 2, 12 *4] kernel: 1 x 1 stride: 1 x 1
BinaryOp Add_41 0.39ms |
Convolution Conv_52 1.97ms | [408, 2, 12 *4] -> [408, 2, 72 *4] kernel: 1 x 1 stride: 1 x 1
HardSwish Div_17 1.09ms | [408, 2, 72 *4] -> [408, 2, 72 *4]
Pooling MaxPool_0 0.54ms | [408, 2, 72 *4] -> [204, 1, 72 *4]
Squeeze Squeeze_0 0.50ms | [204, 1, 72 *4] -> [204, 288 *1]
Permute Transpose_3 0.63ms | [204, 288 *1] -> [288, 204 *1]
LSTM LSTM_0 45.92ms |
LSTM LSTM_4 18.50ms |
InnerProduct MatMul_0 71.96ms | [ 96, 204 *1] -> [6625, 51 *4]
MemoryData ctc_fc_b_attr 0.02ms |
BinaryOp Add_43 3.69ms |
Softmax Softmax_0 10.24ms | [6625, 51 *4] -> [6625, 51 *4]
816x32 avg = 268.85
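To see which layers account for the gap between the two averages, the two logs can be compared layer by layer. The following is a minimal sketch, assuming each per-layer line has the form `Type Name X.XXms |` as in the logs above; the helper names `parse_layers` and `regressions` are made up for illustration and are not part of ncnn. Only three lines from each log are pasted in to keep the example short.

```python
import re

# Matches per-layer lines such as "InnerProduct MatMul_0 47.91ms | ..."
LAYER_RE = re.compile(r"^(\S+)\s+(\S+)\s+([0-9.]+)ms\s*\|")


def parse_layers(log_text):
    """Return {layer_name: (layer_type, milliseconds)} for every per-layer line."""
    layers = {}
    for line in log_text.splitlines():
        m = LAYER_RE.match(line.strip())
        if m:
            layers[m.group(2)] = (m.group(1), float(m.group(3)))
    return layers


def regressions(old_log, new_log, top=5):
    """List (name, old_ms, new_ms, delta_ms), largest increase first."""
    old, new = parse_layers(old_log), parse_layers(new_log)
    deltas = [
        (name, old[name][1], new[name][1], new[name][1] - old[name][1])
        for name in old
        if name in new
    ]
    deltas.sort(key=lambda d: d[3], reverse=True)
    return deltas[:top]


# Three lines copied from the old and new logs above.
old_log = """\
LSTM LSTM_0 47.92ms |
InnerProduct MatMul_0 47.91ms | [ 96, 204 *1] -> [6625, 51 *4]
Softmax Softmax_0 10.79ms | [6625, 51 *4] -> [6625, 51 *4]
"""
new_log = """\
LSTM LSTM_0 45.92ms |
InnerProduct MatMul_0 71.96ms | [ 96, 204 *1] -> [6625, 51 *4]
Softmax Softmax_0 10.24ms | [6625, 51 *4] -> [6625, 51 *4]
"""

for name, before, after, delta in regressions(old_log, new_log):
    print(f"{name}: {before:.2f}ms -> {after:.2f}ms ({delta:+.2f}ms)")
```

Feeding in the full logs instead of the three-line excerpts ranks every layer by how much slower it became, which makes a single outlier such as MatMul_0 easy to separate from the smaller, broadly distributed increases.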
Build / runtime environment
Build environment: ubuntu 18.04 / android-ndk-r19c / cmake 3.10.2
Runtime environment: android 8.1 / spreadtrum sc9832e / 4x ARM Cortex-A53 @ 1400MHz
Old ncnn commit id: 6b2495cc243f2d8e829523b700f32db1f5d50f78
New ncnn commit id: 6fd801b6d76af8e0ed7f5f3f3b088855f832996b
Steps to reproduce
1. Modify CMakeLists.txt
option(NCNN_BENCHMARK "print benchmark information for every layer" ON)
2. Modify benchmark/benchncnn.cpp
......
ncnn::Extractor ex = net.create_extractor();
ex.input(input_names[0], in);
ex.extract(output_names[output_names.size()-1], out);
......
benchmark("test_model", ncnn::Mat(816, 32, 3), opt);
3. Build
cd $WORK_DIR/ncnn
mkdir -p build-android-armv7
cd build-android-armv7
rm -rf *
cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI="armeabi-v7a" -DANDROID_ARM_NEON=ON -DANDROID_PLATFORM=android-16 ..
make -j8
4. Run
adb push test_model.param /data/local/tmp/
adb push benchncnn /data/local/tmp/
adb shell "chmod 0755 /data/local/tmp/benchncnn"
adb shell "cd /data/local/tmp; ./benchncnn 1 4 0 -1 1"
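5. Check the printed log
Other
The test model file is in the attachment: test_model.zip. With the binary built from the new version, total network inference time is longer, and the gap in the MatMul_0 layer's time is especially large. Could you please help look into the cause?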
https://github.com/Tencent/ncnn/pull/3799 (#3799) After verification, this turned out to be caused by the compiler optimizing poorly for armv7, so I rewrote the code.
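After testing, the time of the MatMul_0 layer has indeed come down and is close to that of the original version. However, the average time over 100 runs of the benchmark program on this model is very unstable: with the same model, same input, same run count, and same device, the average often differs by several tens of milliseconds from one benchmark run to the next. The Android software environment is essentially clean, with no installed software that takes up system resources.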
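One way to put a number on this run-to-run spread is to capture the output of several benchmark invocations and summarize the reported averages. The sketch below assumes each captured output contains a line like `816x32 avg = 226.51`, as in the logs above; the function names are hypothetical and the three sample strings are placeholders, not real measurements.

```python
import re
import statistics

# Matches the summary line printed at the end of a run, e.g. "816x32 avg = 226.51"
AVG_RE = re.compile(r"avg\s*=\s*([0-9.]+)")


def collect_averages(outputs):
    """Pull the 'avg = ...' value out of each captured benchmark output."""
    values = []
    for text in outputs:
        m = AVG_RE.search(text)
        if m:
            values.append(float(m.group(1)))
    return values


def summarize(values):
    """Return simple spread statistics for a list of per-run averages (ms)."""
    return {
        "runs": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


# Placeholder outputs for illustration only, NOT real measurements.
outputs = [
    "816x32 avg = 230.00",
    "816x32 avg = 260.00",
    "816x32 avg = 240.00",
]

summary = summarize(collect_averages(outputs))
print(summary)
```

Comparing the median or minimum across builds, rather than a single average, is usually less sensitive to occasional slow runs, though it does not explain where the variation comes from.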
Testing shows that ConvolutionDepthWise takes noticeably longer than Convolution???
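To check this observation against a whole log rather than individual lines, the per-layer times can be grouped by layer type. This is a minimal sketch under the same assumption about the log line format (`Type Name X.XXms |`); `time_by_type` is a made-up helper name, and the four sample lines are copied from the old-version log above.

```python
import re
from collections import defaultdict

# Matches per-layer lines such as "ConvolutionDepthWise Conv_18 2.28ms | ..."
LAYER_RE = re.compile(r"^(\S+)\s+(\S+)\s+([0-9.]+)ms\s*\|")


def time_by_type(log_text):
    """Return {layer_type: (count, total_ms, mean_ms)}."""
    buckets = defaultdict(list)
    for line in log_text.splitlines():
        m = LAYER_RE.match(line.strip())
        if m:
            buckets[m.group(1)].append(float(m.group(3)))
    return {t: (len(v), sum(v), sum(v) / len(v)) for t, v in buckets.items()}


# Four lines copied from the old-version log above.
sample = """\
Convolution Conv_17 1.16ms | [408, 4, 6 *4] -> [408, 4, 30 *4] kernel: 1 x 1 stride: 1 x 1
ConvolutionDepthWise Conv_18 2.28ms | [408, 4, 30 *4] -> [408, 4, 30 *4] kernel: 5 x 5 stride: 1 x 1
Convolution Conv_22 1.17ms | [408, 4, 6 *4] -> [408, 4, 30 *4] kernel: 1 x 1 stride: 1 x 1
ConvolutionDepthWise Conv_23 2.23ms | [408, 4, 30 *4] -> [408, 4, 30 *4] kernel: 5 x 5 stride: 1 x 1
"""

stats = time_by_type(sample)
for layer_type, (count, total, mean) in sorted(stats.items()):
    print(f"{layer_type}: n={count} total={total:.2f}ms mean={mean:.2f}ms")
```

Note that in these logs the Convolution layers mostly use 1 x 1 kernels while the ConvolutionDepthWise layers use 3 x 3 or 5 x 5 kernels, so a per-type mean compares layers doing different amounts of work per element.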
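This problem may have several causes and is hard to track down, and it may not be an ncnn problem at all. The difference is generally not noticeable at the application level, so let's shelve it for now.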