ComputeLibrary
Why is 1D convolution on CPU via NEConvolutionLayer so slow?
Benchmark details: 1D convolution (no padding) of a 2^20-wide 1D input signal with a length-3 kernel, i.e. out[i] = w[0]*in[i] + w[1]*in[i+1] + w[2]*in[i+2]. Both input and output channels are 1. There is no bias term.
% strings arm_compute-v24.06-bin-android-arm64-v8a-neon/lib/arm64-v8a-neon-asserts/libarm_compute.so | grep arm_compute_version
arm_compute_version=v24.06 Build options: {'arch': 'arm64-v8a', 'neon': '1', 'opencl': '0', 'os': 'android', 'build_dir': 'arm64-v8a-neon-asserts', 'asserts': '1', 'Werror': '1', 'embed_kernels': '1'} Git hash=unknown
Here's my benchmark: benchmark_acl.cpp
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "utils/Utils.h"
#include "arm_compute/runtime/NEON/functions/NEDeconvolutionLayer.h"
#include <chrono>
#include <iostream>
using namespace std;
using namespace arm_compute;
struct Timer {
    std::chrono::time_point<std::chrono::high_resolution_clock> start;
    Timer() {
        start = std::chrono::high_resolution_clock::now();
    }
    ~Timer() {
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> duration = end - start;
        std::cout << "time " << duration.count() << '\n';
    }
};
int main()
{
    Tensor conv_input;
    Tensor conv_weight;
    Tensor conv_bias;
    Tensor conv_output;

    const int N  = 1;
    const int Hi = 1;
    const int Wi = 1 << 20;
    const int Ci = 1;
    const int Hf = 1;
    const int Wf = 3;
    const int Ho = Hi - Hf + 1;
    const int Wo = Wi - Wf + 1;
    const int Co = 1;

    conv_input.allocator()->init(TensorInfo(TensorShape(Hi, Wi, Ci), 1, DataType::F32, DataLayout::NHWC));
    conv_weight.allocator()->init(TensorInfo(TensorShape(Hf, Wf, Ci, Co), 1, DataType::F32, DataLayout::NHWC));
    // conv_bias.allocator()->init(TensorInfo(TensorShape(Co), 1, DataType::F32));
    conv_output.allocator()->init(TensorInfo(TensorShape(Ho, Wo, Co), 1, DataType::F32, DataLayout::NHWC));

    conv_input.allocator()->allocate();
    conv_weight.allocator()->allocate();
    // conv_bias.allocator()->allocate();
    conv_output.allocator()->allocate();

    for (size_t i = 0; i < conv_input.info()->tensor_shape().total_size(); ++i) {
        ((float *)conv_input.buffer())[i] = i + 1;
    }
    for (size_t i = 0; i < conv_weight.info()->tensor_shape().total_size(); ++i) {
        ((float *)conv_weight.buffer())[i] = i + 1;
    }
    NEConvolutionLayer conv;

    // enum class ConvolutionMethod
    // {
    //     GEMM,        /**< Convolution using GEMM */
    //     GEMM_CONV2D, /**< Direct 2D GEMM convolution */
    //     DIRECT,      /**< Direct convolution */
    //     INDIRECT,    /**< Indirect convolution */
    //     WINOGRAD,    /**< Convolution using Winograd */
    //     FFT          /**< Convolution using FFT */
    // };
    // prints the index of the chosen method: 0 for GEMM, 1 for GEMM_CONV2D, ...
    cout << (int)NEConvolutionLayer::get_convolution_method(conv_input.info(), conv_weight.info(),
                                                            conv_output.info(),
                                                            PadStrideInfo(1, 1, 0, 0),
                                                            WeightsInfo(),
                                                            Size2D(1U, 1U),
                                                            ActivationLayerInfo(),
                                                            true) << endl;

    conv.configure(&conv_input,
                   &conv_weight,
                   nullptr,
                   &conv_output,
                   PadStrideInfo(1, 1, 0, 0),
                   WeightsInfo(),
                   Size2D(1U, 1U),
                   ActivationLayerInfo(),
                   true // fast math enabled
                   );
    {
        Timer timer;
        conv.run();
    }

    // verify first 5 elements of output
    for (size_t i = 0; i < conv_output.info()->tensor_shape().total_size() && i < 5; ++i) {
        cout << ((float *)conv_output.buffer())[i] << ' ';
    }
    cout << endl;

    // compute sum of output to prevent the compiler from removing the convolution calculation
    float sum = 0;
    for (size_t i = 0; i < conv_output.info()->tensor_shape().total_size(); ++i) {
        sum += ((float *)conv_output.buffer())[i];
    }
    cout << sum << endl;

    return 0;
}
Output:
0
time 0.0481771
14 20 26 32 38
3.29949e+12
The 0 means the first enum value, GEMM, is being used. Convolving 1, 2, 3, ... with the kernel 1, 2, 3 gives 14, 20, 26, 32, 38, ... (e.g. 1*1 + 2*2 + 3*3 = 14, 2*1 + 3*2 + 4*3 = 20), so the correct answer is being computed.
Why is it so slow?
For reference, my own hand-written direct 1D convolution implementation achieves time 0.00166391, roughly 29x faster. That was without OpenMP multithreading, just a plain single-threaded implementation with compiler optimizations (a minimal sketch of that kind of kernel is shown after the questions below).
What could be the reason for this?
- Are my arm_compute::Tensor initializations correct for NHWC?
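For context, here is a minimal sketch of what I mean by a plain direct implementation (hypothetical code, not my exact kernel; conv1d_direct is just an illustrative name). It computes the same valid-mode 1D convolution over contiguous float buffers, single-threaded:

// Hypothetical baseline sketch, not the exact benchmark code: direct
// valid-mode 1D convolution (cross-correlation, as conv layers compute)
// over contiguous float buffers, single-threaded.
#include <cstddef>
#include <iostream>
#include <vector>

static void conv1d_direct(const std::vector<float> &in,
                          const std::vector<float> &w,
                          std::vector<float> &out)
{
    const std::size_t Wo = in.size() - w.size() + 1; // valid output width
    out.resize(Wo);
    for (std::size_t i = 0; i < Wo; ++i)
    {
        float acc = 0.f;
        for (std::size_t k = 0; k < w.size(); ++k)
            acc += in[i + k] * w[k];
        out[i] = acc;
    }
}

int main()
{
    std::vector<float> in(1 << 20), w{1.f, 2.f, 3.f}, out;
    for (std::size_t i = 0; i < in.size(); ++i)
        in[i] = i + 1; // input is 1, 2, 3, ...
    conv1d_direct(in, w, out);
    std::cout << out[0] << ' ' << out[1] << '\n'; // expect 14 20
    return 0;
}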
Also, here is ARM Compute Library's Direct Conv performance:
arm_compute::NEDirectConvolutionLayer conv;
conv.configure(&conv_input,
               &conv_weight,
               nullptr,
               &conv_output,
               PadStrideInfo(1, 1, 0, 0),
               ActivationLayerInfo()
               );
{
    Timer timer;
    conv.run();
}
Output: time 0.0609249
My device info:
CPU Support ARM NEON: Yes
CPU Support ARM BF16: No
CPU Support ARM EDSP: No
CPU Support ARM VFPV4: Yes
CPU Support ARM ASIMDHP: Yes
CPU Support ARM CPUID: Yes
CPU Support ARM ASIMDDP: Yes
CPU Support ARM ASIMDFHM: No
CPU Support ARM I8MM: No
CPU Support ARM SVE: No
CPU Support ARM SVE2: No
CPU Support ARM SVEBF16: No
CPU Support ARM SVEI8MM: No
CPU Support ARM SVEF32MM: No
RISCV: No
RISCV ZFH: No
RISCV vector length in bytes: 0
CPU COUNT: 8
LITTLE CPU COUNT: 4
BIG CPU COUNT: 4
PHYSICAL CPU COUNT: 8
PHYSICAL LITTLE CPU COUNT: 4
PHYSICAL BIG CPU COUNT: 4
CPU LEVEL2 cache size: 256 KB
CPU LEVEL3 cache size: 0 KB
I compiled against the latest Android CPU binary release:
aarch64-linux-android26-clang++ **-Ofast -ffast-math src/benchmark_acl.cpp** arm_compute-v24.06-bin-android-arm64-v8a-neon/utils/Utils.cpp -Iarm_compute-v24.06-bin-android-arm64-v8a-neon -Iarm_compute-v24.06-bin-android-arm64-v8a-neon/include -std=c++14 -Larm_compute-v24.06-bin-android-arm64-v8a-neon -L arm_compute-v24.06-bin-android-arm64-v8a-neon/lib/arm64-v8a-neon/ -larm_compute-static -o bin/benchmark_acl -static-libstdc++ -pie