ComputeLibrary
Why is 1D convolution on CPU via NEConvolutionLayer so slow?
Benchmark details: 1D convolution (no padding) of a 2^20-wide 1D input signal with a length-3 kernel, i.e. out[i] = w[0]*in[i] + w[1]*in[i+1] + w[2]*in[i+2]. Both input and output channels are 1. There is no bias term.
% strings arm_compute-v24.06-bin-android-arm64-v8a-neon/lib/arm64-v8a-neon-asserts/libarm_compute.so | grep arm_compute_version
arm_compute_version=v24.06 Build options: {'arch': 'arm64-v8a', 'neon': '1', 'opencl': '0', 'os': 'android', 'build_dir': 'arm64-v8a-neon-asserts', 'asserts': '1', 'Werror': '1', 'embed_kernels': '1'} Git hash=unknown
Here's my benchmark: benchmark_acl.cpp
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "utils/Utils.h"
#include "arm_compute/runtime/NEON/functions/NEDeconvolutionLayer.h"
#include <chrono>
#include <iostream>
using namespace std;
using namespace arm_compute;
struct Timer {
    std::chrono::time_point<std::chrono::high_resolution_clock> start;
    Timer() {
        start = std::chrono::high_resolution_clock::now();
    }
    ~Timer() {
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> duration = end - start;
        std::cout << "time " << duration.count() << '\n';
    }
};
int main()
{
    Tensor conv_input;
    Tensor conv_weight;
    Tensor conv_bias;
    Tensor conv_output;

    const int N  = 1;
    const int Hi = 1;
    const int Wi = 1 << 20;
    const int Ci = 1;
    const int Hf = 1;
    const int Wf = 3;
    const int Ho = Hi - Hf + 1;
    const int Wo = Wi - Wf + 1;
    const int Co = 1;

    conv_input.allocator()->init(TensorInfo(TensorShape(Hi, Wi, Ci), 1, DataType::F32, DataLayout::NHWC));
    conv_weight.allocator()->init(TensorInfo(TensorShape(Hf, Wf, Ci, Co), 1, DataType::F32, DataLayout::NHWC));
    // conv_bias.allocator()->init(TensorInfo(TensorShape(Co), 1, DataType::F32));
    conv_output.allocator()->init(TensorInfo(TensorShape(Ho, Wo, Co), 1, DataType::F32, DataLayout::NHWC));

    conv_input.allocator()->allocate();
    conv_weight.allocator()->allocate();
    // conv_bias.allocator()->allocate();
    conv_output.allocator()->allocate();

    for (size_t i = 0; i < conv_input.info()->tensor_shape().total_size(); ++i) {
        ((float *)conv_input.buffer())[i] = i + 1;
    }
    for (size_t i = 0; i < conv_weight.info()->tensor_shape().total_size(); ++i) {
        ((float *)conv_weight.buffer())[i] = i + 1;
    }
    NEConvolutionLayer conv;

    // enum class ConvolutionMethod
    // {
    //     GEMM,        /**< Convolution using GEMM */
    //     GEMM_CONV2D, /**< Direct 2D GEMM convolution */
    //     DIRECT,      /**< Direct convolution */
    //     INDIRECT,    /**< Indirect convolution */
    //     WINOGRAD,    /**< Convolution using Winograd */
    //     FFT          /**< Convolution using FFT */
    // };
    // prints the index of the chosen method: 0 for GEMM, 1 for GEMM_CONV2D, ...
    cout << (int)NEConvolutionLayer::get_convolution_method(conv_input.info(), conv_weight.info(),
                                                            conv_output.info(),
                                                            PadStrideInfo(1, 1, 0, 0),
                                                            WeightsInfo(),
                                                            Size2D(1U, 1U),
                                                            ActivationLayerInfo(),
                                                            true) << endl;

    conv.configure(&conv_input,
                   &conv_weight,
                   nullptr,
                   &conv_output,
                   PadStrideInfo(1, 1, 0, 0),
                   WeightsInfo(),
                   Size2D(1U, 1U),
                   ActivationLayerInfo(),
                   true // fast math enabled
                   );
    {
        Timer timer;
        conv.run();
    }

    // verify first 5 elements of output
    for (size_t i = 0; i < conv_output.info()->tensor_shape().total_size() && i < 5; ++i) {
        cout << ((float *)conv_output.buffer())[i] << ' ';
    }
    cout << endl;

    // compute sum of output to prevent the compiler from removing the convolution calculation
    float sum = 0;
    for (size_t i = 0; i < conv_output.info()->tensor_shape().total_size(); ++i) {
        sum += ((float *)conv_output.buffer())[i];
    }
    cout << sum << endl;

    return 0;
}
Output:
0
time 0.0481771
14 20 26 32 38
3.29949e+12
The 0 means the first enum value, GEMM, is being used. Convolving 1, 2, 3, ... with the kernel 1, 2, 3 gives 14, 20, 26, 32, 38, ... (e.g. 1*1 + 2*2 + 3*3 = 14, 2*1 + 3*2 + 4*3 = 20), so the correct answer is being computed.
Why is it so slow?
For reference, my own hand-written direct 1D convolution implementation achieves time 0.00166391, roughly 29x faster. That was without OpenMP multithreading, just a plain single-threaded implementation with compiler optimizations (a minimal sketch of that kind of kernel is shown after the questions below).
What could be the reason for this?
- Are my arm_compute::Tensor initializations correct for NHWC?
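For context, here is a minimal sketch of what I mean by a plain direct implementation (hypothetical code, not my exact kernel; conv1d_direct is just an illustrative name). It computes the same valid-mode 1D convolution over contiguous float buffers, single-threaded:

// Hypothetical baseline sketch, not the exact benchmark code: direct
// valid-mode 1D convolution (cross-correlation, as conv layers compute)
// over contiguous float buffers, single-threaded.
#include <cstddef>
#include <iostream>
#include <vector>

static void conv1d_direct(const std::vector<float> &in,
                          const std::vector<float> &w,
                          std::vector<float> &out)
{
    const std::size_t Wo = in.size() - w.size() + 1; // valid output width
    out.resize(Wo);
    for (std::size_t i = 0; i < Wo; ++i)
    {
        float acc = 0.f;
        for (std::size_t k = 0; k < w.size(); ++k)
            acc += in[i + k] * w[k];
        out[i] = acc;
    }
}

int main()
{
    std::vector<float> in(1 << 20), w{1.f, 2.f, 3.f}, out;
    for (std::size_t i = 0; i < in.size(); ++i)
        in[i] = i + 1; // input is 1, 2, 3, ...
    conv1d_direct(in, w, out);
    std::cout << out[0] << ' ' << out[1] << '\n'; // expect 14 20
    return 0;
}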
Also, here is ARM Compute Library's Direct Conv performance:
arm_compute::NEDirectConvolutionLayer conv;
conv.configure(&conv_input,
               &conv_weight,
               nullptr,
               &conv_output,
               PadStrideInfo(1, 1, 0, 0),
               ActivationLayerInfo()
               );
{
    Timer timer;
    conv.run();
}
Output: time 0.0609249
My device info:
CPU Support ARM NEON: Yes
CPU Support ARM BF16: No
CPU Support ARM EDSP: No
CPU Support ARM VFPV4: Yes
CPU Support ARM ASIMDHP: Yes
CPU Support ARM CPUID: Yes
CPU Support ARM ASIMDDP: Yes
CPU Support ARM ASIMDFHM: No
CPU Support ARM I8MM: No
CPU Support ARM SVE: No
CPU Support ARM SVE2: No
CPU Support ARM SVEBF16: No
CPU Support ARM SVEI8MM: No
CPU Support ARM SVEF32MM: No
RISCV: No
RISCV ZFH: No
RISCV vector length in bytes: 0
CPU COUNT: 8
LITTLE CPU COUNT: 4
BIG CPU COUNT: 4
PHYSICAL CPU COUNT: 8
PHYSICAL LITTLE CPU COUNT: 4
PHYSICAL BIG CPU COUNT: 4
CPU LEVEL2 cache size: 256 KB
CPU LEVEL3 cache size: 0 KB
I compiled against the latest Android CPU binary release:
aarch64-linux-android26-clang++ **-Ofast -ffast-math src/benchmark_acl.cpp** arm_compute-v24.06-bin-android-arm64-v8a-neon/utils/Utils.cpp -Iarm_compute-v24.06-bin-android-arm64-v8a-neon -Iarm_compute-v24.06-bin-android-arm64-v8a-neon/include -std=c++14 -Larm_compute-v24.06-bin-android-arm64-v8a-neon -L arm_compute-v24.06-bin-android-arm64-v8a-neon/lib/arm64-v8a-neon/ -larm_compute-static -o bin/benchmark_acl -static-libstdc++ -pie