The last channel handling logic in the SSE version of ShuffleChannel cause buffer-overflow
context
In SSE version of ShuffleChannel, the last channel is shuffled from the offset with the half of the granularity.
Here, it causes buffer-overflow at the load of the ptr1 at the last iteration of the for loop.
For example, in the case of (elempack == 4) && (_group == 2 && channels % _group != 0) in AVX512 optimization,
The ptr1 initially can be accessed for the range [ptr1, ptr1+4*size),
and the range is reduced into [ptr1, ptr1+4*size-2) after ptr1 += 2;.
However, at the last iteration of the for loop, it loads [ptr1+4*size, ptr1+4*(size+1)) to _p1, which leads to buffer-overflow.
Since it causes both buffer-overflow read (ptr1) and buffer-overflow write (outptr), it could lead to incorrect result of the model.
{
const float* ptr0 = bottom_blob.channel(channels_per_group);
const float* ptr1 = bottom_blob.channel(channels_per_group * 2);
float* outptr = top_blob.channel(channels_per_group * 2);
ptr1 += 2;
for (int i = 0; i < size; i++)
{
__m128 _p0 = _mm_loadu_ps(ptr0);
__m128 _p1 = _mm_loadu_ps(ptr1);
__m128 _lo = _mm_unpacklo_ps(_p0, _p1);
_mm_storeu_ps(outptr, _lo);
ptr0 += 4;
ptr1 += 4;
outptr += 4;
}
}
x86 https://github.com/Tencent/ncnn/blob/master/src/layer/x86/shufflechannel_x86.cpp#L117 https://github.com/Tencent/ncnn/blob/master/src/layer/x86/shufflechannel_x86.cpp#L373 https://github.com/Tencent/ncnn/blob/master/src/layer/x86/shufflechannel_x86.cpp#L608
arm https://github.com/Tencent/ncnn/blob/master/src/layer/arm/shufflechannel_arm.cpp#L118 https://github.com/Tencent/ncnn/blob/master/src/layer/arm/shufflechannel_arm.cpp#L365 https://github.com/Tencent/ncnn/blob/master/src/layer/arm/shufflechannel_arm.cpp#L599
how to reproduce
- Build with SSE in x86 or arm
- ./test_shufflechannel
more
I will open a PR of the patch for this:)
test_shufflechannel did not go wrong, and I did not find any memory anomalies using valgrind.
When I reviewed this issue again, I thought that only the last load exceeded some ranges, which is also allowed. When allocating ncnn Mat, 64 bytes will be deliberately allocated to allow this behavior. The last store out of range is not allowed because it may overwrite the reference count area, but I think the current code logic does not write data out of bounds
Can you provide a minimal reproducible example of the problem ?
see https://github.com/Tencent/ncnn/blob/cf6c700b6ab8a7353207f89aa361177c289a4130/src/allocator.h#L47