The last channel handling logic in the SSE version of ShuffleChannel cause buffer-overflow

Open junwha opened this issue 1 year ago • 1 comments

context

In SSE version of ShuffleChannel, the last channel is shuffled from the offset with the half of the granularity. Here, it causes buffer-overflow at the load of the ptr1 at the last iteration of the for loop.

For example, in the case of (elempack == 4) && (_group == 2 && channels % _group != 0) in AVX512 optimization,

The ptr1 initially can be accessed for the range [ptr1, ptr1+4*size), and the range is reduced into [ptr1, ptr1+4*size-2) after ptr1 += 2;. However, at the last iteration of the for loop, it loads [ptr1+4*size, ptr1+4*(size+1)) to _p1, which leads to buffer-overflow.

Since it causes both buffer-overflow read (ptr1) and buffer-overflow write (outptr), it could lead to incorrect result of the model.

{
      const float* ptr0 = bottom_blob.channel(channels_per_group);
      const float* ptr1 = bottom_blob.channel(channels_per_group * 2);
      float* outptr = top_blob.channel(channels_per_group * 2);

      ptr1 += 2;

      for (int i = 0; i < size; i++)
      {
          __m128 _p0 = _mm_loadu_ps(ptr0);
          __m128 _p1 = _mm_loadu_ps(ptr1);

          __m128 _lo = _mm_unpacklo_ps(_p0, _p1);

          _mm_storeu_ps(outptr, _lo);

          ptr0 += 4;
          ptr1 += 4;
          outptr += 4;
      }
  }

x86 https://github.com/Tencent/ncnn/blob/master/src/layer/x86/shufflechannel_x86.cpp#L117 https://github.com/Tencent/ncnn/blob/master/src/layer/x86/shufflechannel_x86.cpp#L373 https://github.com/Tencent/ncnn/blob/master/src/layer/x86/shufflechannel_x86.cpp#L608

arm https://github.com/Tencent/ncnn/blob/master/src/layer/arm/shufflechannel_arm.cpp#L118 https://github.com/Tencent/ncnn/blob/master/src/layer/arm/shufflechannel_arm.cpp#L365 https://github.com/Tencent/ncnn/blob/master/src/layer/arm/shufflechannel_arm.cpp#L599

how to reproduce

Build with SSE in x86 or arm
./test_shufflechannel

I will open a PR of the patch for this:)

Oct 14 '24 22:10 junwha

test_shufflechannel did not go wrong, and I did not find any memory anomalies using valgrind.

When I reviewed this issue again, I thought that only the last load exceeded some ranges, which is also allowed. When allocating ncnn Mat, 64 bytes will be deliberately allocated to allow this behavior. The last store out of range is not allowed because it may overwrite the reference count area, but I think the current code logic does not write data out of bounds

Can you provide a minimal reproducible example of the problem ?

see https://github.com/Tencent/ncnn/blob/cf6c700b6ab8a7353207f89aa361177c289a4130/src/allocator.h#L47

Apr 27 '25 02:04 nihui

The last channel handling logic in the SSE version of ShuffleChannel cause buffer-overflow

context

how to reproduce

more