
[Bug?] WarpReduce: Unexpected results with logical warp size < 32

Open RaulPPelaez opened this issue 5 years ago • 2 comments

Not sure if this is a bug or a feature, but it surely is not the behavior the docs suggest. I tried this with cub 1.8.0 and CUDA 10.1 and 9.2.

Say I have a single block with warpSize*n threads, and each thread holds some value thread_data = 1. I set up a WarpReduce with a logical warp size of warpSize and then perform the sum operation for each logical warp. I would expect the aggregate to be warpSize in each first logical lane (threadIdx.x % warpSize == 0). The docs hint at this behavior, and it is indeed what happens when warpSize is a power of two (32, 16, 8, ...). However, some warpSize values produce unexpected results. Look at this MWE:

#include <cub/cub.cuh>
#include <cstdio>

template<int warpSize, int warpsPerBlock>
__global__ void kern(){
  using WarpReduce = cub::WarpReduce<int, warpSize>;
  // One TempStorage per logical warp
  __shared__ typename WarpReduce::TempStorage temp_storage[warpsPerBlock];
  const auto warp_id = threadIdx.x/warpSize;
  const auto lane = threadIdx.x%warpSize;
  int thread_data = 1;
  // Sum across the logical warp; lane 0 of each logical warp receives the aggregate
  int aggregate = WarpReduce(temp_storage[warp_id]).Sum(thread_data);
  if(lane == 0){
    printf("thread: %d, warp_id: %d,  aggregate: %d\n", threadIdx.x, warp_id, aggregate);
  }
}

int main(){
  constexpr int warpSize = 7;
  constexpr int warpsPerBlock = 7;
  int nblocks = 1;
  kern<warpSize, warpsPerBlock><<<nblocks, warpSize*warpsPerBlock>>>();
  cudaDeviceSynchronize();
  return 0;
}

Running it results in the following output:

$ nvcc -arch=sm_52  mwe.cu -run
thread: 35, warp_id: 5,  aggregate: 4
thread: 42, warp_id: 6,  aggregate: 4
thread: 0, warp_id: 0,  aggregate: 7
thread: 7, warp_id: 1,  aggregate: 7
thread: 14, warp_id: 2,  aggregate: 7
thread: 21, warp_id: 3,  aggregate: 7
thread: 28, warp_id: 4,  aggregate: 4

The same result is obtained for different architectures. Change warpSize to 8 and you get aggregate = 8 for every lane-0 thread. Change it to any value between 9 and 15 and the odd behavior reappears past the first few logical warps. Then 16 works fine, and the same odd behavior arises for the values in between until warpSize = 32.

The docs seem to imply that the logical warp size should not be greater than the real warp size, so anything beyond that I consider expected. They also seem to imply that the logical warp size does not have to evenly divide 32, since under Performance Considerations they only say: "[...] Computation is slightly more efficient for [...] The architecture's warp size is a whole multiple of LOGICAL_WARP_THREADS".

RaulPPelaez · Feb 03 '20 13:02

Hi @RaulPPelaez: please see #380. I think I have found the problem. It is a lane-ID computation error in the shared-memory warp reduction when the logical warp size is not a power of two. When a logical warp lives in two physical warps, the result is incorrect. For example, with 7 threads per logical warp and 5 warps per block, the lane IDs of the last logical warp are:

threadId            28 29 30 31 32 33 34
lane ID (incorrect)  0  1  2  3  0  1  2
lane ID (correct)    0  1  2  3  4  5  6
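
As a quick check, the two formulas can be compared on the host for the straddling warp above. This is only an illustration of the arithmetic, not CUB code:

#include <cstdio>

int main()
{
  constexpr int LOGICAL_WARP_THREADS = 7;
  // Threads 28..34 form the last logical warp; threads 32..34 live in the next hardware warp.
  for (int tid = 28; tid <= 34; ++tid) {
    int physical_lane = tid % 32;                              // what LaneId() returns on the device
    int incorrect     = physical_lane % LOGICAL_WARP_THREADS;  // wraps back to 0 at thread 32
    int correct       = tid % LOGICAL_WARP_THREADS;            // stays 0..6 across the logical warp
    printf("threadId %2d: incorrect %d, correct %d\n", tid, incorrect, correct);
  }
  return 0;
}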

After fixing this bug, the result is correct.

Here is the relevant code, the constructor of WarpReduceSmem:

/// Constructor
__device__ __forceinline__ WarpReduceSmem(
    TempStorage &temp_storage)
:
    temp_storage(temp_storage.Alias()),

    lane_id(IS_ARCH_WARP ?
        LaneId() :
        // LaneId() % LOGICAL_WARP_THREADS),  // incorrect: gives the wrong lane_id
        threadIdx.x % LOGICAL_WARP_THREADS),  // correct

    member_mask((0xffffffff >> (32 - LOGICAL_WARP_THREADS)) << ((IS_ARCH_WARP || !IS_POW_OF_TWO) ?
        0 : // arch-width and non-power-of-two subwarps cannot be tiled with the arch-warp
        ((LaneId() / LOGICAL_WARP_THREADS) * LOGICAL_WARP_THREADS)))
{}

lane_id is later used as an array index to pick the peer's data from shared memory. With the incorrect lane ID, the lanes of a logical warp that straddles two physical warps collide on the same shared-memory slots and read from the wrong offsets, which leads to the wrong aggregates seen in the output above:

// Update input if peer_addend is in range
if ((ALL_LANES_VALID && IS_POW_OF_TWO) || ((lane_id + OFFSET) < valid_items))
{
    // lane_id is used to compute the index of the peer's data in shared memory
    T peer_addend = ThreadLoad<LOAD_VOLATILE>(&temp_storage.reduce[lane_id + OFFSET]);
    input = reduction_op(input, peer_addend);
}

YinLiu-91 · Sep 25 '21 15:09

Hello, @RaulPPelaez, @YinLiu-91! Thank you for noting this. Unfortunately, this behaviour is expected but ill-documented. First of all, if we partition a hardware warp into tiles of any non-power-of-two size, we end up with the case below.

[image: a hardware warp partitioned into non-power-of-two logical warps, with one logical warp spanning two hardware warps]

It means that warp synchronization can't be used anymore. We'd have to use thread block synchronization to fix possible data races, which indicates that this case is out of warp scope.

But even if we used this synchronization, we'd have to use the thread ID within the block to compute the virtual warp ID. To compute it, we'd have to know the thread block dimensions, which is another indication that this case is outside the warp-scope layer.
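
To make that concrete, here is a minimal sketch (not CUB code; the kernel name and layout are hypothetical) of a block-scope reduction over logical "warps" of an arbitrary size. It relies on __syncthreads() instead of warp synchronization and derives the group and lane indices from the block-level thread ID, as described above:

#include <cstdio>

// Hypothetical kernel: a block-scope sum over logical "warps" of W threads each.
template <int W, int GROUPS_PER_BLOCK>
__global__ void group_sum()
{
  __shared__ int smem[W * GROUPS_PER_BLOCK];
  const int group = threadIdx.x / W;   // virtual warp id, derived from the block-level thread id
  const int lane  = threadIdx.x % W;   // lane id within the virtual warp
  smem[threadIdx.x] = 1;               // thread_data = 1, as in the MWE above
  __syncthreads();                     // block-wide sync: a logical warp may span two hardware warps
  if (lane == 0) {
    int aggregate = 0;
    for (int i = 0; i < W; ++i)        // serial sum over the logical warp
      aggregate += smem[group * W + i];
    printf("thread: %d, warp_id: %d,  aggregate: %d\n", threadIdx.x, group, aggregate);
  }
}

int main()
{
  group_sum<7, 7><<<1, 7 * 7>>>();     // same configuration as the MWE: 7-thread logical warps, 7 per block
  cudaDeviceSynchronize();
  return 0;
}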

Tests cover non-power-of-two modes for these classes here and here. The only difference is that there's only one virtual warp per hardware warp in this mode, so there is no tiling of the hardware warp with virtual ones.
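
For completeness, a minimal sketch of that supported mode, assuming a single 7-thread logical warp per block so that no tiling of the hardware warp occurs (the kernel name is hypothetical):

#include <cub/cub.cuh>
#include <cstdio>

// Hypothetical kernel: exactly one non-power-of-two logical warp per block.
__global__ void single_virtual_warp()
{
  using WarpReduce = cub::WarpReduce<int, 7>;
  __shared__ typename WarpReduce::TempStorage temp_storage;
  int thread_data = 1;
  int aggregate = WarpReduce(temp_storage).Sum(thread_data);
  if (threadIdx.x == 0) {
    printf("aggregate: %d\n", aggregate);  // expected: 7
  }
}

int main()
{
  single_virtual_warp<<<1, 7>>>();  // one logical warp per block, no tiling
  cudaDeviceSynchronize();
  return 0;
}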

I wish the documentation stated this more clearly.

gevtushenko · Sep 26 '21 15:09