Summary

HIP's implementation of __sync* primitives don't conform to the CUDA specification of the functions. CUDA says:

"If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. If srcLane is outside the range [0:width-1], the value returned corresponds to the value of var held by the srcLane modulo width (i.e. within the same subsection)."

Source: https://docs.nvidia.com/cuda/archive/8.0/cuda-c-programming-guide/index.html#warp-shuffle-functions

However HIP's implementation is incorrect for width values other than 64, for example:

# CUDA results with width = 8
tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 0, shfl_val: 0
tid: 5, val: 5, shfl_src: 1, shfl_val: 1
tid: 6, val: 6, shfl_src: 2, shfl_val: 2
tid: 7, val: 7, shfl_src: 3, shfl_val: 3

# HIP results with width = 8
tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 0, shfl_val: 8
tid: 5, val: 5, shfl_src: 1, shfl_val: 9
tid: 6, val: 6, shfl_src: 2, shfl_val: 10
tid: 7, val: 7, shfl_src: 3, shfl_val: 11

See CUDA source code & HIP source code for the source code.

Culprit

The reason behind the incorrect values is the LLVM IR code generation step, it generates incorrect code for __sync*.

The LLVM IR code generated for __sync(value, threadId + delta, width) is:

  ; i32 %2 = delta, i32 %3 = width, %5 = shuffled value
  ; %8 = add i32 %5, %2

  %9 = tail call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %10 = tail call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %9)
  %11 = sub i32 0, %3
  %12 = and i32 %10, %11
  %13 = add nsw i32 %8, %12
  %14 = shl i32 %13, 2
  %15 = tail call i32 @llvm.amdgcn.ds.bpermute(i32 %14, i32 %5)

The correct implementation for __sync(%val, %srcLane, %width) would be:

  ; srcModWidth = (srcLane % width) = (srcLane & (width - 1))
  ; baseLane = (laneId / width) * width = (laneId & (-width))
  ; lane = srcModWidth + baseLane

  %0 = tail call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
  %laneId = tail call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %0)
  
  %1 = sub i32 %width, 1
  %srcModWidth = and i32 %srcLane, %1

  %2 = sub i32 0, %width
  %baseLane = and i32 %laneId, %2
  %lane = add nsw i32 %srcModWidth, %baseLane

  %laneAddr = shl i32 %lane, 2
  %__sync = tail call i32 @llvm.amdgcn.ds.bpermute(i32 %laneAddr, i32 %val)

CUDA

CUDA source code

#include <cuda_runtime.h>
#include <iostream>
#include <string>

using namespace std;

bool isPow2(size_t x) { return (x != 0) && ((x & (x - 1)) == 0); }

void __global__ shfl(int *x, int n, int delta, int width) {
  auto tid = threadIdx.x;
  if (tid < n) {
    // Shufle `tid` to the thread with id: `(tid + delta) % width`.
    auto val = __shfl_sync(0xFFFFFFFF, tid, tid + delta, width);
    x[tid] = val;
  }
}

int main(int argc, char **argv) {
  const size_t n = 64;
  const size_t delta = 4;
  size_t width = 8; // width parameter.
  if (argc > 1) {
    auto tmp = stoul(argv[1]);
    width = isPow2(tmp) ? tmp : width;
  }

  int *x{};

  auto status = cudaMallocManaged(&x, n * sizeof(int));

  shfl<<<1, n>>>(x, n, delta, width);
  cudaDeviceSynchronize();
  // Print results:
  for (size_t tid = 0; tid < n; ++tid) {
    // src id computation explanation:
    // "If width is less than warpSize then each subsection of the warp behaves
    // as a separate entity with a starting logical lane ID of 0. If srcLane is
    // outside the range [0:width-1], the value returned corresponds to the
    // value of var held by the srcLane modulo width (i.e. within the same
    // subsection)."
    // Source:
    // https://docs.nvidia.com/cuda/archive/8.0/cuda-c-programming-guide/index.html#warp-shuffle-functions
    auto src = ((tid + delta) % width) + (tid / width) * width;
    // Print the thread id, the value to be shuffled, the source id of the
    // incoming shuffled value, and the shuffled value.
    cerr << "tid: " << tid << ", val: " << tid << ", shfl_src: " << src
         << ", shfl_val: " << x[tid] << endl;
    if ((tid + 1) % width == 0)
      cerr << endl;
  }

  cudaFree(x);
  return 0;
}

Test bed:

Machine:
GPU: NVIDIA V100
CUDA: 11.8
Compile commnad: nvcc shfl.cu -o shfl.exe
Running command: ./shfl.exe <width>

CUDA results for `width = 8`

tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 0, shfl_val: 0
tid: 5, val: 5, shfl_src: 1, shfl_val: 1
tid: 6, val: 6, shfl_src: 2, shfl_val: 2
tid: 7, val: 7, shfl_src: 3, shfl_val: 3

tid: 8, val: 8, shfl_src: 12, shfl_val: 12
tid: 9, val: 9, shfl_src: 13, shfl_val: 13
tid: 10, val: 10, shfl_src: 14, shfl_val: 14
tid: 11, val: 11, shfl_src: 15, shfl_val: 15
tid: 12, val: 12, shfl_src: 8, shfl_val: 8
tid: 13, val: 13, shfl_src: 9, shfl_val: 9
tid: 14, val: 14, shfl_src: 10, shfl_val: 10
tid: 15, val: 15, shfl_src: 11, shfl_val: 11

tid: 16, val: 16, shfl_src: 20, shfl_val: 20
tid: 17, val: 17, shfl_src: 21, shfl_val: 21
tid: 18, val: 18, shfl_src: 22, shfl_val: 22
tid: 19, val: 19, shfl_src: 23, shfl_val: 23
tid: 20, val: 20, shfl_src: 16, shfl_val: 16
tid: 21, val: 21, shfl_src: 17, shfl_val: 17
tid: 22, val: 22, shfl_src: 18, shfl_val: 18
tid: 23, val: 23, shfl_src: 19, shfl_val: 19

tid: 24, val: 24, shfl_src: 28, shfl_val: 28
tid: 25, val: 25, shfl_src: 29, shfl_val: 29
tid: 26, val: 26, shfl_src: 30, shfl_val: 30
tid: 27, val: 27, shfl_src: 31, shfl_val: 31
tid: 28, val: 28, shfl_src: 24, shfl_val: 24
tid: 29, val: 29, shfl_src: 25, shfl_val: 25
tid: 30, val: 30, shfl_src: 26, shfl_val: 26
tid: 31, val: 31, shfl_src: 27, shfl_val: 27

tid: 32, val: 32, shfl_src: 36, shfl_val: 36
tid: 33, val: 33, shfl_src: 37, shfl_val: 37
tid: 34, val: 34, shfl_src: 38, shfl_val: 38
tid: 35, val: 35, shfl_src: 39, shfl_val: 39
tid: 36, val: 36, shfl_src: 32, shfl_val: 32
tid: 37, val: 37, shfl_src: 33, shfl_val: 33
tid: 38, val: 38, shfl_src: 34, shfl_val: 34
tid: 39, val: 39, shfl_src: 35, shfl_val: 35

tid: 40, val: 40, shfl_src: 44, shfl_val: 44
tid: 41, val: 41, shfl_src: 45, shfl_val: 45
tid: 42, val: 42, shfl_src: 46, shfl_val: 46
tid: 43, val: 43, shfl_src: 47, shfl_val: 47
tid: 44, val: 44, shfl_src: 40, shfl_val: 40
tid: 45, val: 45, shfl_src: 41, shfl_val: 41
tid: 46, val: 46, shfl_src: 42, shfl_val: 42
tid: 47, val: 47, shfl_src: 43, shfl_val: 43

tid: 48, val: 48, shfl_src: 52, shfl_val: 52
tid: 49, val: 49, shfl_src: 53, shfl_val: 53
tid: 50, val: 50, shfl_src: 54, shfl_val: 54
tid: 51, val: 51, shfl_src: 55, shfl_val: 55
tid: 52, val: 52, shfl_src: 48, shfl_val: 48
tid: 53, val: 53, shfl_src: 49, shfl_val: 49
tid: 54, val: 54, shfl_src: 50, shfl_val: 50
tid: 55, val: 55, shfl_src: 51, shfl_val: 51

tid: 56, val: 56, shfl_src: 60, shfl_val: 60
tid: 57, val: 57, shfl_src: 61, shfl_val: 61
tid: 58, val: 58, shfl_src: 62, shfl_val: 62
tid: 59, val: 59, shfl_src: 63, shfl_val: 63
tid: 60, val: 60, shfl_src: 56, shfl_val: 56
tid: 61, val: 61, shfl_src: 57, shfl_val: 57
tid: 62, val: 62, shfl_src: 58, shfl_val: 58
tid: 63, val: 63, shfl_src: 59, shfl_val: 59

CUDA results for `width = 32`

tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 8, shfl_val: 8
tid: 5, val: 5, shfl_src: 9, shfl_val: 9
tid: 6, val: 6, shfl_src: 10, shfl_val: 10
tid: 7, val: 7, shfl_src: 11, shfl_val: 11
tid: 8, val: 8, shfl_src: 12, shfl_val: 12
tid: 9, val: 9, shfl_src: 13, shfl_val: 13
tid: 10, val: 10, shfl_src: 14, shfl_val: 14
tid: 11, val: 11, shfl_src: 15, shfl_val: 15
tid: 12, val: 12, shfl_src: 16, shfl_val: 16
tid: 13, val: 13, shfl_src: 17, shfl_val: 17
tid: 14, val: 14, shfl_src: 18, shfl_val: 18
tid: 15, val: 15, shfl_src: 19, shfl_val: 19
tid: 16, val: 16, shfl_src: 20, shfl_val: 20
tid: 17, val: 17, shfl_src: 21, shfl_val: 21
tid: 18, val: 18, shfl_src: 22, shfl_val: 22
tid: 19, val: 19, shfl_src: 23, shfl_val: 23
tid: 20, val: 20, shfl_src: 24, shfl_val: 24
tid: 21, val: 21, shfl_src: 25, shfl_val: 25
tid: 22, val: 22, shfl_src: 26, shfl_val: 26
tid: 23, val: 23, shfl_src: 27, shfl_val: 27
tid: 24, val: 24, shfl_src: 28, shfl_val: 28
tid: 25, val: 25, shfl_src: 29, shfl_val: 29
tid: 26, val: 26, shfl_src: 30, shfl_val: 30
tid: 27, val: 27, shfl_src: 31, shfl_val: 31
tid: 28, val: 28, shfl_src: 0, shfl_val: 0
tid: 29, val: 29, shfl_src: 1, shfl_val: 1
tid: 30, val: 30, shfl_src: 2, shfl_val: 2
tid: 31, val: 31, shfl_src: 3, shfl_val: 3

tid: 32, val: 32, shfl_src: 36, shfl_val: 36
tid: 33, val: 33, shfl_src: 37, shfl_val: 37
tid: 34, val: 34, shfl_src: 38, shfl_val: 38
tid: 35, val: 35, shfl_src: 39, shfl_val: 39
tid: 36, val: 36, shfl_src: 40, shfl_val: 40
tid: 37, val: 37, shfl_src: 41, shfl_val: 41
tid: 38, val: 38, shfl_src: 42, shfl_val: 42
tid: 39, val: 39, shfl_src: 43, shfl_val: 43
tid: 40, val: 40, shfl_src: 44, shfl_val: 44
tid: 41, val: 41, shfl_src: 45, shfl_val: 45
tid: 42, val: 42, shfl_src: 46, shfl_val: 46
tid: 43, val: 43, shfl_src: 47, shfl_val: 47
tid: 44, val: 44, shfl_src: 48, shfl_val: 48
tid: 45, val: 45, shfl_src: 49, shfl_val: 49
tid: 46, val: 46, shfl_src: 50, shfl_val: 50
tid: 47, val: 47, shfl_src: 51, shfl_val: 51
tid: 48, val: 48, shfl_src: 52, shfl_val: 52
tid: 49, val: 49, shfl_src: 53, shfl_val: 53
tid: 50, val: 50, shfl_src: 54, shfl_val: 54
tid: 51, val: 51, shfl_src: 55, shfl_val: 55
tid: 52, val: 52, shfl_src: 56, shfl_val: 56
tid: 53, val: 53, shfl_src: 57, shfl_val: 57
tid: 54, val: 54, shfl_src: 58, shfl_val: 58
tid: 55, val: 55, shfl_src: 59, shfl_val: 59
tid: 56, val: 56, shfl_src: 60, shfl_val: 60
tid: 57, val: 57, shfl_src: 61, shfl_val: 61
tid: 58, val: 58, shfl_src: 62, shfl_val: 62
tid: 59, val: 59, shfl_src: 63, shfl_val: 63
tid: 60, val: 60, shfl_src: 32, shfl_val: 32
tid: 61, val: 61, shfl_src: 33, shfl_val: 33
tid: 62, val: 62, shfl_src: 34, shfl_val: 34
tid: 63, val: 63, shfl_src: 35, shfl_val: 35

HIP:

HIP source code:

For a explanation of the code see the section: CUDA source code;

#include <hip/hip_runtime.h>
#include <iostream>
#include <string>

using namespace std;

bool isPow2(size_t x) { return (x != 0) && ((x & (x - 1)) == 0); }

constexpr size_t width = 8; // width parameter.

void __global__ shfl(int *x, int n, int delta, int width) {
  auto tid = threadIdx.x;
  if (tid < n) {
    // Closest HIP funciton to CUDA's __shfl_sync
    auto val = __shfl(tid, tid + delta, width);
    x[tid] = val;
  }
}

int main(int argc, char **argv) {
  const size_t n = 64;
  const size_t delta = 4;
  size_t width = 8; // width parameter.
  if (argc > 1) {
    auto tmp = stoul(argv[1]);
    width = isPow2(tmp) ? tmp : width;
  }

  int *x{};

  auto status = hipMallocManaged(&x, n * sizeof(int));

  shfl<<<1, n>>>(x, n, delta, width);
  hipDeviceSynchronize();

  for (size_t tid = 0; tid < n; ++tid) {
    auto src = ((tid + delta) % width) + (tid / width) * width;
    cerr << "tid: " << tid << ", val: " << tid << ", shfl_src: " << src
         << ", shfl_val: " << x[tid] << endl;
    if ((tid + 1) % width == 0)
      cerr << endl;
  }

  hipFree(x);
  return 0;
}

Test bed:

Machine: Frontier ORNL
GPU: AMD MI250X
HIP: 5.4.3
Compile commnad: hipcc --offload-arch=gfx90a -x hip shfl.hip -o shfl.exe
Running command: srun -A $PROJ_ID -N 1 -t 00:00:05 ./shfl.exe <width>

HIP results for `width = 8`

tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 0, shfl_val: 8
tid: 5, val: 5, shfl_src: 1, shfl_val: 9
tid: 6, val: 6, shfl_src: 2, shfl_val: 10
tid: 7, val: 7, shfl_src: 3, shfl_val: 11

tid: 8, val: 8, shfl_src: 12, shfl_val: 20
tid: 9, val: 9, shfl_src: 13, shfl_val: 21
tid: 10, val: 10, shfl_src: 14, shfl_val: 22
tid: 11, val: 11, shfl_src: 15, shfl_val: 23
tid: 12, val: 12, shfl_src: 8, shfl_val: 24
tid: 13, val: 13, shfl_src: 9, shfl_val: 25
tid: 14, val: 14, shfl_src: 10, shfl_val: 26
tid: 15, val: 15, shfl_src: 11, shfl_val: 27

tid: 16, val: 16, shfl_src: 20, shfl_val: 36
tid: 17, val: 17, shfl_src: 21, shfl_val: 37
tid: 18, val: 18, shfl_src: 22, shfl_val: 38
tid: 19, val: 19, shfl_src: 23, shfl_val: 39
tid: 20, val: 20, shfl_src: 16, shfl_val: 40
tid: 21, val: 21, shfl_src: 17, shfl_val: 41
tid: 22, val: 22, shfl_src: 18, shfl_val: 42
tid: 23, val: 23, shfl_src: 19, shfl_val: 43

tid: 24, val: 24, shfl_src: 28, shfl_val: 52
tid: 25, val: 25, shfl_src: 29, shfl_val: 53
tid: 26, val: 26, shfl_src: 30, shfl_val: 54
tid: 27, val: 27, shfl_src: 31, shfl_val: 55
tid: 28, val: 28, shfl_src: 24, shfl_val: 56
tid: 29, val: 29, shfl_src: 25, shfl_val: 57
tid: 30, val: 30, shfl_src: 26, shfl_val: 58
tid: 31, val: 31, shfl_src: 27, shfl_val: 59

tid: 32, val: 32, shfl_src: 36, shfl_val: 4
tid: 33, val: 33, shfl_src: 37, shfl_val: 5
tid: 34, val: 34, shfl_src: 38, shfl_val: 6
tid: 35, val: 35, shfl_src: 39, shfl_val: 7
tid: 36, val: 36, shfl_src: 32, shfl_val: 8
tid: 37, val: 37, shfl_src: 33, shfl_val: 9
tid: 38, val: 38, shfl_src: 34, shfl_val: 10
tid: 39, val: 39, shfl_src: 35, shfl_val: 11

tid: 40, val: 40, shfl_src: 44, shfl_val: 20
tid: 41, val: 41, shfl_src: 45, shfl_val: 21
tid: 42, val: 42, shfl_src: 46, shfl_val: 22
tid: 43, val: 43, shfl_src: 47, shfl_val: 23
tid: 44, val: 44, shfl_src: 40, shfl_val: 24
tid: 45, val: 45, shfl_src: 41, shfl_val: 25
tid: 46, val: 46, shfl_src: 42, shfl_val: 26
tid: 47, val: 47, shfl_src: 43, shfl_val: 27

tid: 48, val: 48, shfl_src: 52, shfl_val: 36
tid: 49, val: 49, shfl_src: 53, shfl_val: 37
tid: 50, val: 50, shfl_src: 54, shfl_val: 38
tid: 51, val: 51, shfl_src: 55, shfl_val: 39
tid: 52, val: 52, shfl_src: 48, shfl_val: 40
tid: 53, val: 53, shfl_src: 49, shfl_val: 41
tid: 54, val: 54, shfl_src: 50, shfl_val: 42
tid: 55, val: 55, shfl_src: 51, shfl_val: 43

tid: 56, val: 56, shfl_src: 60, shfl_val: 52
tid: 57, val: 57, shfl_src: 61, shfl_val: 53
tid: 58, val: 58, shfl_src: 62, shfl_val: 54
tid: 59, val: 59, shfl_src: 63, shfl_val: 55
tid: 60, val: 60, shfl_src: 56, shfl_val: 56
tid: 61, val: 61, shfl_src: 57, shfl_val: 57
tid: 62, val: 62, shfl_src: 58, shfl_val: 58
tid: 63, val: 63, shfl_src: 59, shfl_val: 59

HIP results for `width = 32`

tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 8, shfl_val: 8
tid: 5, val: 5, shfl_src: 9, shfl_val: 9
tid: 6, val: 6, shfl_src: 10, shfl_val: 10
tid: 7, val: 7, shfl_src: 11, shfl_val: 11
tid: 8, val: 8, shfl_src: 12, shfl_val: 12
tid: 9, val: 9, shfl_src: 13, shfl_val: 13
tid: 10, val: 10, shfl_src: 14, shfl_val: 14
tid: 11, val: 11, shfl_src: 15, shfl_val: 15
tid: 12, val: 12, shfl_src: 16, shfl_val: 16
tid: 13, val: 13, shfl_src: 17, shfl_val: 17
tid: 14, val: 14, shfl_src: 18, shfl_val: 18
tid: 15, val: 15, shfl_src: 19, shfl_val: 19
tid: 16, val: 16, shfl_src: 20, shfl_val: 20
tid: 17, val: 17, shfl_src: 21, shfl_val: 21
tid: 18, val: 18, shfl_src: 22, shfl_val: 22
tid: 19, val: 19, shfl_src: 23, shfl_val: 23
tid: 20, val: 20, shfl_src: 24, shfl_val: 24
tid: 21, val: 21, shfl_src: 25, shfl_val: 25
tid: 22, val: 22, shfl_src: 26, shfl_val: 26
tid: 23, val: 23, shfl_src: 27, shfl_val: 27
tid: 24, val: 24, shfl_src: 28, shfl_val: 28
tid: 25, val: 25, shfl_src: 29, shfl_val: 29
tid: 26, val: 26, shfl_src: 30, shfl_val: 30
tid: 27, val: 27, shfl_src: 31, shfl_val: 31
tid: 28, val: 28, shfl_src: 0, shfl_val: 32
tid: 29, val: 29, shfl_src: 1, shfl_val: 33
tid: 30, val: 30, shfl_src: 2, shfl_val: 34
tid: 31, val: 31, shfl_src: 3, shfl_val: 35

tid: 32, val: 32, shfl_src: 36, shfl_val: 4
tid: 33, val: 33, shfl_src: 37, shfl_val: 5
tid: 34, val: 34, shfl_src: 38, shfl_val: 6
tid: 35, val: 35, shfl_src: 39, shfl_val: 7
tid: 36, val: 36, shfl_src: 40, shfl_val: 8
tid: 37, val: 37, shfl_src: 41, shfl_val: 9
tid: 38, val: 38, shfl_src: 42, shfl_val: 10
tid: 39, val: 39, shfl_src: 43, shfl_val: 11
tid: 40, val: 40, shfl_src: 44, shfl_val: 12
tid: 41, val: 41, shfl_src: 45, shfl_val: 13
tid: 42, val: 42, shfl_src: 46, shfl_val: 14
tid: 43, val: 43, shfl_src: 47, shfl_val: 15
tid: 44, val: 44, shfl_src: 48, shfl_val: 16
tid: 45, val: 45, shfl_src: 49, shfl_val: 17
tid: 46, val: 46, shfl_src: 50, shfl_val: 18
tid: 47, val: 47, shfl_src: 51, shfl_val: 19
tid: 48, val: 48, shfl_src: 52, shfl_val: 20
tid: 49, val: 49, shfl_src: 53, shfl_val: 21
tid: 50, val: 50, shfl_src: 54, shfl_val: 22
tid: 51, val: 51, shfl_src: 55, shfl_val: 23
tid: 52, val: 52, shfl_src: 56, shfl_val: 24
tid: 53, val: 53, shfl_src: 57, shfl_val: 25
tid: 54, val: 54, shfl_src: 58, shfl_val: 26
tid: 55, val: 55, shfl_src: 59, shfl_val: 27
tid: 56, val: 56, shfl_src: 60, shfl_val: 28
tid: 57, val: 57, shfl_src: 61, shfl_val: 29
tid: 58, val: 58, shfl_src: 62, shfl_val: 30
tid: 59, val: 59, shfl_src: 63, shfl_val: 31
tid: 60, val: 60, shfl_src: 32, shfl_val: 32
tid: 61, val: 61, shfl_src: 33, shfl_val: 33
tid: 62, val: 62, shfl_src: 34, shfl_val: 34
tid: 63, val: 63, shfl_src: 35, shfl_val: 35

HIP results for `width = 64`

tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 8, shfl_val: 8
tid: 5, val: 5, shfl_src: 9, shfl_val: 9
tid: 6, val: 6, shfl_src: 10, shfl_val: 10
tid: 7, val: 7, shfl_src: 11, shfl_val: 11
tid: 8, val: 8, shfl_src: 12, shfl_val: 12
tid: 9, val: 9, shfl_src: 13, shfl_val: 13
tid: 10, val: 10, shfl_src: 14, shfl_val: 14
tid: 11, val: 11, shfl_src: 15, shfl_val: 15
tid: 12, val: 12, shfl_src: 16, shfl_val: 16
tid: 13, val: 13, shfl_src: 17, shfl_val: 17
tid: 14, val: 14, shfl_src: 18, shfl_val: 18
tid: 15, val: 15, shfl_src: 19, shfl_val: 19
tid: 16, val: 16, shfl_src: 20, shfl_val: 20
tid: 17, val: 17, shfl_src: 21, shfl_val: 21
tid: 18, val: 18, shfl_src: 22, shfl_val: 22
tid: 19, val: 19, shfl_src: 23, shfl_val: 23
tid: 20, val: 20, shfl_src: 24, shfl_val: 24
tid: 21, val: 21, shfl_src: 25, shfl_val: 25
tid: 22, val: 22, shfl_src: 26, shfl_val: 26
tid: 23, val: 23, shfl_src: 27, shfl_val: 27
tid: 24, val: 24, shfl_src: 28, shfl_val: 28
tid: 25, val: 25, shfl_src: 29, shfl_val: 29
tid: 26, val: 26, shfl_src: 30, shfl_val: 30
tid: 27, val: 27, shfl_src: 31, shfl_val: 31
tid: 28, val: 28, shfl_src: 32, shfl_val: 32
tid: 29, val: 29, shfl_src: 33, shfl_val: 33
tid: 30, val: 30, shfl_src: 34, shfl_val: 34
tid: 31, val: 31, shfl_src: 35, shfl_val: 35
tid: 32, val: 32, shfl_src: 36, shfl_val: 36
tid: 33, val: 33, shfl_src: 37, shfl_val: 37
tid: 34, val: 34, shfl_src: 38, shfl_val: 38
tid: 35, val: 35, shfl_src: 39, shfl_val: 39
tid: 36, val: 36, shfl_src: 40, shfl_val: 40
tid: 37, val: 37, shfl_src: 41, shfl_val: 41
tid: 38, val: 38, shfl_src: 42, shfl_val: 42
tid: 39, val: 39, shfl_src: 43, shfl_val: 43
tid: 40, val: 40, shfl_src: 44, shfl_val: 44
tid: 41, val: 41, shfl_src: 45, shfl_val: 45
tid: 42, val: 42, shfl_src: 46, shfl_val: 46
tid: 43, val: 43, shfl_src: 47, shfl_val: 47
tid: 44, val: 44, shfl_src: 48, shfl_val: 48
tid: 45, val: 45, shfl_src: 49, shfl_val: 49
tid: 46, val: 46, shfl_src: 50, shfl_val: 50
tid: 47, val: 47, shfl_src: 51, shfl_val: 51
tid: 48, val: 48, shfl_src: 52, shfl_val: 52
tid: 49, val: 49, shfl_src: 53, shfl_val: 53
tid: 50, val: 50, shfl_src: 54, shfl_val: 54
tid: 51, val: 51, shfl_src: 55, shfl_val: 55
tid: 52, val: 52, shfl_src: 56, shfl_val: 56
tid: 53, val: 53, shfl_src: 57, shfl_val: 57
tid: 54, val: 54, shfl_src: 58, shfl_val: 58
tid: 55, val: 55, shfl_src: 59, shfl_val: 59
tid: 56, val: 56, shfl_src: 60, shfl_val: 60
tid: 57, val: 57, shfl_src: 61, shfl_val: 61
tid: 58, val: 58, shfl_src: 62, shfl_val: 62
tid: 59, val: 59, shfl_src: 63, shfl_val: 63
tid: 60, val: 60, shfl_src: 0, shfl_val: 0
tid: 61, val: 61, shfl_src: 1, shfl_val: 1
tid: 62, val: 62, shfl_src: 2, shfl_val: 2
tid: 63, val: 63, shfl_src: 3, shfl_val: 3

Apr 23 '23 15:04 fabianmcg

Rabian, thanks for the report. Tagged one of my colleagues. Siuchi.

Apr 23 '23 19:04 ronlieb

It seems to be a HIP header issue

https://github.com/ROCm-Developer-Tools/hipamd/blob/b242cbcaa52e1ee9293382996c6573d2b9f3601a/include/hip/amd_detail/amd_warp_functions.h#L90

index is calculated incorrectly.

Apr 24 '23 13:04 yxsamliu

It seems index should be:

int index = src_lane%width + (self & ~(width-1));

Apr 24 '23 13:04 yxsamliu

It never occurred to me check if these were implemented at a higher level, that's even easier to fix. I think this is issue is present in all __sync* intrinsics. I think the index in this case should be:

(srcLane & (width - 1)) + (laneId & (-width)) = (srcLane % width) + ((laneId / width) * width)

Apr 24 '23 13:04 fabianmcg

integer division is expansive. Since width must be power of 2, I assume laneId/width*width is equivalent to laneId &~(width-1) ?

Apr 24 '23 14:04 yxsamliu

Yeah, I think (laneId & (-width)) = laneId &~(width-1) = laneId / width * width.

Apr 24 '23 14:04 fabianmcg

@yxsamliu , has it been fixed?

Sep 02 '23 06:09 Epliz

it has been fixed https://github.com/ROCm/clr/commit/23e99dbb0785042c8f548f1dd7bbbb7554ed825b

Apr 12 '24 16:04 yxsamliu

HIP
HIP copied to clipboard

__sync codegen is incorrect wrt CUDA spec in ROCM 5.4.3

Summary

Culprit

CUDA

CUDA source code

Test bed:

CUDA results for `width = 8`

CUDA results for `width = 32`

HIP:

HIP source code:

Test bed:

HIP results for `width = 8`

HIP results for `width = 32`

HIP results for `width = 64`

HIP HIP copied to clipboard

__sync codegen is incorrect wrt CUDA spec in ROCM 5.4.3

Summary

Culprit

CUDA

CUDA source code

Test bed:

CUDA results for width = 8

CUDA results for width = 32

HIP:

HIP source code:

Test bed:

HIP results for width = 8

HIP results for width = 32

HIP results for width = 64

HIP
HIP copied to clipboard

CUDA results for `width = 8`

CUDA results for `width = 32`

HIP results for `width = 8`

HIP results for `width = 32`

HIP results for `width = 64`