HIP
HIP copied to clipboard
__sync codegen is incorrect wrt CUDA spec in ROCM 5.4.3
Summary
HIP's implementation of __sync*
primitives don't conform to the CUDA specification of the functions. CUDA says:
"If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. If srcLane is outside the range [0:width-1], the value returned corresponds to the value of var held by the srcLane modulo width (i.e. within the same subsection)."
Source: https://docs.nvidia.com/cuda/archive/8.0/cuda-c-programming-guide/index.html#warp-shuffle-functions
However HIP's implementation is incorrect for width values other than 64
, for example:
# CUDA results with width = 8
tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 0, shfl_val: 0
tid: 5, val: 5, shfl_src: 1, shfl_val: 1
tid: 6, val: 6, shfl_src: 2, shfl_val: 2
tid: 7, val: 7, shfl_src: 3, shfl_val: 3
# HIP results with width = 8
tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 0, shfl_val: 8
tid: 5, val: 5, shfl_src: 1, shfl_val: 9
tid: 6, val: 6, shfl_src: 2, shfl_val: 10
tid: 7, val: 7, shfl_src: 3, shfl_val: 11
See CUDA source code & HIP source code for the source code.
Culprit
The reason behind the incorrect values is the LLVM IR code generation step, it generates incorrect code for __sync*
.
The LLVM IR code generated for __sync(value, threadId + delta, width)
is:
; i32 %2 = delta, i32 %3 = width, %5 = shuffled value
; %8 = add i32 %5, %2
%9 = tail call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
%10 = tail call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %9)
%11 = sub i32 0, %3
%12 = and i32 %10, %11
%13 = add nsw i32 %8, %12
%14 = shl i32 %13, 2
%15 = tail call i32 @llvm.amdgcn.ds.bpermute(i32 %14, i32 %5)
The correct implementation for __sync(%val, %srcLane, %width)
would be:
; srcModWidth = (srcLane % width) = (srcLane & (width - 1))
; baseLane = (laneId / width) * width = (laneId & (-width))
; lane = srcModWidth + baseLane
%0 = tail call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
%laneId = tail call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %0)
%1 = sub i32 %width, 1
%srcModWidth = and i32 %srcLane, %1
%2 = sub i32 0, %width
%baseLane = and i32 %laneId, %2
%lane = add nsw i32 %srcModWidth, %baseLane
%laneAddr = shl i32 %lane, 2
%__sync = tail call i32 @llvm.amdgcn.ds.bpermute(i32 %laneAddr, i32 %val)
CUDA
CUDA source code
#include <cuda_runtime.h>
#include <iostream>
#include <string>
using namespace std;
bool isPow2(size_t x) { return (x != 0) && ((x & (x - 1)) == 0); }
void __global__ shfl(int *x, int n, int delta, int width) {
auto tid = threadIdx.x;
if (tid < n) {
// Shufle `tid` to the thread with id: `(tid + delta) % width`.
auto val = __shfl_sync(0xFFFFFFFF, tid, tid + delta, width);
x[tid] = val;
}
}
int main(int argc, char **argv) {
const size_t n = 64;
const size_t delta = 4;
size_t width = 8; // width parameter.
if (argc > 1) {
auto tmp = stoul(argv[1]);
width = isPow2(tmp) ? tmp : width;
}
int *x{};
auto status = cudaMallocManaged(&x, n * sizeof(int));
shfl<<<1, n>>>(x, n, delta, width);
cudaDeviceSynchronize();
// Print results:
for (size_t tid = 0; tid < n; ++tid) {
// src id computation explanation:
// "If width is less than warpSize then each subsection of the warp behaves
// as a separate entity with a starting logical lane ID of 0. If srcLane is
// outside the range [0:width-1], the value returned corresponds to the
// value of var held by the srcLane modulo width (i.e. within the same
// subsection)."
// Source:
// https://docs.nvidia.com/cuda/archive/8.0/cuda-c-programming-guide/index.html#warp-shuffle-functions
auto src = ((tid + delta) % width) + (tid / width) * width;
// Print the thread id, the value to be shuffled, the source id of the
// incoming shuffled value, and the shuffled value.
cerr << "tid: " << tid << ", val: " << tid << ", shfl_src: " << src
<< ", shfl_val: " << x[tid] << endl;
if ((tid + 1) % width == 0)
cerr << endl;
}
cudaFree(x);
return 0;
}
Test bed:
- Machine:
- GPU: NVIDIA V100
- CUDA: 11.8
- Compile commnad:
nvcc shfl.cu -o shfl.exe
- Running command:
./shfl.exe <width>
CUDA results for width = 8
tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 0, shfl_val: 0
tid: 5, val: 5, shfl_src: 1, shfl_val: 1
tid: 6, val: 6, shfl_src: 2, shfl_val: 2
tid: 7, val: 7, shfl_src: 3, shfl_val: 3
tid: 8, val: 8, shfl_src: 12, shfl_val: 12
tid: 9, val: 9, shfl_src: 13, shfl_val: 13
tid: 10, val: 10, shfl_src: 14, shfl_val: 14
tid: 11, val: 11, shfl_src: 15, shfl_val: 15
tid: 12, val: 12, shfl_src: 8, shfl_val: 8
tid: 13, val: 13, shfl_src: 9, shfl_val: 9
tid: 14, val: 14, shfl_src: 10, shfl_val: 10
tid: 15, val: 15, shfl_src: 11, shfl_val: 11
tid: 16, val: 16, shfl_src: 20, shfl_val: 20
tid: 17, val: 17, shfl_src: 21, shfl_val: 21
tid: 18, val: 18, shfl_src: 22, shfl_val: 22
tid: 19, val: 19, shfl_src: 23, shfl_val: 23
tid: 20, val: 20, shfl_src: 16, shfl_val: 16
tid: 21, val: 21, shfl_src: 17, shfl_val: 17
tid: 22, val: 22, shfl_src: 18, shfl_val: 18
tid: 23, val: 23, shfl_src: 19, shfl_val: 19
tid: 24, val: 24, shfl_src: 28, shfl_val: 28
tid: 25, val: 25, shfl_src: 29, shfl_val: 29
tid: 26, val: 26, shfl_src: 30, shfl_val: 30
tid: 27, val: 27, shfl_src: 31, shfl_val: 31
tid: 28, val: 28, shfl_src: 24, shfl_val: 24
tid: 29, val: 29, shfl_src: 25, shfl_val: 25
tid: 30, val: 30, shfl_src: 26, shfl_val: 26
tid: 31, val: 31, shfl_src: 27, shfl_val: 27
tid: 32, val: 32, shfl_src: 36, shfl_val: 36
tid: 33, val: 33, shfl_src: 37, shfl_val: 37
tid: 34, val: 34, shfl_src: 38, shfl_val: 38
tid: 35, val: 35, shfl_src: 39, shfl_val: 39
tid: 36, val: 36, shfl_src: 32, shfl_val: 32
tid: 37, val: 37, shfl_src: 33, shfl_val: 33
tid: 38, val: 38, shfl_src: 34, shfl_val: 34
tid: 39, val: 39, shfl_src: 35, shfl_val: 35
tid: 40, val: 40, shfl_src: 44, shfl_val: 44
tid: 41, val: 41, shfl_src: 45, shfl_val: 45
tid: 42, val: 42, shfl_src: 46, shfl_val: 46
tid: 43, val: 43, shfl_src: 47, shfl_val: 47
tid: 44, val: 44, shfl_src: 40, shfl_val: 40
tid: 45, val: 45, shfl_src: 41, shfl_val: 41
tid: 46, val: 46, shfl_src: 42, shfl_val: 42
tid: 47, val: 47, shfl_src: 43, shfl_val: 43
tid: 48, val: 48, shfl_src: 52, shfl_val: 52
tid: 49, val: 49, shfl_src: 53, shfl_val: 53
tid: 50, val: 50, shfl_src: 54, shfl_val: 54
tid: 51, val: 51, shfl_src: 55, shfl_val: 55
tid: 52, val: 52, shfl_src: 48, shfl_val: 48
tid: 53, val: 53, shfl_src: 49, shfl_val: 49
tid: 54, val: 54, shfl_src: 50, shfl_val: 50
tid: 55, val: 55, shfl_src: 51, shfl_val: 51
tid: 56, val: 56, shfl_src: 60, shfl_val: 60
tid: 57, val: 57, shfl_src: 61, shfl_val: 61
tid: 58, val: 58, shfl_src: 62, shfl_val: 62
tid: 59, val: 59, shfl_src: 63, shfl_val: 63
tid: 60, val: 60, shfl_src: 56, shfl_val: 56
tid: 61, val: 61, shfl_src: 57, shfl_val: 57
tid: 62, val: 62, shfl_src: 58, shfl_val: 58
tid: 63, val: 63, shfl_src: 59, shfl_val: 59
CUDA results for width = 32
tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 8, shfl_val: 8
tid: 5, val: 5, shfl_src: 9, shfl_val: 9
tid: 6, val: 6, shfl_src: 10, shfl_val: 10
tid: 7, val: 7, shfl_src: 11, shfl_val: 11
tid: 8, val: 8, shfl_src: 12, shfl_val: 12
tid: 9, val: 9, shfl_src: 13, shfl_val: 13
tid: 10, val: 10, shfl_src: 14, shfl_val: 14
tid: 11, val: 11, shfl_src: 15, shfl_val: 15
tid: 12, val: 12, shfl_src: 16, shfl_val: 16
tid: 13, val: 13, shfl_src: 17, shfl_val: 17
tid: 14, val: 14, shfl_src: 18, shfl_val: 18
tid: 15, val: 15, shfl_src: 19, shfl_val: 19
tid: 16, val: 16, shfl_src: 20, shfl_val: 20
tid: 17, val: 17, shfl_src: 21, shfl_val: 21
tid: 18, val: 18, shfl_src: 22, shfl_val: 22
tid: 19, val: 19, shfl_src: 23, shfl_val: 23
tid: 20, val: 20, shfl_src: 24, shfl_val: 24
tid: 21, val: 21, shfl_src: 25, shfl_val: 25
tid: 22, val: 22, shfl_src: 26, shfl_val: 26
tid: 23, val: 23, shfl_src: 27, shfl_val: 27
tid: 24, val: 24, shfl_src: 28, shfl_val: 28
tid: 25, val: 25, shfl_src: 29, shfl_val: 29
tid: 26, val: 26, shfl_src: 30, shfl_val: 30
tid: 27, val: 27, shfl_src: 31, shfl_val: 31
tid: 28, val: 28, shfl_src: 0, shfl_val: 0
tid: 29, val: 29, shfl_src: 1, shfl_val: 1
tid: 30, val: 30, shfl_src: 2, shfl_val: 2
tid: 31, val: 31, shfl_src: 3, shfl_val: 3
tid: 32, val: 32, shfl_src: 36, shfl_val: 36
tid: 33, val: 33, shfl_src: 37, shfl_val: 37
tid: 34, val: 34, shfl_src: 38, shfl_val: 38
tid: 35, val: 35, shfl_src: 39, shfl_val: 39
tid: 36, val: 36, shfl_src: 40, shfl_val: 40
tid: 37, val: 37, shfl_src: 41, shfl_val: 41
tid: 38, val: 38, shfl_src: 42, shfl_val: 42
tid: 39, val: 39, shfl_src: 43, shfl_val: 43
tid: 40, val: 40, shfl_src: 44, shfl_val: 44
tid: 41, val: 41, shfl_src: 45, shfl_val: 45
tid: 42, val: 42, shfl_src: 46, shfl_val: 46
tid: 43, val: 43, shfl_src: 47, shfl_val: 47
tid: 44, val: 44, shfl_src: 48, shfl_val: 48
tid: 45, val: 45, shfl_src: 49, shfl_val: 49
tid: 46, val: 46, shfl_src: 50, shfl_val: 50
tid: 47, val: 47, shfl_src: 51, shfl_val: 51
tid: 48, val: 48, shfl_src: 52, shfl_val: 52
tid: 49, val: 49, shfl_src: 53, shfl_val: 53
tid: 50, val: 50, shfl_src: 54, shfl_val: 54
tid: 51, val: 51, shfl_src: 55, shfl_val: 55
tid: 52, val: 52, shfl_src: 56, shfl_val: 56
tid: 53, val: 53, shfl_src: 57, shfl_val: 57
tid: 54, val: 54, shfl_src: 58, shfl_val: 58
tid: 55, val: 55, shfl_src: 59, shfl_val: 59
tid: 56, val: 56, shfl_src: 60, shfl_val: 60
tid: 57, val: 57, shfl_src: 61, shfl_val: 61
tid: 58, val: 58, shfl_src: 62, shfl_val: 62
tid: 59, val: 59, shfl_src: 63, shfl_val: 63
tid: 60, val: 60, shfl_src: 32, shfl_val: 32
tid: 61, val: 61, shfl_src: 33, shfl_val: 33
tid: 62, val: 62, shfl_src: 34, shfl_val: 34
tid: 63, val: 63, shfl_src: 35, shfl_val: 35
HIP:
HIP source code:
For a explanation of the code see the section: CUDA source code;
#include <hip/hip_runtime.h>
#include <iostream>
#include <string>
using namespace std;
bool isPow2(size_t x) { return (x != 0) && ((x & (x - 1)) == 0); }
constexpr size_t width = 8; // width parameter.
void __global__ shfl(int *x, int n, int delta, int width) {
auto tid = threadIdx.x;
if (tid < n) {
// Closest HIP funciton to CUDA's __shfl_sync
auto val = __shfl(tid, tid + delta, width);
x[tid] = val;
}
}
int main(int argc, char **argv) {
const size_t n = 64;
const size_t delta = 4;
size_t width = 8; // width parameter.
if (argc > 1) {
auto tmp = stoul(argv[1]);
width = isPow2(tmp) ? tmp : width;
}
int *x{};
auto status = hipMallocManaged(&x, n * sizeof(int));
shfl<<<1, n>>>(x, n, delta, width);
hipDeviceSynchronize();
for (size_t tid = 0; tid < n; ++tid) {
auto src = ((tid + delta) % width) + (tid / width) * width;
cerr << "tid: " << tid << ", val: " << tid << ", shfl_src: " << src
<< ", shfl_val: " << x[tid] << endl;
if ((tid + 1) % width == 0)
cerr << endl;
}
hipFree(x);
return 0;
}
Test bed:
- Machine: Frontier ORNL
- GPU: AMD MI250X
- HIP: 5.4.3
- Compile commnad:
hipcc --offload-arch=gfx90a -x hip shfl.hip -o shfl.exe
- Running command:
srun -A $PROJ_ID -N 1 -t 00:00:05 ./shfl.exe <width>
HIP results for width = 8
tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 0, shfl_val: 8
tid: 5, val: 5, shfl_src: 1, shfl_val: 9
tid: 6, val: 6, shfl_src: 2, shfl_val: 10
tid: 7, val: 7, shfl_src: 3, shfl_val: 11
tid: 8, val: 8, shfl_src: 12, shfl_val: 20
tid: 9, val: 9, shfl_src: 13, shfl_val: 21
tid: 10, val: 10, shfl_src: 14, shfl_val: 22
tid: 11, val: 11, shfl_src: 15, shfl_val: 23
tid: 12, val: 12, shfl_src: 8, shfl_val: 24
tid: 13, val: 13, shfl_src: 9, shfl_val: 25
tid: 14, val: 14, shfl_src: 10, shfl_val: 26
tid: 15, val: 15, shfl_src: 11, shfl_val: 27
tid: 16, val: 16, shfl_src: 20, shfl_val: 36
tid: 17, val: 17, shfl_src: 21, shfl_val: 37
tid: 18, val: 18, shfl_src: 22, shfl_val: 38
tid: 19, val: 19, shfl_src: 23, shfl_val: 39
tid: 20, val: 20, shfl_src: 16, shfl_val: 40
tid: 21, val: 21, shfl_src: 17, shfl_val: 41
tid: 22, val: 22, shfl_src: 18, shfl_val: 42
tid: 23, val: 23, shfl_src: 19, shfl_val: 43
tid: 24, val: 24, shfl_src: 28, shfl_val: 52
tid: 25, val: 25, shfl_src: 29, shfl_val: 53
tid: 26, val: 26, shfl_src: 30, shfl_val: 54
tid: 27, val: 27, shfl_src: 31, shfl_val: 55
tid: 28, val: 28, shfl_src: 24, shfl_val: 56
tid: 29, val: 29, shfl_src: 25, shfl_val: 57
tid: 30, val: 30, shfl_src: 26, shfl_val: 58
tid: 31, val: 31, shfl_src: 27, shfl_val: 59
tid: 32, val: 32, shfl_src: 36, shfl_val: 4
tid: 33, val: 33, shfl_src: 37, shfl_val: 5
tid: 34, val: 34, shfl_src: 38, shfl_val: 6
tid: 35, val: 35, shfl_src: 39, shfl_val: 7
tid: 36, val: 36, shfl_src: 32, shfl_val: 8
tid: 37, val: 37, shfl_src: 33, shfl_val: 9
tid: 38, val: 38, shfl_src: 34, shfl_val: 10
tid: 39, val: 39, shfl_src: 35, shfl_val: 11
tid: 40, val: 40, shfl_src: 44, shfl_val: 20
tid: 41, val: 41, shfl_src: 45, shfl_val: 21
tid: 42, val: 42, shfl_src: 46, shfl_val: 22
tid: 43, val: 43, shfl_src: 47, shfl_val: 23
tid: 44, val: 44, shfl_src: 40, shfl_val: 24
tid: 45, val: 45, shfl_src: 41, shfl_val: 25
tid: 46, val: 46, shfl_src: 42, shfl_val: 26
tid: 47, val: 47, shfl_src: 43, shfl_val: 27
tid: 48, val: 48, shfl_src: 52, shfl_val: 36
tid: 49, val: 49, shfl_src: 53, shfl_val: 37
tid: 50, val: 50, shfl_src: 54, shfl_val: 38
tid: 51, val: 51, shfl_src: 55, shfl_val: 39
tid: 52, val: 52, shfl_src: 48, shfl_val: 40
tid: 53, val: 53, shfl_src: 49, shfl_val: 41
tid: 54, val: 54, shfl_src: 50, shfl_val: 42
tid: 55, val: 55, shfl_src: 51, shfl_val: 43
tid: 56, val: 56, shfl_src: 60, shfl_val: 52
tid: 57, val: 57, shfl_src: 61, shfl_val: 53
tid: 58, val: 58, shfl_src: 62, shfl_val: 54
tid: 59, val: 59, shfl_src: 63, shfl_val: 55
tid: 60, val: 60, shfl_src: 56, shfl_val: 56
tid: 61, val: 61, shfl_src: 57, shfl_val: 57
tid: 62, val: 62, shfl_src: 58, shfl_val: 58
tid: 63, val: 63, shfl_src: 59, shfl_val: 59
HIP results for width = 32
tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 8, shfl_val: 8
tid: 5, val: 5, shfl_src: 9, shfl_val: 9
tid: 6, val: 6, shfl_src: 10, shfl_val: 10
tid: 7, val: 7, shfl_src: 11, shfl_val: 11
tid: 8, val: 8, shfl_src: 12, shfl_val: 12
tid: 9, val: 9, shfl_src: 13, shfl_val: 13
tid: 10, val: 10, shfl_src: 14, shfl_val: 14
tid: 11, val: 11, shfl_src: 15, shfl_val: 15
tid: 12, val: 12, shfl_src: 16, shfl_val: 16
tid: 13, val: 13, shfl_src: 17, shfl_val: 17
tid: 14, val: 14, shfl_src: 18, shfl_val: 18
tid: 15, val: 15, shfl_src: 19, shfl_val: 19
tid: 16, val: 16, shfl_src: 20, shfl_val: 20
tid: 17, val: 17, shfl_src: 21, shfl_val: 21
tid: 18, val: 18, shfl_src: 22, shfl_val: 22
tid: 19, val: 19, shfl_src: 23, shfl_val: 23
tid: 20, val: 20, shfl_src: 24, shfl_val: 24
tid: 21, val: 21, shfl_src: 25, shfl_val: 25
tid: 22, val: 22, shfl_src: 26, shfl_val: 26
tid: 23, val: 23, shfl_src: 27, shfl_val: 27
tid: 24, val: 24, shfl_src: 28, shfl_val: 28
tid: 25, val: 25, shfl_src: 29, shfl_val: 29
tid: 26, val: 26, shfl_src: 30, shfl_val: 30
tid: 27, val: 27, shfl_src: 31, shfl_val: 31
tid: 28, val: 28, shfl_src: 0, shfl_val: 32
tid: 29, val: 29, shfl_src: 1, shfl_val: 33
tid: 30, val: 30, shfl_src: 2, shfl_val: 34
tid: 31, val: 31, shfl_src: 3, shfl_val: 35
tid: 32, val: 32, shfl_src: 36, shfl_val: 4
tid: 33, val: 33, shfl_src: 37, shfl_val: 5
tid: 34, val: 34, shfl_src: 38, shfl_val: 6
tid: 35, val: 35, shfl_src: 39, shfl_val: 7
tid: 36, val: 36, shfl_src: 40, shfl_val: 8
tid: 37, val: 37, shfl_src: 41, shfl_val: 9
tid: 38, val: 38, shfl_src: 42, shfl_val: 10
tid: 39, val: 39, shfl_src: 43, shfl_val: 11
tid: 40, val: 40, shfl_src: 44, shfl_val: 12
tid: 41, val: 41, shfl_src: 45, shfl_val: 13
tid: 42, val: 42, shfl_src: 46, shfl_val: 14
tid: 43, val: 43, shfl_src: 47, shfl_val: 15
tid: 44, val: 44, shfl_src: 48, shfl_val: 16
tid: 45, val: 45, shfl_src: 49, shfl_val: 17
tid: 46, val: 46, shfl_src: 50, shfl_val: 18
tid: 47, val: 47, shfl_src: 51, shfl_val: 19
tid: 48, val: 48, shfl_src: 52, shfl_val: 20
tid: 49, val: 49, shfl_src: 53, shfl_val: 21
tid: 50, val: 50, shfl_src: 54, shfl_val: 22
tid: 51, val: 51, shfl_src: 55, shfl_val: 23
tid: 52, val: 52, shfl_src: 56, shfl_val: 24
tid: 53, val: 53, shfl_src: 57, shfl_val: 25
tid: 54, val: 54, shfl_src: 58, shfl_val: 26
tid: 55, val: 55, shfl_src: 59, shfl_val: 27
tid: 56, val: 56, shfl_src: 60, shfl_val: 28
tid: 57, val: 57, shfl_src: 61, shfl_val: 29
tid: 58, val: 58, shfl_src: 62, shfl_val: 30
tid: 59, val: 59, shfl_src: 63, shfl_val: 31
tid: 60, val: 60, shfl_src: 32, shfl_val: 32
tid: 61, val: 61, shfl_src: 33, shfl_val: 33
tid: 62, val: 62, shfl_src: 34, shfl_val: 34
tid: 63, val: 63, shfl_src: 35, shfl_val: 35
HIP results for width = 64
tid: 0, val: 0, shfl_src: 4, shfl_val: 4
tid: 1, val: 1, shfl_src: 5, shfl_val: 5
tid: 2, val: 2, shfl_src: 6, shfl_val: 6
tid: 3, val: 3, shfl_src: 7, shfl_val: 7
tid: 4, val: 4, shfl_src: 8, shfl_val: 8
tid: 5, val: 5, shfl_src: 9, shfl_val: 9
tid: 6, val: 6, shfl_src: 10, shfl_val: 10
tid: 7, val: 7, shfl_src: 11, shfl_val: 11
tid: 8, val: 8, shfl_src: 12, shfl_val: 12
tid: 9, val: 9, shfl_src: 13, shfl_val: 13
tid: 10, val: 10, shfl_src: 14, shfl_val: 14
tid: 11, val: 11, shfl_src: 15, shfl_val: 15
tid: 12, val: 12, shfl_src: 16, shfl_val: 16
tid: 13, val: 13, shfl_src: 17, shfl_val: 17
tid: 14, val: 14, shfl_src: 18, shfl_val: 18
tid: 15, val: 15, shfl_src: 19, shfl_val: 19
tid: 16, val: 16, shfl_src: 20, shfl_val: 20
tid: 17, val: 17, shfl_src: 21, shfl_val: 21
tid: 18, val: 18, shfl_src: 22, shfl_val: 22
tid: 19, val: 19, shfl_src: 23, shfl_val: 23
tid: 20, val: 20, shfl_src: 24, shfl_val: 24
tid: 21, val: 21, shfl_src: 25, shfl_val: 25
tid: 22, val: 22, shfl_src: 26, shfl_val: 26
tid: 23, val: 23, shfl_src: 27, shfl_val: 27
tid: 24, val: 24, shfl_src: 28, shfl_val: 28
tid: 25, val: 25, shfl_src: 29, shfl_val: 29
tid: 26, val: 26, shfl_src: 30, shfl_val: 30
tid: 27, val: 27, shfl_src: 31, shfl_val: 31
tid: 28, val: 28, shfl_src: 32, shfl_val: 32
tid: 29, val: 29, shfl_src: 33, shfl_val: 33
tid: 30, val: 30, shfl_src: 34, shfl_val: 34
tid: 31, val: 31, shfl_src: 35, shfl_val: 35
tid: 32, val: 32, shfl_src: 36, shfl_val: 36
tid: 33, val: 33, shfl_src: 37, shfl_val: 37
tid: 34, val: 34, shfl_src: 38, shfl_val: 38
tid: 35, val: 35, shfl_src: 39, shfl_val: 39
tid: 36, val: 36, shfl_src: 40, shfl_val: 40
tid: 37, val: 37, shfl_src: 41, shfl_val: 41
tid: 38, val: 38, shfl_src: 42, shfl_val: 42
tid: 39, val: 39, shfl_src: 43, shfl_val: 43
tid: 40, val: 40, shfl_src: 44, shfl_val: 44
tid: 41, val: 41, shfl_src: 45, shfl_val: 45
tid: 42, val: 42, shfl_src: 46, shfl_val: 46
tid: 43, val: 43, shfl_src: 47, shfl_val: 47
tid: 44, val: 44, shfl_src: 48, shfl_val: 48
tid: 45, val: 45, shfl_src: 49, shfl_val: 49
tid: 46, val: 46, shfl_src: 50, shfl_val: 50
tid: 47, val: 47, shfl_src: 51, shfl_val: 51
tid: 48, val: 48, shfl_src: 52, shfl_val: 52
tid: 49, val: 49, shfl_src: 53, shfl_val: 53
tid: 50, val: 50, shfl_src: 54, shfl_val: 54
tid: 51, val: 51, shfl_src: 55, shfl_val: 55
tid: 52, val: 52, shfl_src: 56, shfl_val: 56
tid: 53, val: 53, shfl_src: 57, shfl_val: 57
tid: 54, val: 54, shfl_src: 58, shfl_val: 58
tid: 55, val: 55, shfl_src: 59, shfl_val: 59
tid: 56, val: 56, shfl_src: 60, shfl_val: 60
tid: 57, val: 57, shfl_src: 61, shfl_val: 61
tid: 58, val: 58, shfl_src: 62, shfl_val: 62
tid: 59, val: 59, shfl_src: 63, shfl_val: 63
tid: 60, val: 60, shfl_src: 0, shfl_val: 0
tid: 61, val: 61, shfl_src: 1, shfl_val: 1
tid: 62, val: 62, shfl_src: 2, shfl_val: 2
tid: 63, val: 63, shfl_src: 3, shfl_val: 3
Rabian, thanks for the report. Tagged one of my colleagues. Siuchi.
It seems to be a HIP header issue
https://github.com/ROCm-Developer-Tools/hipamd/blob/b242cbcaa52e1ee9293382996c6573d2b9f3601a/include/hip/amd_detail/amd_warp_functions.h#L90
index is calculated incorrectly.
It seems index should be:
int index = src_lane%width + (self & ~(width-1));
It never occurred to me check if these were implemented at a higher level, that's even easier to fix. I think this is issue is present in all __sync* intrinsics. I think the index in this case should be:
(srcLane & (width - 1)) + (laneId & (-width)) = (srcLane % width) + ((laneId / width) * width)
integer division is expansive. Since width must be power of 2, I assume laneId/width*width
is equivalent to laneId &~(width-1)
?
Yeah, I think (laneId & (-width)) = laneId &~(width-1) = laneId / width * width
.
@yxsamliu , has it been fixed?
it has been fixed https://github.com/ROCm/clr/commit/23e99dbb0785042c8f548f1dd7bbbb7554ed825b