[Issue]: bh-hip hangs
Problem Description
In the following code snippet, the program appears to hang while executing SummarizationKernel.
for (step = 0; step < timesteps; step++) {
  BoundingBoxKernel<<<blocks * FACTOR1, THREADS1>>>(
      nnodes, nbodies, d_start, d_child, d_posMass, d_max, d_min,
      d_radius, d_bottom, d_step, d_blkcnt);
  ClearKernel1<<<blocks, 256>>>(nnodes, nbodies, d_child);
  TreeBuildingKernel<<<blocks * FACTOR2, THREADS2>>>(
      nnodes, nbodies, d_child, d_posMass, d_radius, d_bottom);
  ClearKernel2<<<blocks, 256>>>(nnodes, d_start, d_posMass, d_bottom);
  SummarizationKernel<<<blocks * FACTOR3, THREADS3>>>(
      nnodes, nbodies, d_count, d_child, d_posMass, d_bottom);
  SortKernel<<<blocks * FACTOR4, THREADS4>>>(
      nnodes, nbodies, d_sort, d_count, d_start, d_child, d_bottom);
  ForceCalculationKernel<<<blocks * FACTOR5, THREADS5>>>(
      nnodes, nbodies, dthf, itolsq, epssq, d_sort, d_child, d_posMass,
      d_vel, d_accVel, d_radius, d_step);
  IntegrationKernel<<<blocks * FACTOR6, THREADS6>>>(
      nbodies, dtime, dthf, d_posMass, d_vel, d_accVel);
}
hipDeviceSynchronize();
Operating System
Ubuntu 22.04.5 LTS (Jammy Jellyfish)
CPU
AMD Ryzen Threadripper 3970X 32-Core Processor
GPU
AMD Radeon RX 6900 XT
ROCm Version
rocm-6.3.2
ROCm Component
No response
Steps to Reproduce
1. hipify the CUDA code in https://github.com/zjin-lcf/HeCBench/tree/master/src/bh-cuda/main.cu
2. Build the HIP program: hipcc -O3 main.cu -o main
3. Run the HIP program: ./main 10000 10
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
The program may generate a memory access fault: https://github.com/zjin-lcf/HeCBench/issues/104
Hi @jinz2014. Internal ticket has been created to investigate this issue. Thanks!
Hi @jinz2014, the kernel appears to hang but it just takes some time to finish.
$ ./main 10000 10
ECL-BH v4.5
Copyright (c) 2010-2020 Texas State University
configuration: 10000 bodies, 10 time steps
Total kernel execution time: 558.8470 s
You may have run the program on an NVIDIA GPU. Is the long run time related to atomic operations?
From the profiler results, SummarizationKernel is indeed taking significantly more time than the other kernels. I'll try to compare with the CUDA version when I can get my hands on an NVIDIA machine.
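For anyone who wants to confirm which stage dominates without a profiler, a minimal host-side sketch is below. It accumulates wall-clock time per named stage. StageTimer and its members are hypothetical names, not part of the benchmark, and the sketch assumes each stage blocks until its work is done (for a kernel launch, that means calling hipDeviceSynchronize() inside the stage, which may itself change the overall timing).

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <string>

// Hypothetical helper: accumulates wall-clock seconds per named stage.
struct StageTimer {
  std::map<std::string, double> seconds;

  // Runs `stage` and adds its elapsed time to the total for `name`.
  // Assumption: `stage` blocks until its work has finished. For a GPU
  // kernel it should launch the kernel and then synchronize, otherwise
  // only the asynchronous launch overhead is measured.
  void run(const std::string& name, const std::function<void()>& stage) {
    const auto t0 = std::chrono::steady_clock::now();
    stage();
    const auto t1 = std::chrono::steady_clock::now();
    seconds[name] += std::chrono::duration<double>(t1 - t0).count();
  }

  // Returns the stage with the largest accumulated time ("" if none ran).
  std::string slowest() const {
    std::string best;
    double best_s = -1.0;
    for (const auto& kv : seconds) {
      if (kv.second > best_s) {
        best_s = kv.second;
        best = kv.first;
      }
    }
    return best;
  }
};
```

In the time-step loop, each launch would be wrapped in a call such as timer.run("SummarizationKernel", [&] { /* launch, then hipDeviceSynchronize() */ }); after the loop, timer.slowest() names the stage that accumulated the most time.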
This issue has been migrated to: https://github.com/ROCm/rocm-systems/issues/396