HIP [Issue]: bh-hip hangs

Problem Description

In the following code snippet, the program appears to hang while executing SummarizationKernel.

  for (step = 0; step < timesteps; step++) {
    BoundingBoxKernel<<<blocks * FACTOR1, THREADS1>>>(
        nnodes, nbodies, d_start, d_child, d_posMass, d_max, d_min,
        d_radius, d_bottom, d_step, d_blkcnt );

    ClearKernel1<<<blocks, 256>>>(nnodes, nbodies, d_child);

    TreeBuildingKernel<<<blocks * FACTOR2, THREADS2>>>(
        nnodes, nbodies, d_child, d_posMass, d_radius, d_bottom);

    ClearKernel2<<<blocks, 256>>>(nnodes, d_start, d_posMass, d_bottom);

    SummarizationKernel<<<blocks * FACTOR3, THREADS3>>>(
        nnodes, nbodies, d_count, d_child, d_posMass, d_bottom);

    SortKernel<<<blocks * FACTOR4, THREADS4>>>(
        nnodes, nbodies, d_sort, d_count, d_start, d_child, d_bottom);

    ForceCalculationKernel<<<blocks * FACTOR5, THREADS5>>>(
        nnodes, nbodies, dthf, itolsq, epssq, d_sort, d_child, d_posMass,
        d_vel, d_accVel, d_radius, d_step);

    IntegrationKernel<<<blocks * FACTOR6, THREADS6>>>(
        nbodies, dtime, dthf, d_posMass, d_vel, d_accVel);
  }
  hipDeviceSynchronize();

Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

CPU

AMD Ryzen Threadripper 3970X 32-Core Processor

GPU

AMD Radeon RX 6900 XT

ROCm Version

rocm-6.3.2

ROCm Component

No response

Steps to Reproduce

1. Hipify the CUDA code in https://github.com/zjin-lcf/HeCBench/tree/master/src/bh-cuda/main.cu
2. Build the HIP program: hipcc -O3 main.cu -o main
3. Run the HIP program: ./main 10000 10

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

The program may generate a memory access fault: https://github.com/zjin-lcf/HeCBench/issues/104

Feb 22 '25 16:02 jinz2014

Hi @jinz2014. An internal ticket has been created to investigate this issue. Thanks!

Feb 24 '25 15:02 ppanchad-amd

Hi @jinz2014, the kernel appears to hang, but it just takes a long time to finish.

$ ./main 10000 10
ECL-BH v4.5
Copyright (c) 2010-2020 Texas State University
configuration: 10000 bodies, 10 time steps
Total kernel execution time: 558.8470 s

Mar 05 '25 19:03 zichguan-amd

You may have run the program on an Nvidia GPU. Is the long run time related to atomic operations?

Mar 06 '25 21:03 jinz2014

From the profiler results, SummarizationKernel is indeed taking significantly more time than the other kernels. I'll try to compare with the CUDA version when I can get my hands on an Nvidia machine.
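For anyone who wants a similar per-kernel breakdown without a profiler, the sketch below shows one possible way to do it on the host side. Kernel launches are asynchronous, so the wall-clock time around a launch is only meaningful if the host blocks until the device is idle before reading the clock (in the HIP program that would be hipDeviceSynchronize()). StageTimer, fakeKernel, timeStandInStages and the stage names are hypothetical helpers made up for this illustration, and the sleeps are arbitrary placeholders, not measurements of bh-hip.

```cpp
#include <chrono>
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <thread>

// Accumulates wall-clock time per named stage across all time steps.
// Hypothetical helper, not part of bh-hip or the HIP runtime.
class StageTimer {
 public:
  // Runs `launch`, then `sync`, and charges the elapsed time to `name`.
  // In a GPU program, `launch` would enqueue the kernel and `sync` would
  // block until the device has finished (e.g. hipDeviceSynchronize()).
  void run(const std::string& name,
           const std::function<void()>& launch,
           const std::function<void()>& sync) {
    const auto t0 = std::chrono::steady_clock::now();
    launch();
    sync();
    const auto t1 = std::chrono::steady_clock::now();
    seconds_[name] += std::chrono::duration<double>(t1 - t0).count();
  }

  // Accumulated time for one stage, or 0.0 if the stage never ran.
  double seconds(const std::string& name) const {
    const auto it = seconds_.find(name);
    return it == seconds_.end() ? 0.0 : it->second;
  }

  // Name of the stage with the largest accumulated time.
  std::string slowest() const {
    std::string best;
    double bestTime = -1.0;
    for (const auto& kv : seconds_) {
      if (kv.second > bestTime) {
        bestTime = kv.second;
        best = kv.first;
      }
    }
    return best;
  }

  // Prints one line per stage.
  void report() const {
    for (const auto& kv : seconds_) {
      std::printf("%-12s %10.4f s\n", kv.first.c_str(), kv.second);
    }
  }

 private:
  std::map<std::string, double> seconds_;
};

// Stand-in workload: sleeps instead of launching a kernel.
static void fakeKernel(int milliseconds) {
  std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
}

// Times three stand-in stages over `timesteps` iterations.
// The durations are placeholders chosen only to make one stage dominate.
StageTimer timeStandInStages(int timesteps) {
  StageTimer timer;
  const auto noSync = [] {};  // replace with a device synchronize on a GPU
  for (int step = 0; step < timesteps; ++step) {
    timer.run("stage_a", [] { fakeKernel(1); }, noSync);
    timer.run("stage_b", [] { fakeKernel(50); }, noSync);
    timer.run("stage_c", [] { fakeKernel(1); }, noSync);
  }
  return timer;
}
```

Calling timeStandInStages(3).report() from main prints the accumulated time per stage. Synchronizing after every launch serializes the pipeline, so this is only suitable for locating the slow kernel, not for measuring end-to-end performance.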

Mar 07 '25 16:03 zichguan-amd

This issue has been migrated to: https://github.com/ROCm/rocm-systems/issues/396

Aug 18 '25 18:08 systems-assistant[bot]