[Issue]: Zero TCC_HIT_sum all the time
Problem Description
While using rocprofv2 to collect performance counters such as TCC_HIT_sum and TCC_MISS_sum on Vega 20, I found that TCC_HIT_sum is always 0, while TCC_MISS_sum shows non-zero values, so I assume the miss counter works. Could you investigate why the hit information is always 0 (including the per-bank hit counters from all 16 cache banks) and double-check whether the TCC miss values are correct? That would be much appreciated. BTW, this problem exists regardless of whether I collect the performance counters on ROCm 6.2.2-116 or in the 6.3.0 docker image.
Example output.csv returned from the profiler:

Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,FlatVMemInsts,TCC_EA_RDREQ_sum,TCC_EA_RDREQ_32B_sum,TCC_HIT_sum,TCC_MISS_sum,TCC_MISS[12],TCC_MISS[13],TCC_MISS[14],TCC_MISS[15],TCC_HIT[0],TCC_HIT[1],TCC_HIT[2],TCC_HIT[3],TCC_HIT[4],TCC_HIT[5],TCC_HIT[6],TCC_HIT[7],TCC_HIT[8],TCC_HIT[9],TCC_HIT[10],TCC_HIT[11],TCC_HIT[12],TCC_HIT[13],TCC_HIT[14],TCC_HIT[15],TA_FLAT_WRITE_WAVEFRONTS_sum,TA_FLAT_READ_WAVEFRONTS_sum,TCC_EA_RDREQ[0],TCC_EA_RDREQ[1],TCC_EA_RDREQ[2],TCC_EA_RDREQ[3],TCC_EA_RDREQ[4],TCC_EA_RDREQ[5],TCC_EA_RDREQ[6],TCC_EA_RDREQ[7],TCC_EA_RDREQ[8],TCC_EA_RDREQ[9],TCC_EA_RDREQ[10],TCC_EA_RDREQ[11],TCC_EA_RDREQ[12],TCC_EA_RDREQ[13],TCC_EA_RDREQ[14],TCC_EA_RDREQ[15],TCC_MISS[0],TCC_MISS[1],TCC_MISS[2],TCC_MISS[3],TCC_MISS[4],TCC_MISS[5],TCC_MISS[6],TCC_MISS[7],TCC_MISS[8],TCC_MISS[9],TCC_MISS[10],TCC_MISS[11]
0,"kernel(int*) [clone .kd]",1,0,1,14761,14761,1,1,0,0,40,0,48,64,0x0,0x79eecbe84540,60.0000000000,68.0000000000,0.0000000000,0.0000000000,102.0000000000,4,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0000000000,60.0000000000,0,0,0,0,0,0,6,60,1,0,0,0,0,0,0,0,4,0,4,6,0,10,4,61,5,0,7,4
Operating System
Ubuntu 24.04.1 LTS
CPU
AMD Ryzen 9 3900X 12-Core Processor
GPU
gfx906 (AMD Vega 7nm also referred to as AMD Vega 20)
ROCm Version
ROCm 6.3.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
Hi @RookieT0T. Internal ticket has been created to assist with your issue. Thanks!
Hi @RookieT0T, can you share the workload that you are trying to profile? It's normal to have 0 L2 hit rate if your workload doesn't reuse any cached data.
My workload has a number of flat_load_dwordx2 instructions inside asm volatile blocks in the kernel function. The addresses specified in those load instructions should incur some cache hits. Also, the "glc" flag is specified at the end of each load instruction so that the accesses bypass the L1 cache (TCP) and go directly to the L2 cache (TCC).
Example of kernel function with only one load instruction:
__global__ void kernel(int* arr)
{
    uint64_t a = 0;
    asm volatile(
        "s_waitcnt vmcnt(0) & lgkmcnt(0)\n\t"
        "buffer_wbinvl1\n\t"
        "flat_load_dwordx2 %[out0], %[in1] glc\n\t"
        "s_waitcnt vmcnt(0) & lgkmcnt(0)\n\t"
        "s_nop 0\n\t"
        : [out0] "=v"(a)
        : [in1] "v"((uint64_t*) &arr[0])
        : "memory");
}
I am wondering: if the "glc" flag is added, will instruction cache hits also be counted in the TCC hit sum reported by the profiler, in addition to the data cache hits incurred by the program?
Is there any progress?
@RookieT0T please try using rocprofv3 from the new rocprofiler-sdk package instead of rocprofv2. rocprofv2 was always a beta, and the design of the underlying rocprofiler v2 library was problematic and poorly tested. We plan to keep fixing bugs in rocprof for a while, since rocprof reached production release, but we do not plan to do the same for rocprofv2, since v2 was scrapped.
v2 was rushed out the door and it shows. We took our time with rocprofv3 and rocprofiler-sdk and learned from the previous issues. Despite their “beta” status, they are already far more reliable and far better tested than any of their predecessors.
Hi there, I just tried rocprofv3 with this ROCm docker image. Unfortunately, the cache hit info is still zero.

"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value"
1,1,1,1,1415542,1415542,1,"kernel(int*)",1,0,0,36,32,"SQ_WAVES",1
1,1,1,1,1415542,1415542,1,"kernel(int*)",1,0,0,36,32,"TCC_HIT_sum",0
1,1,1,1,1415542,1415542,1,"kernel(int*)",1,0,0,36,32,"TCC_MISS_sum",33
In terms of my workload, 17 cache-line accesses are made: c_0, c_1, c_2, c_3, c_4, c_5, c_6, c_7, c_8, c_9, c_10, c_11, c_12, c_13, c_14, c_15, and then c_15 again. The last access to c_15 must be a cache hit, which is never reported by the profiler.
If you want to test my workload, this is one example. When I ran this example workload on older ROCm docker images like 4.0 or 3.7, the hit data was shown (the input.csv in my linked repository is the output file returned by the 3.7 profiler running the example workload; one cache hit is expected). Hope this helps.
Hi @RookieT0T I'm able to reproduce the issue with rocprof in your image. It is fixed in 6.3+. Can you confirm upgrading to 6.3 fixes the issue?
Hi, I am glad to hear that. To reiterate my problem: the cache hit count is never reported in docker images of ROCm 6.3.0 and older (except 4.0 and 3.7), nor with the ROCm 6.2.2 that is natively installed on my machine. The machine with the AMD graphics card that I am currently using has ROCm 6.2.2-116 installed, and I tried running my example program both on ROCm 6.2.2-116 (no docker) and in several docker images, including ROCm 6.3.1.
Therefore, based on your reply, I need to first upgrade my machine to ROCm 6.3.1 and then use the upgraded rocprofiler to get the cache hits. I am also wondering whether the ROCm version installed natively and the ROCm version used inside the docker images must match to get the cache hits. Finally, when you said 6.3+ in your reply, did you mean the upcoming ROCm 6.3.2, or any version starting with 6.3.x, like 6.3.0, 6.3.1, ...?
The graphics driver used should make little difference for this issue, since the underlying problem was a ROCm software issue. Does the issue still exist if you run your application in a docker container with 6.3.1 (or in this container: https://hub.docker.com/r/rocm/dev-ubuntu-22.04)?
Hi, all. I just tried the docker image with 6.3.1. Unfortunately, the rocprofv3 result showed that the cache hit count was still 0. Have you tried my example workload on your machines, and did it show hit data?
Output:

"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Id","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value","Start_Timestamp","End_Timestamp"
1,1,1,1,131,131,1,16,"kernel(int*)",1,0,0,36,32,"SQ_WAVES",1.000000,807736593474300,807736593509020
1,1,1,1,131,131,1,16,"kernel(int*)",1,0,0,36,32,"TCC_HIT_sum",0.00000000e+00,807736593474300,807736593509020
1,1,1,1,131,131,1,16,"kernel(int*)",1,0,0,36,32,"TCC_MISS_sum",36.000000,807736593474300,807736593509020
I am not sure if I missed something or set things up incorrectly. When I typed "which rocprofv3" in the docker image, "/opt/rocm-6.3.1/bin/rocprofv3" was shown, and I started the container with:

$ docker run --rm -it --device /dev/kfd --device /dev/dri -v ./:/workdir --security-opt seccomp=unconfined --group-add 1 --group-add 2 rocm/dev-ubuntu-22.04:6.3.1-complete bash
@RookieT0T I looked into your assembly (note: please format it in a code block in the future) and it is unclear why you are expecting data to be in the L2 cache. Since there are no previous accesses to int* arr, why would it already be loaded into the L2 cache? You won't get an L2 cache hit for data read in from global memory.
It should get loaded into L2 as a result of the read from global memory, but if our HW counted that as an L2 cache hit, what would be the point of that counter?
The addresses specified in each load instruction are carefully calculated to ensure that the cache lines containing the data (array elements) map to the same cache set, enabling subsequent load instructions with calculated addresses to access the same cache line, resulting in cache hits. The expectation that the data will be in the L2 cache arises from the glc flag appended to each load instruction, which forces loads to bypass the TCP (L1 cache) and access the TCC (the L2 cache). Additionally, not every element in the array int* arr is accessed, as the access pattern is non-sequential and strategically designed to maximize cache performance.
In general, you can think of the example workload as an attempt to access a set of cache lines that all map to a single cache set and then analyze the cache hit/miss statistics. This is why I need the cache hit data. I believe there may still be something wrong with the rocprofiler or with my local setup, because developers in my project group acquired both cache hit and miss data roughly a year ago using rocprofiler. Then our group upgraded to Ubuntu 24, and the cache hit data was never reported again. Sorry, this is as much as I can tell.
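To illustrate the idea, here is a minimal sketch of the set-stride arithmetic behind the address choices (the line size and set count below are placeholder values, not the exact Vega 20 parameters, and the hardware may additionally interleave addresses across the 16 TCC banks):

#include <cstdint>
#include <cstdio>

int main()
{
    // Placeholder L2 geometry (assumed values, NOT confirmed for gfx906).
    constexpr uint64_t kLineSize  = 64;                   // bytes per cache line
    constexpr uint64_t kNumSets   = 256;                  // number of sets
    constexpr uint64_t kSetStride = kLineSize * kNumSets; // spacing between lines that share a set

    // 16 distinct lines c_0..c_15 that all map to the same set, plus a
    // repeated access to c_15; the repeat is the access expected to hit.
    for (uint64_t i = 0; i < 16; ++i)
        printf("c_%llu offset = 0x%llx\n",
               (unsigned long long) i, (unsigned long long) (i * kSetStride));
    printf("c_15 (repeated) offset = 0x%llx  <- expected L2 hit\n",
           (unsigned long long) (15 * kSetStride));
    return 0;
}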
@RookieT0T If I use this code:
#ifdef NDEBUG
# undef NDEBUG
#endif
#include <hip/hip_runtime.h>
#include <array>    // std::array used in main()
#include <cassert>  // for the (commented-out) assert(a == 1)
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#define HIP_API_CALL(CALL) \
{ \
hipError_t error_ = (CALL); \
if(error_ != hipSuccess) \
{ \
fprintf(stderr, \
"%s:%d :: HIP error : %s\n", \
__FILE__, \
__LINE__, \
hipGetErrorString(error_)); \
throw std::runtime_error("hip_api_call"); \
} \
}
__global__ void
kernel(uint64_t* arr)
{
uint64_t a = 0;
__asm volatile(R"(
s_waitcnt vmcnt(0) & lgkmcnt(0)
buffer_wbinvl1
flat_load_dwordx2 %[out0], %[in1] glc
s_waitcnt vmcnt(0) & lgkmcnt(0)
s_nop 0
)"
: [out0] "=v"(a)
: [in1] "v"((uint64_t*) &arr[0])
: "memory");
// assert(a == 1);
}
int
main(int /*argc*/, char** /*argv*/)
{
constexpr size_t length = 4;
constexpr size_t num_bytes = length * sizeof(uint64_t);
uint64_t* data = nullptr;
auto out = std::array<uint64_t, length>{};
out.fill(1);
HIP_API_CALL(hipMalloc(&data, num_bytes));
HIP_API_CALL(hipMemset(data, 0, num_bytes));
HIP_API_CALL(hipMemcpy(data, out.data(), num_bytes, hipMemcpyHostToDevice));
kernel<<<1, 1>>>(data);
HIP_API_CALL(hipDeviceSynchronize());
}
using rocprofv3 on my Vega20:
$ rocprofv3 --runtime-trace --pmc FlatVMemInsts TCC_EA_RDREQ_sum TCC_EA_RDREQ_32B_sum TCC_HIT_sum TCC_MISS_sum --kernel-include-regex "kernel.*" -d tcc-hit -o out -- ./bin/tcc-hit-assembly
I get TCC_HIT_sum = 0.0:
"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Id","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value","Start_Timestamp","End_Timestamp"
4,2,1,1,4165,4165,1,16,"kernel(unsigned long*)",1,0,0,4,16,"FlatVMemInsts",1.000000,239069576712464,239069576717424
4,2,1,1,4165,4165,1,16,"kernel(unsigned long*)",1,0,0,4,16,"TCC_EA_RDREQ_32B_sum",0.00000000e+00,239069576712464,239069576717424
4,2,1,1,4165,4165,1,16,"kernel(unsigned long*)",1,0,0,4,16,"TCC_EA_RDREQ_sum",3.000000,239069576712464,239069576717424
4,2,1,1,4165,4165,1,16,"kernel(unsigned long*)",1,0,0,4,16,"TCC_HIT_sum",0.00000000e+00,239069576712464,239069576717424
4,2,1,1,4165,4165,1,16,"kernel(unsigned long*)",1,0,0,4,16,"TCC_MISS_sum",9.000000,239069576712464,239069576717424
However, if I change uint64_t a to be volatile:
volatile uint64_t a = 0;
I get TCC_HIT_sum = 2.0:
"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Id","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value","Start_Timestamp","End_Timestamp"
4,2,1,1,4927,4927,1,16,"kernel(unsigned long*)",1,0,88,40,48,"FlatVMemInsts",3.000000,239497802778690,239497802785090
4,2,1,1,4927,4927,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_EA_RDREQ_32B_sum",0.00000000e+00,239497802778690,239497802785090
4,2,1,1,4927,4927,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_EA_RDREQ_sum",12.000000,239497802778690,239497802785090
4,2,1,1,4927,4927,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_HIT_sum",2.000000,239497802778690,239497802785090
4,2,1,1,4927,4927,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_MISS_sum",14.000000,239497802778690,239497802785090
Furthermore, if I uncomment the assert(a == 1) and run rocprofv3 again, I get TCC_HIT_sum = 4.0:
"Correlation_Id","Dispatch_Id","Agent_Id","Queue_Id","Process_Id","Thread_Id","Grid_Size","Kernel_Id","Kernel_Name","Workgroup_Size","LDS_Block_Size","Scratch_Size","VGPR_Count","SGPR_Count","Counter_Name","Counter_Value","Start_Timestamp","End_Timestamp"
4,2,1,1,5005,5005,1,16,"kernel(unsigned long*)",1,0,88,40,48,"FlatVMemInsts",4.000000,239625327580938,239625327585898
4,2,1,1,5005,5005,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_EA_RDREQ_32B_sum",0.00000000e+00,239625327580938,239625327585898
4,2,1,1,5005,5005,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_EA_RDREQ_sum",10.000000,239625327580938,239625327585898
4,2,1,1,5005,5005,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_HIT_sum",4.000000,239625327580938,239625327585898
4,2,1,1,5005,5005,1,16,"kernel(unsigned long*)",1,0,88,40,48,"TCC_MISS_sum",13.000000,239625327580938,239625327585898
The previous TCC_HIT you were seeing is likely instruction fetch or similar. It should only hit on the second load:
__global__ void kernel(int * arr)
{
uint64_t a = 0;
asm volatile(
"flat_load_dwordx2 %[out0], %[in1] glc\n"
"s_waitcnt vmcnt(0)\n"
: [out0]"=v"(a) : [in1]"v"((uint64_t *)&arr[0]) : "memory"
);
asm volatile(
"flat_load_dwordx2 %[out0], %[in1] glc\n"
"s_waitcnt vmcnt(0)\n"
: [out0]"=v"(a) : [in1]"v"((uint64_t *)&arr[0]) : "memory"
);
}
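To try this variant, the two-load kernel can be dropped into the host harness from the code above and built/profiled along these lines (the source and binary names here are placeholders; the rocprofv3 flags mirror the earlier command):

$ hipcc --offload-arch=gfx906 two_loads.cpp -o two_loads
$ rocprofv3 --pmc TCC_HIT_sum TCC_MISS_sum -d tcc-hit -o out -- ./two_loads

With the repeated load to the same address, TCC_HIT_sum should come back as at least 1.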
Thanks, I will take a look right now.
Do you mean the TCC_HIT values acquired by my project group members one year ago were likely instruction fetches (instruction cache hits)?
Hi @RookieT0T, sorry for the late reply. Please use the newer rocprofiler-sdk. Based on the previous comments, this doesn't look like a bug in the profiler. The TCC_HIT you saw with earlier ROCm versions may have several causes: a compiler change, bug fixes, or instruction fetch. You could check L1I activity, since the L1I is also backed by the L2 cache.
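For example, assuming the L1I (SQC instruction cache) counters are exposed for gfx906, you can list what your agent supports and then collect those counters alongside TCC_HIT_sum / TCC_MISS_sum:

$ rocprofv3 --list-avail

The exact counter names vary per ASIC, so take them from that listing rather than from this comment.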
Thanks for your reply. I will check this later.
Hi @RookieT0T, I'm closing this ticket for now, feel free to comment again if you still have questions/concerns and I can reopen the issue.