rocprofiler
rocprofiler copied to clipboard
Rocprofiler does not allow to change metrics when using intercept mode
Currently, rocprofiler does not allow to change metrics at runtime for intercepted kernels, so the following example won't work:
rocprofiler_feature_t features[4];
features[0].kind = ROCPROFILER_FEATURE_KIND_METRIC;
features[0].name = "SQ_WAVES";
unsigned feature_count = 1;
init_intercept(features, feature_count);
start_intercept();
hipLaunchKernelGGL(vectoradd_float,
dim3(WIDTH/THREADS_PER_BLOCK_X, HEIGHT/THREADS_PER_BLOCK_Y),
dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y),
0, 0,
deviceA ,deviceB ,deviceC ,WIDTH ,HEIGHT);
hipDeviceSynchronize();
stop_intercept();
shutdown_intercept();
features[1].kind = ROCPROFILER_FEATURE_KIND_METRIC;
features[1].name = "SQ_INSTS_VALU";
feature_count += 1;
init_intercept(features, feature_count);
start_intercept();
hipLaunchKernelGGL(vectoradd_float,
dim3(WIDTH/THREADS_PER_BLOCK_X, HEIGHT/THREADS_PER_BLOCK_Y),
dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y),
0, 0,
deviceA ,deviceB ,deviceC ,WIDTH ,HEIGHT);
hipDeviceSynchronize();
stop_intercept();
shutdown_intercept();
Above, init_intercept()
initializes the queue callbacks for intercept mode and calls rocprofiler_set_queue_callbacks()
. start_intercept()
and stop_intercept()
call rocprofiler_start_queue_callbacks()
and rocprofiler_stop_queue_callbacks()
, respectively, and shutdown_intercept()
calls rocprofiler_remove_queue_callbacks()
.
Rocprofiler does not allow users to call rocprofiler_set_queue_callbacks()
if this has been already called. Thus, the second call to init_intercept()
in the example code above causes the following error message:
> error(4096) "SetCallbacks(), reassigning queue callbacks - not supported”
The ability to change metrics at runtime (while using intercept mode) is a feature highly desirable for tools like PAPI. With the current implementation of rocprofiler PAPI users would have to define metrics once and have them applied to all the kernels being intercepted.
Example source code: vectoradd_hip.cpp.txt
Could you please check small fix? https://github.com/ROCm-Developer-Tools/rocprofiler/pull/77 Value of feature_count in https://github.com/ROCm-Developer-Tools/rocprofiler/files/7933103/vectoradd_hip.cpp.txt. can't be greater than actual number of features in vector.
@kikimych I will check it. Thanks
Previously I was getting:
LD_LIBRARY_PATH=$HOME/rocm-4.5.0/rocprofiler/lib:$LD_LIBRARY_PATH ./vectoradd_hip.exe
System minor 0
System major 9
agent prop name
hip Device prop succeeded
kernel symbol(0x7f051040a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67149) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f051040a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67149) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f051040a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67149) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f051040a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67149) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
error(4096) "SetCallbacks(), reassigning queue callbacks - not supported"
HSA_STATUS_ERROR: A generic error has occurred.
Aborted (core dumped)
Now, with the fix that sets callback_data_ = NULL
I am getting:
$ LD_LIBRARY_PATH=$HOME/rocm-4.5.0/rocprofiler/lib:$LD_LIBRARY_PATH ./vectoradd_hip.exe
System minor 0
System major 9
agent prop name
hip Device prop succeeded
kernel symbol(0x7f5dfe40a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67314) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f5dfe40a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67314) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f5dfe40a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67314) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f5dfe40a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67314) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
vectoradd_hip.exe: vectoradd_hip.cpp:160: hsa_status_t _rocp_dispatch_callback(const rocprofiler_callback_data_t *, void *, rocprofiler_group_t *): Assertion `status == HSA_STATUS_SUCCESS' failed.
Aborted (core dumped)
You are adding 2 features to profile, but setting feature count to 1. I have fixed it locally, but forgot to report. Could you please check with feature_count set to number of actual features?
The feature_count=1
was intended not a typo. However, if I set it to 2 instead, start/stop monitoring, then change it back to 1 and start/stop monitoring again, I get the same error(4096) "SetCallbacks(), reassigning queue callbacks - not supported"
I used to see.
My bad. It's not related to feature_count. You have a typo in metric name. "SQ_INSTS_VALU", not "SQ_INST_VALU".
[Update] Still not resolved with ROCm 5.4.x:
$ ./kernel-intercept
Tool lib "/home/gcongiu/rocprofiler/build/librocprofiler64.so" failed to load.
System minor 0
System major 9
agent prop name AMD Instinct MI210
hip Device prop succeeded
error(4096) "SetCallbacks(), reassigning queue callbacks - not supported"
HSA_STATUS_ERROR: A generic error has occurred.
Aborted (core dumped)
Hi @gcongiu, apologies for the lack of response. Could you please check if the issue persists with ROCm 6.2.0? If so, we can continue to investigate the issue from there.
Closing this issue out. @gcongiu , if you are still encountering this error with the latest ROCm 6.2.1 release, please leave a comment and I will re-open this ticket. Thanks!