rocprofiler icon indicating copy to clipboard operation
rocprofiler copied to clipboard

Rocprofiler does not allow to change metrics when using intercept mode

Open gcongiu opened this issue 3 years ago • 9 comments

Currently, rocprofiler does not allow to change metrics at runtime for intercepted kernels, so the following example won't work:

  rocprofiler_feature_t features[4];
  features[0].kind = ROCPROFILER_FEATURE_KIND_METRIC;
  features[0].name = "SQ_WAVES";
  unsigned feature_count = 1;

  init_intercept(features, feature_count);
  start_intercept();

  hipLaunchKernelGGL(vectoradd_float,
                  dim3(WIDTH/THREADS_PER_BLOCK_X, HEIGHT/THREADS_PER_BLOCK_Y),
                  dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y),
                  0, 0,
                  deviceA ,deviceB ,deviceC ,WIDTH ,HEIGHT);

  hipDeviceSynchronize();
  stop_intercept();
  shutdown_intercept();

  features[1].kind = ROCPROFILER_FEATURE_KIND_METRIC;
  features[1].name = "SQ_INSTS_VALU";
  feature_count += 1;

  init_intercept(features, feature_count);
  start_intercept();

  hipLaunchKernelGGL(vectoradd_float,
                  dim3(WIDTH/THREADS_PER_BLOCK_X, HEIGHT/THREADS_PER_BLOCK_Y),
                  dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y),
                  0, 0,
                  deviceA ,deviceB ,deviceC ,WIDTH ,HEIGHT);

  hipDeviceSynchronize();
  stop_intercept();
  shutdown_intercept();

Above, init_intercept() initializes the queue callbacks for intercept mode and calls rocprofiler_set_queue_callbacks(). start_intercept() and stop_intercept() call rocprofiler_start_queue_callbacks() and rocprofiler_stop_queue_callbacks(), respectively, and shutdown_intercept() calls rocprofiler_remove_queue_callbacks().

Rocprofiler does not allow users to call rocprofiler_set_queue_callbacks() if this has been already called. Thus, the second call to init_intercept() in the example code above causes the following error message:

> error(4096) "SetCallbacks(), reassigning queue callbacks - not supported”

The ability to change metrics at runtime (while using intercept mode) is a feature highly desirable for tools like PAPI. With the current implementation of rocprofiler PAPI users would have to define metrics once and have them applied to all the kernels being intercepted.

gcongiu avatar Jan 24 '22 08:01 gcongiu

Example source code: vectoradd_hip.cpp.txt

gcongiu avatar Jan 25 '22 11:01 gcongiu

Could you please check small fix? https://github.com/ROCm-Developer-Tools/rocprofiler/pull/77 Value of feature_count in https://github.com/ROCm-Developer-Tools/rocprofiler/files/7933103/vectoradd_hip.cpp.txt. can't be greater than actual number of features in vector.

kikimych avatar Feb 21 '22 18:02 kikimych

@kikimych I will check it. Thanks

gcongiu avatar Mar 01 '22 09:03 gcongiu

Previously I was getting:

LD_LIBRARY_PATH=$HOME/rocm-4.5.0/rocprofiler/lib:$LD_LIBRARY_PATH ./vectoradd_hip.exe
 System minor 0
 System major 9
 agent prop name
hip Device prop succeeded
kernel symbol(0x7f051040a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67149) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f051040a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67149) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f051040a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67149) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f051040a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67149) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
error(4096) "SetCallbacks(), reassigning queue callbacks - not supported"
HSA_STATUS_ERROR: A generic error has occurred.
Aborted (core dumped)

Now, with the fix that sets callback_data_ = NULL I am getting:

$ LD_LIBRARY_PATH=$HOME/rocm-4.5.0/rocprofiler/lib:$LD_LIBRARY_PATH ./vectoradd_hip.exe
 System minor 0
 System major 9
 agent prop name
hip Device prop succeeded
kernel symbol(0x7f5dfe40a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67314) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f5dfe40a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67314) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f5dfe40a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67314) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
kernel symbol(0x7f5dfe40a800) name("vectoradd_float(float*, float const*, float const*, int, int) [clone .kd]") tid(67314) queue-id(0) gpu-id(0) > SQ_WAVES = (16384)
vectoradd_hip.exe: vectoradd_hip.cpp:160: hsa_status_t _rocp_dispatch_callback(const rocprofiler_callback_data_t *, void *, rocprofiler_group_t *): Assertion `status == HSA_STATUS_SUCCESS' failed.
Aborted (core dumped)

gcongiu avatar Mar 01 '22 11:03 gcongiu

You are adding 2 features to profile, but setting feature count to 1. I have fixed it locally, but forgot to report. Could you please check with feature_count set to number of actual features?

kikimych avatar Mar 25 '22 14:03 kikimych

The feature_count=1 was intended not a typo. However, if I set it to 2 instead, start/stop monitoring, then change it back to 1 and start/stop monitoring again, I get the same error(4096) "SetCallbacks(), reassigning queue callbacks - not supported" I used to see.

gcongiu avatar Mar 30 '22 11:03 gcongiu

My bad. It's not related to feature_count. You have a typo in metric name. "SQ_INSTS_VALU", not "SQ_INST_VALU".

kikimych avatar Mar 31 '22 13:03 kikimych

[Update] Still not resolved with ROCm 5.4.x:

$ ./kernel-intercept
Tool lib "/home/gcongiu/rocprofiler/build/librocprofiler64.so" failed to load.
 System minor 0
 System major 9
 agent prop name AMD Instinct MI210
hip Device prop succeeded
error(4096) "SetCallbacks(), reassigning queue callbacks - not supported"
HSA_STATUS_ERROR: A generic error has occurred.
Aborted (core dumped)

gcongiu avatar Apr 03 '23 07:04 gcongiu

Hi @gcongiu, apologies for the lack of response. Could you please check if the issue persists with ROCm 6.2.0? If so, we can continue to investigate the issue from there.

harkgill-amd avatar Aug 13 '24 20:08 harkgill-amd

Closing this issue out. @gcongiu , if you are still encountering this error with the latest ROCm 6.2.1 release, please leave a comment and I will re-open this ticket. Thanks!

harkgill-amd avatar Sep 26 '24 19:09 harkgill-amd