
Clarification on `zetMetricStreamerReadData` Behavior for Non-Overlapping Kernel Profiling

Open yuninxia opened this issue 1 year ago • 2 comments

Environment

  • Hardware: Aurora
  • Intel Compute Runtime Version: 24.35.30872.22

Context

I'm developing a profiler for SYCL offload programs. My approach involves serializing kernel launches using zeEventHostSynchronize to ensure only one kernel is offloaded to the Intel GPU device at a time. For each kernel, I use a profiling thread to read stall sampling data using zetMetricStreamerReadData.

Current Implementation

Currently, after each kernel execution, I collect and process the data. To ensure non-overlapping stall samples between kernels, I've implemented a manual buffer flushing function zeroFlushStreamerBuffer(streamer, desc). This function closes the current streamer and opens a new one.

void zeroFlushStreamerBuffer(zet_metric_streamer_handle_t& streamer, ZeDeviceDescriptor* desc)
{
    ze_result_t status = ZE_RESULT_SUCCESS;
    // Close the old streamer
    status = zetMetricStreamerClose(streamer);
    level0_check_result(status, __LINE__);
    // Open a new streamer
    uint32_t interval = 500000; // ns
    zet_metric_streamer_desc_t streamer_desc = {ZET_STRUCTURE_TYPE_METRIC_STREAMER_DESC, nullptr, max_metric_samples, interval};
    status = zetMetricStreamerOpen(desc->context_, desc->device_, desc->metric_group_, &streamer_desc, nullptr, &streamer);
    if (status != ZE_RESULT_SUCCESS) {
        std::cerr << "[ERROR] Failed to open metric streamer (" << status << "). The sampling interval might be too small." << std::endl;
        streamer = nullptr;
        return;
    }
    if (streamer_desc.notifyEveryNReports > max_metric_samples) {
        max_metric_samples = streamer_desc.notifyEveryNReports;
    }
}
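For reference, the close-and-reopen above could in principle be replaced by draining the streamer with zetMetricStreamerReadData itself, using its two-call convention: query the pending size with a null buffer, then read and discard the reports. The sketch below mocks that convention with a plain byte queue so the drain logic is visible without a GPU; `MockStreamer`, `readData`, and `drainStreamer` are illustrative names, not the Level Zero API (the real call also takes a streamer handle and a maxReportCount, and returns a ze_result_t).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Mock stand-in for zetMetricStreamerReadData's two-call convention:
// pass nullptr for the data buffer to query the pending size, then call
// again with a buffer of at least that size to consume the reports.
struct MockStreamer {
  std::deque<uint8_t> pending;  // bytes buffered by the "driver"

  int readData(size_t* rawDataSize, uint8_t* rawData) {
    if (rawData == nullptr) {           // first call: size query only
      *rawDataSize = pending.size();
      return 0;
    }
    size_t n = std::min(*rawDataSize, pending.size());
    for (size_t i = 0; i < n; ++i) { rawData[i] = pending[i]; }
    pending.erase(pending.begin(), pending.begin() + n);
    *rawDataSize = n;                   // bytes actually copied out
    return 0;
  }
};

// Drain everything the streamer has buffered, discarding the bytes.
// With the real API, this pattern would replace the close/reopen in
// zeroFlushStreamerBuffer: read (and ignore) all pending reports so the
// next kernel starts from an empty buffer.
void drainStreamer(MockStreamer& s) {
  size_t size = 0;
  s.readData(&size, nullptr);           // 1st call: how much is pending?
  if (size == 0) return;
  std::vector<uint8_t> scratch(size);
  s.readData(&size, scratch.data());    // 2nd call: consume and discard
}
```

Whether this is cheaper than close/reopen depends on driver behavior, which is part of what this issue asks about.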

Current Implementation Details

To provide more context, here's the main profiling loop where zeroFlushStreamerBuffer is used:

void 
ZeMetricProfiler::RunProfilingLoop
(
  ZeDeviceDescriptor* desc,
  zet_metric_streamer_handle_t& streamer
)
{
  std::vector<uint8_t> raw_metrics(MAX_METRIC_BUFFER + 512);
  desc->profiling_state_.store(PROFILER_ENABLED, std::memory_order_release);
  ze_result_t status;
  
  while (desc->profiling_state_.load(std::memory_order_acquire) != PROFILER_DISABLED) {
    // Wait for the kernel to start running
    while (true) {
      status = zeEventHostSynchronize(desc->serial_kernel_start_, 50000000);
      if (status == ZE_RESULT_SUCCESS) {
        break;
      }
      // Handle case where kernel execution is extremely short:
      // In such cases, the kernel might finish before zeEventHostSynchronize can detect the start event.
      // Without this check, a deadlock could occur:
      // - The Profiling thread would keep waiting for the start event (which has already been reset).
      // - The App thread would be waiting for the Profiling thread to complete data processing.
      // kernel_started_ allows Profiling thread to proceed, avoiding deadlock.
      if (desc->kernel_started_.load(std::memory_order_acquire)) {
        break;
      }
      if (desc->profiling_state_.load(std::memory_order_acquire) == PROFILER_DISABLED) {
        return;
      }
    }
    // Kernel is running, enter sampling loop
    while (true) {
      // Update correlation ID
      gpu_correlation_channel_receive(1, UpdateCorrelationID, desc);
      // Wait for the kernel-end event; the 5000 ns timeout doubles as the sampling interval
      status = zeEventHostSynchronize(desc->serial_kernel_end_, 5000);
      if (status == ZE_RESULT_SUCCESS) {
        break;
      }
      CollectAndProcessMetrics(desc, streamer, raw_metrics);
    }
    // Kernel has finished, perform final sampling and cleanup
    CollectAndProcessMetrics(desc, streamer, raw_metrics);
    // FIXME(Yuning): may need a better way to flush the streamer buffer without repeatedly closing and reopening the streamer
    zeroFlushStreamerBuffer(streamer, desc);
    desc->running_kernel_ = nullptr;
    desc->kernel_started_.store(false, std::memory_order_release);
    
    // Notify the app thread that data processing is complete
    status = zeEventHostSignal(desc->serial_data_ready_);
    level0_check_result(status, __LINE__);
  }
}

This code demonstrates how we currently handle metric collection for each kernel execution, including the use of zeroFlushStreamerBuffer to attempt non-overlapping data collection between kernels.
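The short-kernel race described in the loop's comments can be reduced to a small two-thread sketch. This is a mock, not Level Zero code: `std::atomic<bool>` flags stand in for the serial_kernel_start_/serial_kernel_end_ events, and the names `start_event`, `end_event`, `kernel_started`, and `samples` are illustrative only.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> start_event{false};    // stands in for serial_kernel_start_
std::atomic<bool> end_event{false};      // stands in for serial_kernel_end_
std::atomic<bool> kernel_started{false}; // fallback flag for very short kernels
std::atomic<int>  samples{0};

void profilingThread() {
  // Wait for the kernel to start. Polling with a sleep stands in for
  // zeEventHostSynchronize with a timeout. Without the kernel_started
  // check, a kernel that finished (and had its start event reset) before
  // this wait began would leave the profiler blocked here forever.
  while (!start_event.load(std::memory_order_acquire) &&
         !kernel_started.load(std::memory_order_acquire)) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  // Sample until the kernel ends, then take one final sample, mirroring
  // the final CollectAndProcessMetrics call in the loop above.
  while (!end_event.load(std::memory_order_acquire)) {
    samples.fetch_add(1);
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  samples.fetch_add(1);
}
```

Even if the start event is never observed set, setting `kernel_started` lets the profiling thread fall through to the sampling loop and eventually signal completion, which is exactly the deadlock the comment in RunProfilingLoop guards against.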

Questions

  1. Data Overlap: When collecting data for a kernel after its execution, is there a possibility that the data from zetMetricStreamerReadData includes stall samples from the previous kernel? My goal is to obtain non-overlapping stall samples for each kernel to enable fine-grained performance analysis.

  2. API Enhancement: If my understanding is correct, would it be possible to provide a Level Zero API for flushing the metric streamer, such as zetMetricStreamerFlushData? This could be more efficient than the current close-and-reopen approach in zeroFlushStreamerBuffer.

  3. Clarification: If my understanding is incorrect, could you please confirm that zetMetricStreamerReadData consumes the reports it returns, so that successive calls never return overlapping data? That would let me remove zeroFlushStreamerBuffer entirely and avoid the cost of closing and reopening the streamer.
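Independently of how the API behaves, one way to make per-kernel attribution robust is to filter decoded reports by timestamp against each kernel's start/end window, assuming the decoded reports include a timestamp metric (as many metric groups do; this would need to be confirmed for the stall sampling group). A minimal sketch, where `Report` is a hypothetical decoded-report struct and the window bounds would come from the kernel's own timestamp query:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical decoded report, e.g. as produced from the raw stream by
// zetMetricGroupCalculateMetricValues. Field names are illustrative.
struct Report { uint64_t timestamp; uint64_t stalls; };

// Keep only reports whose timestamp falls inside [start, end]. This
// attributes samples to one kernel even if the streamer buffer still
// holds reports carried over from the previous kernel.
std::vector<Report> filterToKernelWindow(const std::vector<Report>& all,
                                         uint64_t start, uint64_t end) {
  std::vector<Report> out;
  for (const Report& r : all) {
    if (r.timestamp >= start && r.timestamp <= end) out.push_back(r);
  }
  return out;
}
```

With such filtering, occasional carried-over samples become harmless, and the flush would be an optimization rather than a correctness requirement.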

Request

I would greatly appreciate clarification on the behavior of zetMetricStreamerReadData in this context and any guidance on the best practices for ensuring non-overlapping metric collection between kernel executions.

yuninxia · Oct 10 '24 10:10