pti-gpu
pti-gpu copied to clipboard
[zetracer] `zeCommandListAppendEventReset` bugs
Hi Anton,
@Kerilk and I are also developing a L0 tracer (https://github.com/argonne-lcf/THAPI). Recently we found that we don't handle the use case when a user resets an event with zeCommandListAppendEventReset
. It looks like your zetracer
has the same limitation (see the reproducer below).
In our tool supporting such use case will be expensive with the current L0 spec. We asked many times for L0 to add native callbacks (also on event change). This should greatly reduce the implementation complexity and overhead of tracing.
For now, our feedback didn't get a lot of traction. Maybe if two independent teams implementing tracing in two different source codes need callbacks, L0 will be more inclined to add callbacks...
So the question is, do you think having callbacks will help onetrace
?
Reproducer
ze.cpp
#include <fstream>
#include <iostream>
#include <level_zero/ze_api.h>
#include <limits>
#include <memory>
#define zeCall(myZeCall) \
do { \
if (myZeCall != ZE_RESULT_SUCCESS) { \
std::cout << "Error at " << #myZeCall << ": " << __FUNCTION__ << ": " << std::dec << __LINE__ << "\n"; \
std::terminate(); \
} \
} while (0);
void foo(ze_context_handle_t context, ze_device_handle_t device, ze_kernel_handle_t kernel1,ze_kernel_handle_t kernel2) {
// Some magic number
const int computeOrdinal = 0;
ze_command_queue_desc_t cmdQueueDesc = {};
cmdQueueDesc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;
cmdQueueDesc.ordinal = computeOrdinal;
cmdQueueDesc.index = 0;
ze_command_queue_handle_t queue;
zeCall(zeCommandQueueCreate(context, device, &cmdQueueDesc, &queue));
ze_command_list_desc_t listDesc = {};
listDesc.commandQueueGroupOrdinal = computeOrdinal;
ze_command_list_handle_t list;
zeCall(zeCommandListCreate(context, device, &listDesc, &list));
ze_group_count_t threadGroupCount = {};
threadGroupCount.groupCountX = 1u;
threadGroupCount.groupCountY = 1u;
threadGroupCount.groupCountZ = 1u;
// Create event pool
ze_event_pool_desc_t eventPoolDesc = {
ZE_STRUCTURE_TYPE_EVENT_POOL_DESC, NULL, ZE_EVENT_POOL_FLAG_KERNEL_TIMESTAMP,
1 // One event on the pool
};
ze_event_pool_handle_t hEventPool;
zeCall(zeEventPoolCreate(context, &eventPoolDesc, 1, &device, &hEventPool));
ze_event_desc_t eventDesc = {
ZE_STRUCTURE_TYPE_EVENT_DESC, NULL,
0, // index
0, // no memory/cache coherency required on signal
0 // No need for memory/cache coherency on wait
};
ze_event_handle_t hEvent;
zeCall(zeEventCreate(hEventPool, &eventDesc, &hEvent));
#ifdef K1
std::cout<<"Sumiting K1" << std::endl;
zeCall(zeCommandListAppendLaunchKernel(list, kernel1, &threadGroupCount, hEvent, 0, nullptr));
#endif
zeCall(zeCommandListAppendBarrier(list, nullptr, 0, nullptr));
zeCall(zeCommandListAppendEventReset(list, hEvent));
zeCall(zeCommandListAppendBarrier(list, nullptr, 0, nullptr));
#ifdef K2
std::cout<<"Sumiting K2" << std::endl;
zeCall(zeCommandListAppendLaunchKernel(list, kernel2, &threadGroupCount, hEvent, 0, nullptr));
#endif
zeCall(zeCommandListClose(list));
zeCall(zeCommandQueueExecuteCommandLists(queue, 1, &list, nullptr));
zeCall(zeCommandQueueSynchronize(queue, std::numeric_limits<uint64_t>::max()));
}
int main(int argc, char *argv[]) {
zeCall(zeInit(ZE_INIT_FLAG_GPU_ONLY));
uint32_t driverCount = 0;
zeCall(zeDriverGet(&driverCount, nullptr));
ze_driver_handle_t driverHandle;
zeCall(zeDriverGet(&driverCount, &driverHandle));
ze_context_handle_t context;
ze_context_desc_t contextDesc = {};
zeCall(zeContextCreate(driverHandle, &contextDesc, &context));
// Get the root devices
uint32_t deviceCount = 0;
zeCall(zeDeviceGet(driverHandle, &deviceCount, nullptr));
if (deviceCount == 0) {
std::cout << "No devices found \n";
std::terminate();
}
ze_device_handle_t device;
deviceCount = 1;
zeCall(zeDeviceGet(driverHandle, &deviceCount, &device));
// Create kernel
std::string kernelFile = "kernel_XE_HP_COREcore.spv";
ze_module_format_t kernelFormat = ZE_MODULE_FORMAT_IL_SPIRV;
std::ifstream file(kernelFile, std::ios_base::in | std::ios_base::binary);
if (false == file.good()) {
std::cout << kernelFile << " file not found\n";
std::terminate();
}
uint32_t spirvSize = 0;
file.seekg(0, file.end);
spirvSize = static_cast<size_t>(file.tellg());
file.seekg(0, file.beg);
auto spirvModule = std::make_unique<char[]>(spirvSize);
file.read(spirvModule.get(), spirvSize);
ze_module_handle_t module;
ze_module_desc_t moduleDesc = {};
moduleDesc.format = kernelFormat;
moduleDesc.pInputModule = reinterpret_cast<const uint8_t *>(spirvModule.get());
moduleDesc.inputSize = spirvSize;
zeCall(zeModuleCreate(context, device, &moduleDesc, &module, nullptr));
ze_kernel_handle_t kernel1;
ze_kernel_desc_t kernelDesc1 = {};
kernelDesc1.pKernelName = "k1_noop";
zeCall(zeKernelCreate(module, &kernelDesc1, &kernel1));
zeCall(zeKernelSetGroupSize(kernel1, 256, 1, 1));
ze_kernel_handle_t kernel2;
ze_kernel_desc_t kernelDesc2 = {};
kernelDesc2.pKernelName = "k2_sleep";
zeCall(zeKernelCreate(module, &kernelDesc2, &kernel2));
zeCall(zeKernelSetGroupSize(kernel2, 256, 1, 1));
void *ptr1 = nullptr;
ze_device_mem_alloc_desc_t deviceDesc1 = {};
ze_host_mem_alloc_desc_t hostDesc1 = {};
zeCall(zeMemAllocShared(context, &deviceDesc1, &hostDesc1, 64, 0, device, &ptr1));
void *ptr2 = nullptr;
ze_device_mem_alloc_desc_t deviceDesc2 = {};
ze_host_mem_alloc_desc_t hostDesc2 = {};
zeCall(zeMemAllocShared(context, &deviceDesc2, &hostDesc2, 64, 0, device, &ptr2));
zeCall(zeKernelSetArgumentValue(kernel1, 0, 8, &ptr1));
zeCall(zeKernelSetArgumentValue(kernel2, 0, 8, &ptr2));
foo(context, device, kernel1, kernel2);
return 0;
}
kernel.cl
#define MAD_4(x, y) x = mad(y, x, y); y = mad(x, y, x); x = mad(y, x, y); y = mad(x, y, x);
#define MAD_16(x, y) MAD_4(x, y); MAD_4(x, y); MAD_4(x, y); MAD_4(x, y);
#define MAD_64(x, y) MAD_16(x, y); MAD_16(x, y); MAD_16(x, y); MAD_16(x, y);
__kernel void k1_noop(__global double *ptr) {
ptr[0] = 9;
}
__kernel void k2_sleep(__global double *ptr) {
double x = (double)get_local_id(1);
double y = (double)get_local_id(0);
for(int i=0; i<1024*64; i++)
{
MAD_64(x, y);
}
ptr[0] = y;
}
Compile
ocloc compile -file kernel.cl -device $FOO
icpx -lze_loader ze.cpp -Wall -DK1-o k1
icpx -lze_loader ze.cpp -Wall -DK2 -o k2
icpx -lze_loader ze.cpp -Wall -DK1 -DK2 -o k1k2
What we should expect?
We should expect k1
to show the kernel execution. But we don't see it
onetrace ./k1
=== API Timing Results: ===
Total Execution Time (ns): 186368143
Total API Time for L0 backend (ns): 185654838
== L0 Backend: ==
Function, Calls, Time (ns), Time (%), Average (ns), Min (ns), Max (ns)
zeModuleCreate, 1, 181219427, 97.61, 181219427, 181219427, 181219427
zeCommandListCreate, 1, 1629179, 0.88, 1629179, 1629179, 1629179
zeCommandQueueCreate, 1, 845735, 0.46, 845735, 845735, 845735
zeMemAllocShared, 2, 830426, 0.45, 415213, 178284, 652142
zeCommandQueueExecuteCommandLists, 1, 558334, 0.30, 558334, 558334, 558334
zeCommandQueueSynchronize, 1, 309534, 0.17, 309534, 309534, 309534
zeEventPoolCreate, 1, 177962, 0.10, 177962, 177962, 177962
zeEventCreate, 1, 53095, 0.03, 53095, 53095, 53095
zeCommandListAppendEventReset, 1, 7936, 0.00, 7936, 7936, 7936
zeKernelCreate, 2, 6768, 0.00, 3384, 946, 5822
zeCommandListAppendBarrier, 2, 5811, 0.00, 2905, 1573, 4238
zeKernelSetArgumentValue, 2, 5096, 0.00, 2548, 1103, 3993
zeKernelSetGroupSize, 2, 2299, 0.00, 1149, 224, 2075
zeContextCreate, 1, 1710, 0.00, 1710, 1710, 1710
zeCommandListClose, 1, 675, 0.00, 675, 675, 675
zeDeviceGet, 2, 374, 0.00, 187, 132, 242
zeDriverGet, 2, 275, 0.00, 137, 49, 226
zeInit, 1, 202, 0.00, 202, 202, 202
=== Device Timing Results: ===
Total Execution Time (ns): 186368143
Total Device Time for L0 backend (ns): 3680
== L0 Backend: ==
Kernel, Calls, Time (ns), Time (%), Average (ns), Min (ns), Max (ns)
zeCommandListAppendBarrier, 2, 3680, 100.00, 1840, 1280, 2400
And if we run k1
and k2
, we have timing for each kernel but they correspond only to k2
onetrace ./k1k2
=== API Timing Results: ===
Total Execution Time (ns): 253710707
Total API Time for L0 backend (ns): 252516062
== L0 Backend: ==
Function, Calls, Time (ns), Time (%), Average (ns), Min (ns), Max (ns)
zeModuleCreate, 1, 183044952, 72.49, 183044952, 183044952, 183044952
zeCommandQueueSynchronize, 1, 59487507, 23.56, 59487507, 59487507, 59487507
zeCommandQueueExecuteCommandLists, 1, 6742682, 2.67, 6742682, 6742682, 6742682
zeCommandListCreate, 1, 1637588, 0.65, 1637588, 1637588, 1637588
zeMemAllocShared, 2, 821432, 0.33, 410716, 291064, 530368
zeCommandQueueCreate, 1, 670053, 0.27, 670053, 670053, 670053
zeEventCreate, 1, 51961, 0.02, 51961, 51961, 51961
zeCommandListAppendLaunchKernel, 2, 16393, 0.01, 8196, 3280, 13113
zeEventPoolCreate, 1, 13748, 0.01, 13748, 13748, 13748
zeCommandListAppendEventReset, 1, 7173, 0.00, 7173, 7173, 7173
zeKernelCreate, 2, 6697, 0.00, 3348, 948, 5749
zeKernelSetArgumentValue, 2, 5205, 0.00, 2602, 1089, 4116
zeCommandListAppendBarrier, 2, 4852, 0.00, 2426, 1297, 3555
zeKernelSetGroupSize, 2, 2327, 0.00, 1163, 252, 2075
zeContextCreate, 1, 2051, 0.00, 2051, 2051, 2051
zeCommandListClose, 1, 617, 0.00, 617, 617, 617
zeDeviceGet, 2, 325, 0.00, 162, 113, 212
zeDriverGet, 2, 294, 0.00, 147, 44, 250
zeInit, 1, 205, 0.00, 205, 205, 205
=== Device Timing Results: ===
Total Execution Time (ns): 253710707
Total Device Time for L0 backend (ns): 8640
== L0 Backend: ==
Kernel, Calls, Time (ns), Time (%), Average (ns), Min (ns), Max (ns)
sleep, 1, 3200, 37.04, 3200, 3200, 3200
noop, 1, 3200, 37.04, 3200, 3200, 3200
zeCommandListAppendBarrier, 2, 2240, 25.93, 1120, 1120, 1120
Hope this help, Don't hesitate if you have any feedback.
Hi @TApplencourt, thanks for your feedback! Please give me some time to dive deeply into this - first of all I'd like to understand how general is this problem. But even right now I agree that current kernel tracing for Level Zero looks over-complicated, and we probably need to simplify things somehow.
Hi @TApplencourt,
The problem you've reported is known for us, and indeed we don't support such a case in our tools for now. The reason - we are not aware of any customer application that uses such an approach. Do you face with this case in real life, or it's just a reproducer?
Of cause the lack of support right now doesn't mean we don't plan to add it. To deal with such a case zeCommandListAppendQueryKernelTimestamps function should be used. Note also, that having callback in Level Zero does not resolve this issue by itself since it's more about current Level Zero design. But yes, it can make customers lives easier by moving all the problems inside Level Zero rather than having them outside.
Currently we are thinking about an approach similar to CUPTI Activity, where one can be subscribed to some event (e.g. kernel invocation) to be notified asynchronously (with a callback) if this event happened. Do you believe this is something you would prefer to use?
Do you face with this case in real life, or it's just a reproducer?
Just a reproducer (for now :D)
To deal with such a case zeCommandListAppendQueryKernelTimestamps function should be used.
Oh yes, this required a little infrastructure (allocating device memory, handling offset, ...) but totally feasible indeed!
Currently we are thinking about an approach similar to CUPTI Activity, where one can be subscribed to some event (e.g. kernel invocation) to be notified asynchronously (with a callback) if this event happened. Do you believe this is something you would prefer to use?
Something around those lines sounds good! But I'm not by any means an expert, I will let @Kerilk write a more insightful reply.