compute-runtime
zeCommandQueueCreate spontaneously segfaults when creating one queue per thread
This is a cut-down case from https://github.com/CHIP-SPV/chip-spv/issues/146. When zeCommandQueueCreate is called from multiple threads, it spontaneously segfaults.
The call stack trace is:
Thread 101 "a.out" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff88404700 (LWP 7105)]
0x00007ffff741f5c4 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
(gdb) bt
#0 0x00007ffff741f5c4 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#1 0x00007ffff7153958 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#2 0x00007ffff714f3c5 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#3 0x00007ffff714f523 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#4 0x00007ffff714f69a in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#5 0x00007ffff71567be in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#6 0x00007ffff7ef56fd in zeCommandQueueCreate () from /usr/local/lib/libze_loader.so.1
#7 0x0000555555555479 in QueuePerThread () at test_queue.cc:28
#8 0x00005555555575aa in std::__invoke_impl<void, void (*)()> (__f=@0x55555585b308: 0x55555555543a <QueuePerThread()>) at /usr/include/c++/9/bits/invoke.h:60
#9 0x0000555555557542 in std::__invoke<void (*)()> (__fn=@0x55555585b308: 0x55555555543a <QueuePerThread()>) at /usr/include/c++/9/bits/invoke.h:95
#10 0x00005555555574d4 in std::thread::_Invoker<std::tuple<void (*)()> >::_M_invoke<0ul> (this=0x55555585b308) at /usr/include/c++/9/thread:244
#11 0x0000555555557491 in std::thread::_Invoker<std::tuple<void (*)()> >::operator() (this=0x55555585b308) at /usr/include/c++/9/thread:251
#12 0x0000555555557462 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)()> > >::_M_run (this=0x55555585b300) at /usr/include/c++/9/thread:195
#13 0x00007ffff7d9fde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007ffff7eb3609 in start_thread (arg=
Here is a reproducer:
#include <cassert>
#include <climits>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <level_zero/ze_api.h>
#include <iostream>
#include <vector>
#include <limits>
#include <thread>

#define check(ans) \
    { do_check((ans), __FILE__, __LINE__); }

void do_check(ze_result_t code, const char *file, int line) {
    if (code != ZE_RESULT_SUCCESS) {
        fprintf(stderr, "Failed: %d at %s %d\n", code, file, line);
        exit(1);
    }
}

ze_context_handle_t context;
ze_device_handle_t device;
ze_command_queue_desc_t cmdQueueDesc;

static void QueuePerThread()
{
    ze_command_queue_handle_t command_queue;
    check(zeCommandQueueCreate(context, device, &cmdQueueDesc, &command_queue));
}

int main()
{
    // Initialize driver
    check(zeInit(ZE_INIT_FLAG_GPU_ONLY));

    // Retrieve driver
    uint32_t driverCount = 0;
    check(zeDriverGet(&driverCount, nullptr));
    ze_driver_handle_t driverHandle;
    check(zeDriverGet(&driverCount, &driverHandle));
    ze_context_desc_t contextDesc = {};
    check(zeContextCreate(driverHandle, &contextDesc, &context));

    // Retrieve device
    uint32_t deviceCount = 0;
    check(zeDeviceGet(driverHandle, &deviceCount, nullptr));
    // ze_device_handle_t device;
    deviceCount = 1;
    check(zeDeviceGet(driverHandle, &deviceCount, &device));

    // Print some properties
    ze_device_properties_t deviceProperties = {};
    check(zeDeviceGetProperties(device, &deviceProperties));

    // Create command queue
    uint32_t numQueueGroups = 0;
    check(zeDeviceGetCommandQueueGroupProperties(device, &numQueueGroups, nullptr));
    if (numQueueGroups == 0)
    {
        return 1;
    }
    std::vector<ze_command_queue_group_properties_t> queueProperties(numQueueGroups);
    check(zeDeviceGetCommandQueueGroupProperties(device, &numQueueGroups,
                                                 queueProperties.data()));
    ze_command_queue_handle_t command_queue;
    cmdQueueDesc = {};
    for (uint32_t i = 0; i < numQueueGroups; i++)
    {
        if (queueProperties[i].flags & ZE_COMMAND_QUEUE_GROUP_PROPERTY_FLAG_COMPUTE)
        {
            cmdQueueDesc.ordinal = i;
        }
    }
    cmdQueueDesc.index = 0;
    cmdQueueDesc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;

    // 1) Create queue with the main thread
    QueuePerThread();

    // 2) Create queue with a different thread
    constexpr unsigned int MAX_THREAD_CNT = 100;
    std::vector<std::thread> threads(MAX_THREAD_CNT);
    for (auto &th : threads) {
        th = std::thread(QueuePerThread);
    }
    for (auto& th : threads) {
        th.detach();
    }
}
To compile, use "g++ -O0 -g test_queue.cc -lze_loader -lpthread" or clang++.
Thanks. Taking a look.
Any updates on this? @pengtu @JablonskiMateusz
@pvelesko: The bug was rejected by the driver team.
Quote of the analysis below:
This is a problem in your application.
You have the threads spawning here:
    // 1) Create queue with the main thread
    QueuePerThread();
    // 2) Create queue with a different thread
    constexpr unsigned int MAX_THREAD_CNT = 100;
    std::vector<std::thread> threads(MAX_THREAD_CNT);
    printf("spawning threads\n");
    for (auto &th : threads) {
        th = std::thread(QueuePerThread);
    }
Then, you are attempting to "detach" from the threads and have the main thread exit, i.e.:
    for (auto& th : threads) {
        th.detach();
    }
    printf("done\n");
This is not a legal usage of the resources, because the L0 driver and device resources are "shared" between the threads, and they are allocated at zeInit (which only occurs once per process, not once per thread).
What is occurring is that the main program is exiting before all the threads finished, with the device and driver resources freed while your threads were still running.
The correct way to write this program is to change to the following:
    // 1) Create queue with the main thread
    QueuePerThread();
    // 2) Create queue with a different thread
    constexpr unsigned int MAX_THREAD_CNT = 100;
    std::vector<std::thread> threads(MAX_THREAD_CNT);
    printf("spawning threads\n");
    for (auto &th : threads) {
        th = std::thread(QueuePerThread);
    }
    printf("finished with spawning threads\n");
    for (auto& th : threads)
        th.join();
    printf("finished joining the threads\n");
    printf("done\n");
The main program must not start releasing the resources for the devices and driver before the threads have finished; otherwise this segfault is expected.
Basically, the L0 device resources were freed, which left a thread attempting to create a queue without any data structures for the device being available, i.e.:
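To make the lifetime rule concrete without needing a GPU, here is a minimal sketch that models it in plain C++. `SharedDriverState`, `CreateQueueStandIn`, and `RunJoinedWorkers` are hypothetical names invented for this illustration; they are not part of Level Zero, and the `alive` flag only stands in for "the driver and device resources still exist".

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for the process-wide state that zeInit sets up.
struct SharedDriverState {
    std::atomic<bool> alive{true};
    std::atomic<int> queuesCreated{0};
};

// Hypothetical stand-in for zeCommandQueueCreate: it only succeeds while
// the shared state has not been torn down yet.
inline bool CreateQueueStandIn(SharedDriverState &state) {
    if (!state.alive.load()) {
        return false;
    }
    state.queuesCreated.fetch_add(1);
    return true;
}

// Spawns threadCount workers, joins all of them, and only then tears the
// shared state down. Returns the number of successful "queue creations".
inline int RunJoinedWorkers(unsigned threadCount) {
    SharedDriverState state;
    std::atomic<int> failures{0};
    std::vector<std::thread> threads;
    threads.reserve(threadCount);
    for (unsigned i = 0; i < threadCount; ++i) {
        threads.emplace_back([&state, &failures] {
            if (!CreateQueueStandIn(state)) {
                failures.fetch_add(1);
            }
        });
    }
    // Every worker has finished once this loop completes.
    for (auto &th : threads) {
        th.join();
    }
    // Teardown happens strictly after the joins, so no worker can observe it.
    state.alive.store(false);
    assert(failures.load() == 0);
    return state.queuesCreated.load();
}
```

Calling `RunJoinedWorkers(100)` mirrors the 100 threads in the reproducer. If the joins were replaced by `detach()`, the teardown could run while workers are still inside `CreateQueueStandIn`, which is the situation the analysis describes.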
    process locked thread id read, 0x5575b7043e20
    140001723520768
    processLocked function
    freeing allocation for reuse   <- The L0 device was freed from memory and thus the allocation list for reuse was removed.
    140000951785216
    140000926607104
    processLocked function
    process locked thread id read, 0x5575b6fca3a8
    140003760049024
    process locked thread id read, 0x5575b7043e20
    140001287329536140000012261120
    processLocked function
    process locked thread id read, 0x5575b7043e20
    140000020653824
    processLocked function
    process locked thread id read, (nil)   <- This "nil" should have not occurred, this means that the thread was still trying to allocate when the process exited removing the device resources.
    140000951785216
    processLocked function
    process locked thread id read, 0x5575b7043e20
    140000934999808
    ./test_queue.run: line 1: 3411044 Segmentation fault (core dumped) ./test_queue
    failed program
Please fix your test program. This is not a bug, but a misunderstanding of the functionality.
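For anyone hitting the same problem: one way to make it harder to forget the join is a small RAII wrapper that joins its threads in the destructor. This is only a sketch. `JoiningThreads` and `CountWithGuard` are hypothetical names made up for this example, and the workers just bump a counter instead of calling into Level Zero.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: owns worker threads and joins them on destruction,
// so leaving the scope (normally or via an early return) waits for them.
class JoiningThreads {
public:
    JoiningThreads() = default;
    JoiningThreads(const JoiningThreads &) = delete;
    JoiningThreads &operator=(const JoiningThreads &) = delete;

    ~JoiningThreads() {
        for (auto &th : threads_) {
            if (th.joinable()) {
                th.join();
            }
        }
    }

    template <typename Fn>
    void Spawn(Fn &&fn) {
        threads_.emplace_back(std::forward<Fn>(fn));
    }

private:
    std::vector<std::thread> threads_;
};

// Runs threadCount workers under the guard and returns how many completed.
inline int CountWithGuard(unsigned threadCount) {
    // Declared before the guard, so it outlives every worker thread.
    std::atomic<int> done{0};
    {
        JoiningThreads workers;
        for (unsigned i = 0; i < threadCount; ++i) {
            workers.Spawn([&done] { done.fetch_add(1); });
        }
    }  // ~JoiningThreads joins all workers here.
    return done.load();
}
```

In the reproducer, the guard would be declared in `main` after the context and device are set up, so it is destroyed (and the threads joined) before `main` returns. Note that this does not cover the `exit(1)` path in `do_check`, since `exit` does not unwind the stack. With C++20, `std::jthread` gives the same join-on-destruction behavior out of the box.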