libedgetpu icon indicating copy to clipboard operation
libedgetpu copied to clipboard

Possible deadlock

Open YijinLiu opened this issue 4 years ago • 23 comments

My program is experiencing deadlock in libedgetpu occasionally. Following is the stack trace.

Thread 58 (Thread 0x7fff7d7fa700 (LWP 24554)):
#0  0x00007ffff75ebad3 in futex_wait_cancelable (private=, expected=0, futex_word=0x7fff7d7f9170) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fff7d7f9120, cond=0x7fff7d7f9148) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7fff7d7f9148, mutex=0x7fff7d7f9120) at pthread_cond_wait.c:655
#3  0x00007ffff78b48bc in std::condition_variable::wait(std::unique_lock<:mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffea8f9a0e in std::condition_variable::wait<:darwinn::blockingcounter::wait> >(std::unique_lock<:mutex> &, platforms::darwinn::BlockingCounter::) (this=0x7fff7d7f9148, 
    __lock=..., __p=...) at /usr/include/c++/5/condition_variable:98
#5  0x00007fffea8f99c4 in platforms::darwinn::BlockingCounter::Wait (this=0x7fff7d7f9120) at port/blocking_counter.cc:42
#6  0x00007fffea8ce763 in platforms::darwinn::driver::Driver::Execute (this=0xc1a6700, request=std::shared_ptr<:darwinn::api::request> (empty) = {...}) at driver/driver.cc:593
#7  0x00007fffea8b9f0a in platforms::darwinn::tflite::EdgeTpuDriverWrapper::InvokeExecutable (this=0xbdf57a0, context=0xc087628, node=0xc13f088) at tflite/edgetpu_context_direct.cc:240
#8  0x00007fffea8ac653 in platforms::darwinn::tflite::(anonymous namespace)::CustomOpInvoke (context=0xc087628, node=0xc13f088) at tflite/custom_op_direct.cc:122
#9  0x00000000068df998 in tflite::impl::Subgraph::Invoke() ()
#10 0x0000000006830e87 in tflite::impl::Interpreter::Invoke() ()

Thread 32 (Thread 0x7fffb4cf5700 (LWP 24515)):
#0  0x00007ffff75ebad3 in futex_wait_cancelable (private=, expected=0, futex_word=0xc1a6a14) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0xc19a4b0, cond=0xc1a69e8) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0xc1a69e8, mutex=0xc19a4b0) at pthread_cond_wait.c:655
#3  0x00007ffff78b48bc in std::condition_variable::wait(std::unique_lock<:mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffea87517f in std::_V2::condition_variable_any::wait<:darwinn::driver:: namespace> (this=0xc1a69e8, __lock=...) at /usr/include/c++/5/condition_variable:250
#5  0x00007fffea871789 in platforms::darwinn::driver::UsbDriver::WorkerThreadFunc (this=0xc1a6700) at driver/usb/usb_driver.cc:1322
#6  0x00007fffea8728c7 in platforms::darwinn::driver::UsbDriver::::operator()(void) const (__closure=0xc1cb358) at driver/usb/usb_driver.cc:1604
#7  0x00007fffea878834 in std::_Bind_simple<:darwinn::driver::usbdriver::doopen>()>::_M_invoke(std::_Index_tuple) (this=0xc1cb358) at /usr/include/c++/5/functional:1531
#8  0x00007fffea8787b5 in std::_Bind_simple<:darwinn::driver::usbdriver::doopen>()>::operator()(void) (this=0xc1cb358) at /usr/include/c++/5/functional:1520
#9  0x00007fffea87875c in std::thread::_Impl<:_bind_simple>()> >::_M_run(void) (this=0xc1cb340) at /usr/include/c++/5/thread:115
#10 0x00007ffff78baa50 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007ffff75e56db in start_thread (arg=0x7fffb4cf5700) at pthread_create.c:463
#12 0x00007fffea0d171f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I use one thread to run the inference. This only happens about once per week. I haven't find a way to repro it so far.

YijinLiu avatar Dec 16 '20 09:12 YijinLiu

Hello, can you describe what exactly happens and how is your program designed?

Namburger avatar Dec 16 '20 12:12 Namburger

We run a SSD model using edgetpu. A single thread receives images and run the model. Once a while, the thread stuck forever. The initial post has the stack trace. Unfortunately, I have no luck finding a way to repro it so far. Let me know if you need more information.

YijinLiu avatar Dec 16 '20 21:12 YijinLiu

@YijinLiu I see, may I ask if this is over usb3 or usb2?

Namburger avatar Dec 21 '20 16:12 Namburger

It's usb3. Thanks.

YijinLiu avatar Dec 22 '20 01:12 YijinLiu

Feel free to reopen if this is still an issue.

manoj7410 avatar Jul 05 '21 09:07 manoj7410

Yes, it is. This is awkward.. I answered one of your questions and wait for several months. Only to find you closed it w/o giving any reason. I hope this is not how you guys usually interact with developers.

Anyway, I have bit more information:

I turned on logging and got a lot of errors like:

I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/usb_driver.cc:468] USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/usb_driver.cc:468] USB transfer error 2 [LibUsbDataInCallback]

And the thread was stuck here:

#0  futex_wait_cancelable (private=, expected=0, futex_word=0xc95660c) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0xcf13f90, cond=0xc9565e0) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0xc9565e0, mutex=0xcf13f90) at pthread_cond_wait.c:638
#3  0x00007ffff7a610fc in std::condition_variable::wait(std::unique_lock<:mutex>&) () from bin/libstdc++.so.6
#4  0x00007fffeb8fca38 in std::_V2::condition_variable_any::wait<:darwinn::driver:: namespace> (this=0xc9565e0, __lock=...) at /usr/include/c++/7/condition_variable:263
#5  0x00007fffeb8f82cc in platforms::darwinn::driver::UsbDriver::WorkerThreadFunc (this=0xc956300) at driver/usb/usb_driver.cc:1320
#6  0x00007fffeb8f9a15 in platforms::darwinn::driver::UsbDriver::::operator()(void) const (__closure=0xce69628) at driver/usb/usb_driver.cc:1599
#7  0x00007fffeb8fe28d in std::__invoke_impl >(std::__invoke_other, platforms::darwinn::driver::UsbDriver:: &&) (__f=...) at /usr/include/c++/7/bits/invoke.h:60
#8  0x00007fffeb8fcbaf in std::__invoke<:darwinn::driver::usbdriver::doopen> >(platforms::darwinn::driver::UsbDriver:: &&) (__fn=...) at /usr/include/c++/7/bits/invoke.h:95
#9  0x00007fffeb8ffba6 in std::thread::_Invoker<:tuple> > >::_M_invoke(std::_Index_tuple) (this=0xce69628) at /usr/include/c++/7/thread:234
#10 0x00007fffeb8ffb77 in std::thread::_Invoker<:tuple> > >::operator()(void) (this=0xce69628) at /usr/include/c++/7/thread:243
#11 0x00007fffeb8ffb56 in std::thread::_State_impl<:thread::_invoker> > > >::_M_run(void) (this=0xce69620) at /usr/include/c++/7/thread:186
#12 0x00007ffff7a66a00 in ?? () from bin/libstdc++.so.6
#13 0x00007ffff796a609 in start_thread (arg=) at pthread_create.c:477
#14 0x00007fffeb723293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

YijinLiu avatar Oct 12 '21 06:10 YijinLiu

Well, even more weird: I am not allowed to reopen it!

YijinLiu avatar Oct 12 '21 06:10 YijinLiu

@YijinLiu Issue reopened. Can you please share any steps to reproduce ? Which platform are you working on ?

manoj7410 avatar Oct 12 '21 06:10 manoj7410

I cannot find a way to repro it. We have a Linux program running detection on security cameras all the time. It may happen once every week.

Do you think these "USB transfer error" are normal?

YijinLiu avatar Oct 12 '21 06:10 YijinLiu

@YijinLiu What is the power source that you are using to boot your machine/platform ? How many USB Accelerators are connected with the machine ? Is there any other USB device connected as well ?

manoj7410 avatar Oct 12 '21 07:10 manoj7410

It's a desktop machine, with CPU i7-10700, running ubuntu 20, with kernel 5.11. I only have one USB accelerator connected to the machine.

We have users with many other configurations that are experiencing the same issue.

YijinLiu avatar Oct 12 '21 08:10 YijinLiu

I'll try to run a demo on my machine which has Core i7. Will let you know if I am able to repro this issue at my end. Meanwhile, if you can find some specific steps to repro this then that will be really helpful.

manoj7410 avatar Oct 12 '21 10:10 manoj7410

I think I found a bug in the code. I am not sure whether it's the cause for this deadlock though. Testing it now, we will know after a week..

diff --git a/driver/usb/usb_driver.cc b/driver/usb/usb_driver.cc
index 78beda4..1dcac42 100644
--- a/driver/usb/usb_driver.cc
+++ b/driver/usb/usb_driver.cc
@@ -1299,7 +1299,7 @@ void UsbDriver::WorkerThreadFunc() {
         }
       }
 
-      reevaluation_needed = ProcessIo().ValueOrDie();
+      if (ProcessIo().ValueOrDie()) reevaluation_needed = true;
 
       // TODO: Enter kPaused state when dma_scheduler_.IsEmpty(). Any
       // new task should kick the driver back to kOpen state. Note this is in
@@ -1311,7 +1311,7 @@ void UsbDriver::WorkerThreadFunc() {
     } else {
       StdCondMutexLock queue_lock(&callback_mutex_);
 
-      Lock2 unlock_both(state_lock, queue_lock);
+      Lock2 unlock_both(queue_lock, state_lock);
 
       if (callback_queue_.empty()) {
         VLOG(10) 

YijinLiu avatar Oct 13 '21 16:10 YijinLiu

There is another issue, a likely contention:

  • The worker thread (from driver/usb/usb_driver.cc) installs these event/interrupt/bulk-in readers, which timeout every 6 seconds by default.
  • The client calls UsbDriver::DoSubmit to submit jobs to the device.
  • If the readers timeout the same time UsbDriver::ProcessIo issues bulk-out requests, it may be stuck there for ever. I changed usb_timeout_millis to 0 to disable the timeout. It has been running for 10 days without the deadlock I saw before.

YijinLiu avatar Nov 09 '21 23:11 YijinLiu

Thank you for sharing these details. What all files were modified to solve the issue ?

manoj7410 avatar Nov 29 '21 06:11 manoj7410

Besides the diffs I posted previously, here are two extra changes:

diff --git a/driver/beagle/beagle_usb_driver_provider.cc b/driver/beagle/beagle_usb_driver_provider.cc
index dcfd804..2b82a00 100644
--- a/driver/beagle/beagle_usb_driver_provider.cc
+++ b/driver/beagle/beagle_usb_driver_provider.cc
@@ -105,7 +105,7 @@ ABSL_FLAG(bool, usb_enable_bulk_descriptors_from_device,
 ABSL_FLAG(bool, usb_enable_processing_of_hints,
           GetEnv("USB_ENABLE_PROCESSING_OF_HINTS", true),
           "USB set to true for driver to proactively send data to device.");
-ABSL_FLAG(int, usb_timeout_millis, GetEnv("USB_TIMEOUT_MILLIS", 6000),
+ABSL_FLAG(int, usb_timeout_millis, GetEnv("USB_TIMEOUT_MILLIS", 0),
           "USB timeout in milliseconds");
 ABSL_FLAG(bool, usb_reset_back_to_dfu_mode,
           GetEnv("USB_RESET_BACK_TO_DFU_MODE", false),
@@ -135,7 +135,7 @@ ABSL_FLAG(bool, usb_enable_overlapping_requests,
           "the current one.");
 ABSL_FLAG(bool, usb_enable_overlapping_bulk_in_and_out,
           GetEnv("USB_ENABLE_OVERLAPPING_BULK_IN_AND_OUT", true),
-          "Allows bulk-in trasnfer to be submitted before previous bulk-out "
+          "Allows bulk-in transfer to be submitted before previous bulk-out "
           "requests complete.");
 ABSL_FLAG(bool, usb_enable_queued_bulk_in_requests,
           GetEnv("USB_ENABLE_QUEUED_BULK_IN_REQUESTS", true),
diff --git a/driver/usb/usb_driver.h b/driver/usb/usb_driver.h
index 182f27e..d3580a9 100644
--- a/driver/usb/usb_driver.h
+++ b/driver/usb/usb_driver.h
@@ -151,7 +151,7 @@ class UsbDriver : public Driver {
     bool usb_fail_if_slower_than_superspeed{false};
 
     // General timeout for USB operations in milliseconds.
-    int usb_timeout_millis{6000};
+    int usb_timeout_millis{0};
 
     // If non-empty, the firmware image to use for automatic DFU.
     // This feature is only available when a device factory has been supplied.

chao-camect avatar Nov 29 '21 07:11 chao-camect

@YijinLiu Are these changes made to the generic Linux drivers?

manoj7410 avatar Nov 30 '21 06:11 manoj7410

It's libedgetpu. You can find the filenames in the diff...

chao-camect avatar Nov 30 '21 08:11 chao-camect

@chao-camect Got it. We are still not able to reproduce this. However, can you create a patch of the complete code change and submit for review?

manoj7410 avatar Dec 02 '21 03:12 manoj7410

To repro it, you'll need to add random sleep time between two calls to USB Coral. Since the default timeout is 6s, the sleep time need to be around that to replicate. I'll find some time to create the PR.

chao-camect avatar Dec 02 '21 10:12 chao-camect

Hey, I got the same issue occasional happened on my coral mini as :

I driver/usb/usb_driver.cc:86] lock (does nothing)
I driver/usb/usb_driver.cc:1322] WorkerThreadFunc driver state change detected
I driver/usb/usb_driver.cc:1174] WorkerThreadFunc dispatching 1 callback events in worker thread
I driver/usb/usb_driver.cc:1138] bulk in 8 bytes from buffer index [5]
I driver/dma_chunker.cc:30] Completed 0 bytes; Outstanding 0 bytes; Processing next 8 bytes
I driver/usb/usb_driver.cc:916] [3-3] bulk in for 8 bytes has yielded 8 bytes from index [5]
I driver/usb/usb_driver.cc:1310] WorkerThreadFunc re-evaluation is needed
I driver/usb/usb_driver.cc:1174] WorkerThreadFunc dispatching 0 callback events in worker thread
I driver/usb/usb_driver.cc:1260] WorkerThreadFunc Installing bulk-in reader. buffer index [5]
I driver/usb/local_usb_device.cc:748] AsyncBulkInTransfer
I driver/usb/local_usb_device.cc:761] ASYNC IN 1 begin
I driver/single_queue_dma_scheduler.cc:154] Completing DMA[3]
I driver/usb/usb_driver.cc:550] IO completed
I driver/single_queue_dma_scheduler.cc:154] Completing DMA[4]
I driver/single_tpu_request.cc:410] [313] NotifyCompletion()
I driver/package_registry.cc:658] Returned instruction buffers back to executable reference
I driver/single_tpu_request.cc:478] [313] SetState old=3, new=4.
I driver/single_queue_dma_scheduler.cc:234] Request[313]: Completed
I driver/single_tpu_request.cc:96] [313] Request destroyed.
I driver/usb/usb_driver.cc:550] IO completed
I driver/usb/usb_driver.cc:1317] WorkerThreadFunc waiting on state change
I driver/usb/usb_driver.cc:91] Unlocks both mutex
I driver/usb/local_usb_device.cc:672] ASYNC IN 3 end

I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:642] UnregisterCompletedTransfer
I driver/usb/usb_driver.cc:86] lock (does nothing)
I driver/usb/usb_driver.cc:1322] WorkerThreadFunc driver state change detected
I driver/usb/usb_driver.cc:1174] WorkerThreadFunc dispatching 1 callback events in worker thread

I driver/usb/usb_driver.cc:468] USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/usb_driver.cc:1232] WorkerThreadFunc Re-installing interrupt reader
I driver/usb/local_usb_device.cc:785] AsyncInterruptInTransfer
I driver/usb/local_usb_device.cc:798] ASYNC IN 3 begin
I driver/usb/usb_driver.cc:1317] WorkerThreadFunc waiting on state change
I driver/usb/usb_driver.cc:91] Unlocks both mutex
I driver/usb/local_usb_device.cc:672] ASYNC IN 1 end

I driver/usb/local_usb_device.cc:97] ConvertLibUsbTransferStatus: USB transfer error 2 [LibUsbDataInCallback]
I driver/usb/local_usb_device.cc:642] UnregisterCompletedTransfer
I driver/usb/usb_driver.cc:86] lock (does nothing)
I driver/usb/usb_driver.cc:1322] WorkerThreadFunc driver state change detected
I driver/usb/usb_driver.cc:1174] WorkerThreadFunc dispatching 1 callback events in worker thread
I driver/usb/usb_driver.cc:1260] WorkerThreadFunc Installing bulk-in reader. buffer index [6]

And I am using python to run the 'edgetpu.run_inference()', the behavior is pretty un-consistent, seems have relation with different USB power supply, so is there any USB power requirements for the coral mini running tpu? please give any debug or solution advice, thanks!

xxs1989 avatar Sep 28 '22 16:09 xxs1989

Hi @xxs1989 the board requires 5V/2A power supply. Please check the section 5.1 for more details at: https://coral.ai/static/files/Coral-Dev-Board-Mini-datasheet.pdf. Thanks!

hjonnala avatar Sep 29 '22 03:09 hjonnala