Program gets stuck in reader.stop_reading()

Open · HanYangZhao opened this issue 4 years ago · 47 comments

I'm testing the program by repeatedly calling reader.start_reading() and reader.stop_reading(). At some point (after a few minutes or a few hours), reader.stop_reading() does not return and the program gets stuck. I was unable to replicate the problem with a pure C implementation, so something seems to go wrong in the way the Python threads are mixed with the C threads. Using gdb, I was able to confirm that one of the C threads gets stuck:

30 Jun 06:45:18 2020 - initializeReader
30 Jun 06:45:18 2020 - stopping current reader
stopping_read
read_callback_null
stats_callback_null
stopping_read_cs
read_callback
^C
Thread 1 "python3" received signal SIGINT, Interrupt.
__libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
46      ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
(gdb) info thread
  Id   Target Id         Frame 
* 1    Thread 0x76fee210 (LWP 3019) "python3" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  2    Thread 0x769b3460 (LWP 3025) "python3" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  3    Thread 0x75fff460 (LWP 3026) "python3" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  4    Thread 0x757fe460 (LWP 3027) "python3" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
(gdb) bt 
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0x76ec3072 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x3f22b4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#2  __pthread_cond_wait_common (abstime=0x0, mutex=0x3f21e0, cond=0x3f2288) at pthread_cond_wait.c:502
#3  __pthread_cond_wait (cond=0x3f2288, mutex=0x3f21e0) at pthread_cond_wait.c:655
#4  0x76b6757a in TMR_stopReading (reader=0x3f10e0) at tm_reader_async.c:387
#5  0x76b5fab0 in Reader_stop_reading (self=0x3f10d8) at mercury.c:976
#6  0x0009ffb2 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 4
[Switching to thread 4 (Thread 0x757fe460 (LWP 3027))]
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
46      in ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S
(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0x76ec5194 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=1, futex_word=0x3f21a8) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#2  do_futex_wait (sem=sem@entry=0x3f21a8, abstime=0x0) at sem_waitcommon.c:115
#3  0x76ec5274 in __new_sem_wait_slow (sem=0x3f21a8, abstime=0x0) at sem_waitcommon.c:282
#4  0x76b6839e in process_async_response (reader=0x3f10e0) at tm_reader_async.c:977
#5  0x76b689b2 in do_background_reads (arg=0x3f10e0) at tm_reader_async.c:1218

tm_reader_async.c:1218  -> process_async_response(reader);
tm_reader_async.c:977 -> sem_wait(&reader->queue_slots);
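
For reference, the test loop looks roughly like this sketch (port, region, and timings are illustrative, not my exact setup):

import time
import mercury

reader = mercury.Reader("tmr:///dev/ttyUSB0", baudrate=115200)
reader.set_region("NA")
reader.set_read_plan([1], "GEN2", read_power=1900)

def on_tag(tag):
    # Keep the callback cheap; it runs on the library's background read thread.
    print(tag.epc)

while True:
    reader.start_reading(on_tag)
    time.sleep(4)          # read for a few seconds
    reader.stop_reading()  # eventually this call never returns
    time.sleep(1)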

HanYangZhao avatar Jun 29 '20 22:06 HanYangZhao

I'm experiencing a similar situation on Windows: the program sometimes hangs when stop_reading() is called. The reader stops, but the call never returns. It's easy to reproduce with a repeated sequence of start_reading(), a delay of a few seconds, and then stop_reading(). I built the python-mercuryapi module from source using Mercury API 1.35.

kmauser avatar Dec 10 '21 13:12 kmauser

I ran into this issue as well. After much work with gdb, I discovered the cause and implemented a fix (see the PR I just opened). Here is my commit message explaining the issue:

As of mercuryapi-1.31.4.35, when tm_reader_async.c's TMR_stopReading() is called by Reader_stop_reading(), it ultimately ends up waiting for reader->readState to become TMR_READ_STATE_DONE while holding the GIL. This happens in tm_reader_async.c's do_background_reads(), but not until the tag queue is emptied by read callbacks. These are triggered by invoke_read_callback(), but this function blocks until it can acquire the GIL. This frequently results in deadlock upon attempting to stop an asynchronous read.

Previously, this was partially addressed by temporarily setting self->readCallback (and self->statsCallback) to NULL in Reader_stop_reading(), but this causes a SEGV in invoke_read_callback() if it occurs after self->readCallback is verified to not be NULL but before it is called. Instead of temporarily setting these function pointers to NULL, release the GIL in Reader_stop_reading(), allowing invoke_read_callback() to empty the tag queue and thereby avoiding deadlock. Then, use a mutex to prevent concurrent calls to TMR_stopReading(), since this was previously prevented by the GIL.

Temporarily setting self->statsCallback to NULL in Reader_stop_reading() also prevented stats callbacks from occurring in subsequent asynchronous reads. This functionality is now fixed.

charlescochran avatar May 20 '22 18:05 charlescochran

Very nice! We've been waiting for a fix for this for a long time; we'll try it out and hope this gets merged. Thanks!

natcl avatar May 20 '22 20:05 natcl

Thanks a lot for your effort, @charlescochran!

@natcl, please let me know whether the fix works for you as well. I will merge it then.

gotthardp avatar May 20 '22 20:05 gotthardp

Thanks, it might take a little while before we can test, but I'll let you know!

natcl avatar May 20 '22 20:05 natcl

@charlescochran I tried your pull request and I'm still getting a freeze in stopReading after a couple of minutes of calling it in a loop (every 4 seconds). Were you able to test it over a long period of time on your side?

natcl avatar Jun 03 '22 13:06 natcl

Hmm. I have been regularly testing it over long periods of time and it has not been freezing.

Perhaps you are encountering a separate bug. Can you run your program using gdb? If so, when the hang happens, please hit Ctrl-C, do a bt, and paste the result here.

charlescochran avatar Jun 03 '22 14:06 charlescochran

Thanks, will do!

natcl avatar Jun 03 '22 14:06 natcl

itp@itp-master:/usr/tpm/bin$ gdb python3
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...(no debugging symbols found)...done.
(gdb) run uhfReader.py
Starting program: /usr/bin/python3 uhfReader.py
Cannot parse expression `.L1207 4@r4'.
warning: Probes-based dynamic linker interface failed.
Reverting to original interface.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
03 Jun 23:47:11 2022 - Starting ThingMagic reader
[New Thread 0x7683c460 (LWP 21173)]
03 Jun 23:47:11 2022 - Heartbeat started
03 Jun 23:47:11 2022 - initializeReader
03 Jun 23:47:11 2022 - stopReading: Reading was already stopped
03 Jun 23:47:11 2022 - Creating new reader on port tmr:///dev/ttymxc0
03 Jun 23:47:11 2022 - {"command": "initializeReader", "antennaPort": 1, "readPower": [[1, 2000], [2, 2000]], "region": "JP", "readerInitialized": true, "fwVersion": "01.0B.04.04-20.20.02.13-BL12.12.13.00", "serialNb": "3618527017600UM+", "readState": "TMR_READ_STATE_IDLE"}
03 Jun 23:47:11 2022 - startReading at 50 50 duty cycle
[New Thread 0x75eff460 (LWP 21181)]
[New Thread 0x756fe460 (LWP 21182)]
03 Jun 23:48:17 2022 - stopReading
^C
Thread 1 "python3" received signal SIGINT, Interrupt.
__libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
46      ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0x76f770d4 in __libc_recvfrom (fd=4, buf=0x7601b018, len=131072, flags=0, addr=..., addrlen=0x7effee38) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
#2  0x0002d2f2 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

natcl avatar Jun 03 '22 14:06 natcl

I think I might be missing something needed to get the full stack?

natcl avatar Jun 03 '22 14:06 natcl

Ah, I installed ionotify and it seems better; new log coming soon.

natcl avatar Jun 03 '22 14:06 natcl

That seems better:

04 Jun 00:05:17 2022 - initializeReader
04 Jun 00:05:17 2022 - stopReading: Reading was already stopped
04 Jun 00:05:17 2022 - {"command": "initializeReader", "antennaPort": 1, "readPower": [[1, 2000], [2, 2000]], "region": "JP", "readerInitialized": true, "fwVersion": "01.0B.04.04-20.20.02.13-BL12.12.13.00", "serialNb": "3618527017600UM+", "readState": "TMR_READ_STATE_DONE"}
04 Jun 00:05:17 2022 - startReading at 50 50 duty cycle
04 Jun 00:05:21 2022 - stopReading
04 Jun 00:05:25 2022 - initializeReader
04 Jun 00:05:25 2022 - stopReading: Reading was already stopped
04 Jun 00:05:25 2022 - {"command": "initializeReader", "antennaPort": 1, "readPower": [[1, 2000], [2, 2000]], "region": "JP", "readerInitialized": true, "fwVersion": "01.0B.04.04-20.20.02.13-BL12.12.13.00", "serialNb": "3618527017600UM+", "readState": "TMR_READ_STATE_DONE"}
04 Jun 00:05:25 2022 - startReading at 50 50 duty cycle
04 Jun 00:05:29 2022 - stopReading
04 Jun 00:05:33 2022 - initializeReader
04 Jun 00:05:33 2022 - stopReading: Reading was already stopped
04 Jun 00:05:33 2022 - {"command": "initializeReader", "antennaPort": 1, "readPower": [[1, 2000], [2, 2000]], "region": "JP", "readerInitialized": true, "fwVersion": "01.0B.04.04-20.20.02.13-BL12.12.13.00", "serialNb": "3618527017600UM+", "readState": "TMR_READ_STATE_DONE"}
04 Jun 00:05:33 2022 - startReading at 50 50 duty cycle
^C
Thread 1 "python3" received signal SIGINT, Interrupt.
__libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
46      ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0x76ec308e in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x3a382c) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#2  __pthread_cond_wait_common (abstime=0x0, mutex=0x3a3758, cond=0x3a3800) at pthread_cond_wait.c:502
#3  __pthread_cond_wait (cond=0x3a3800, mutex=0x3a3758) at pthread_cond_wait.c:655
#4  0x769f8682 in TMR_stopReading () from /usr/lib/python3/dist-packages/mercury.cpython-36m-arm-linux-gnueabihf.so
#5  0x769f371c in ?? () from /usr/lib/python3/dist-packages/mercury.cpython-36m-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

natcl avatar Jun 03 '22 14:06 natcl

Interesting. Please run git log --oneline and paste the result.

charlescochran avatar Jun 03 '22 14:06 charlescochran

Oh, you want to verify that I'm running the correct code? I'm not actually running this from a Git repo; it's packaged as a Debian package with other things. Let me check on my side to confirm I'm running your branch.

natcl avatar Jun 03 '22 15:06 natcl

I want to verify that, and also that you have the other latest commits on master. There have been 6 commits since the latest release, and I seem to remember the latest release having bugs that were fixed by them. My branch is based on the latest master rather than the latest release, so as long as you are not cherry-picking it, you should have those bug fixes as well (assuming you are running my branch, of course).

charlescochran avatar Jun 03 '22 15:06 charlescochran

So here's the commit that was included on our side: I basically merged your master branch. [screenshot of the commit log]

natcl avatar Jun 03 '22 16:06 natcl

That looks correct, in the sense that those are all of my changes. Can you verify that you have these commits as well? https://github.com/gotthardp/python-mercuryapi/compare/v0.5.3...master

Also, when you say you "merged my master branch", do you mean that you merged it with your copy of the upstream code, or with code that includes your own modifications? If the latter, maybe that's the reason for our different results.

In terms of your error: it does look similar to the deadlock I got (and thought I fixed), which is interesting. From your traces, it looks like you are calling stopReading twice for every call to startReading: once after startReading and once after initializeReader. The second one seems to give an error ("Reading was already stopped"). If you are actually calling the Python wrapper's stop_reading() function twice, I would not be surprised if this results in a deadlock or error, even with my commit. I do not think the library supports multiple consecutive calls to stop_reading(), and that was not the problem I was attempting to fix.

charlescochran avatar Jun 03 '22 16:06 charlescochran

I do have those commits as well. I merged your branch, but onto a more recent local version (one that has the get_read_state() method from this PR: https://github.com/gotthardp/python-mercuryapi/pull/117), so perhaps that could be interfering? I'll try your branch without any other modifications.

We're only calling stopReading once. The reason it logs "Reading was already stopped" is that initializeReader stops the reader if it is already running when we call it; in this case it isn't running, so stop_reading() is not actually called there.

natcl avatar Jun 03 '22 17:06 natcl

I checked my program again and we are calling other methods before start_reading, notably set_region, set_read_plan, and set_read_powers. This may be the cause of the lockup. I restarted the test calling only start_reading and stop_reading, and so far there has been no crash (it's been running for 3 hours). I'll leave it running overnight, but I think the fix for stop_reading does indeed work! The other lockup I'm seeing is probably not related to that specific fix. Will report tomorrow!
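
For reference, the sequence of calls I mean is roughly this (values mirror the logs above; the exact set_read_powers argument format, (antenna, power in centi-dBm) pairs, is an assumption on my part):

import mercury

reader = mercury.Reader("tmr:///dev/ttymxc0", baudrate=115200)
reader.set_region("JP")
reader.set_read_plan([1, 2], "GEN2", read_power=2000)
reader.set_read_powers([(1, 2000), (2, 2000)])  # assumed format

reader.start_reading(lambda tag: print(tag.epc))
# ... a few seconds later ...
reader.stop_reading()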

natcl avatar Jun 03 '22 20:06 natcl

I may have spoken too soon: just after hitting send, it crashed again. I wasn't running gdb this time, so I will run it again with gdb. I have the impression that it got stuck in start_reading rather than stop_reading...

natcl avatar Jun 03 '22 20:06 natcl

Gotcha. I have never seen it get stuck in start_reading, so I did not attempt to fix that. In my experience, the deadlock in stop_reading happens quite quickly (after a few seconds or minutes), especially if you increase the rate of starting and stopping (try every 0.25 seconds instead of every 4). Perhaps you could test again without my commit and see if it deadlocks quickly in stop_reading. Then test with my fix, and hopefully it will only crash occasionally in start_reading.

charlescochran avatar Jun 03 '22 20:06 charlescochran

Here's the stack I got; it still seems to be in stop_reading:

04 Jun 06:57:17 2022 - startReading at 50 50 duty cycle
04 Jun 06:57:21 2022 - stopReading
04 Jun 06:57:25 2022 - startReading at 50 50 duty cycle
04 Jun 06:57:29 2022 - stopReading
04 Jun 06:57:33 2022 - startReading at 50 50 duty cycle
04 Jun 06:57:37 2022 - stopReading
04 Jun 06:57:41 2022 - startReading at 50 50 duty cycle
04 Jun 06:57:45 2022 - stopReading
04 Jun 06:57:49 2022 - startReading at 50 50 duty cycle
^C
Thread 1 "python3" received signal SIGINT, Interrupt.
__libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
46      ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0x76ec308e in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x3a382c) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#2  __pthread_cond_wait_common (abstime=0x0, mutex=0x3a3758, cond=0x3a3800) at pthread_cond_wait.c:502
#3  __pthread_cond_wait (cond=0x3a3800, mutex=0x3a3758) at pthread_cond_wait.c:655
#4  0x769f8682 in TMR_stopReading () from /usr/lib/python3/dist-packages/mercury.cpython-36m-arm-linux-gnueabihf.so
#5  0x769f371c in ?? () from /usr/lib/python3/dist-packages/mercury.cpython-36m-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

natcl avatar Jun 03 '22 22:06 natcl

I tried the official build and it indeed crashes sooner (with the same stack trace as above). The newer code also crashes, but it takes more time, so it is definitely an improvement, although not 100% stable yet.

natcl avatar Jun 06 '22 14:06 natcl

Interesting. If I run into a crash with the new code, I will let you know.

charlescochran avatar Jun 06 '22 14:06 charlescochran

Also, if I may ask, what is your reason for starting/stopping the radio on a regular interval? I know this should be fixable, but one reason for wanting to do this is to allow periodic calls to get_temperature(). After my change, this should no longer be necessary, as the stats callback now works for all asynchronous reads, not just the first one.

charlescochran avatar Jun 06 '22 15:06 charlescochran

On our side it's because we also need to write to the tag, so before writing we need to stop reading.
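
Roughly like this sketch (the write() keyword argument names are an assumption on my part; check the module docs for the exact signature):

def rewrite_tag(reader, old_epc, new_epc, on_tag):
    # Stop the asynchronous read, write the tag, then resume reading.
    reader.stop_reading()
    reader.write(epc_code=new_epc, epc_target=old_epc)  # assumed signature
    reader.start_reading(on_tag)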

natcl avatar Jun 06 '22 15:06 natcl

By the way, on our side we're getting the temperature by enabling stats (the reader.enable_stats method).
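
Something along these lines (assuming enable_stats() takes the callback directly; the attributes of the stats object, such as a temperature field, are also an assumption on my part):

def on_stats(stats):
    # The attribute name is a guess; adjust to whatever the stats object exposes.
    temp = getattr(stats, "temperature", None)
    if temp is not None:
        print("reader temperature:", temp)

reader.enable_stats(on_stats)  # 'reader' is an already-initialized mercury.Reader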

natcl avatar Jun 06 '22 15:06 natcl

With the original code (without my changes), do stats work for you after the first time you restart reading? For me, with the original code, the stats callbacks would simply stop getting called after restarting reading for the first time. My code fixes this for me.

charlescochran avatar Jun 06 '22 16:06 charlescochran

It does seem to work on our side. Do you mean that it works the first time you start the reader, but if you then call stop and start it doesn't work anymore?

natcl avatar Jun 06 '22 16:06 natcl

Yep

charlescochran avatar Jun 06 '22 16:06 charlescochran

Ok that's strange, never had that issue here.

natcl avatar Jun 06 '22 16:06 natcl

@natcl I may have found the same deadlock you are experiencing with my code. This deadlock is directly related to an exception called "Buffer overflow", which occurs when the asynchronous tag callbacks are not keeping up, leaving an excess (100+) of tag reads waiting to be processed.

If you don't mind, try enabling exception handling (if you haven't already) with Reader.enable_exception_handler() and print the exception in your exception callback. Then, see if a "Buffer overflow" (and/or "Timeout") is printed right before you deadlock. I would love to know if we are experiencing the same issue!
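
Something along these lines (assuming the callback is passed directly to enable_exception_handler(); the handler body is just illustrative):

def on_exception(e):
    # Surface asynchronous errors such as "Buffer overflow" or "Timeout".
    print("reader exception:", e)

reader.enable_exception_handler(on_exception)  # 'reader' is an already-initialized mercury.Reader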

charlescochran avatar Jun 24 '22 19:06 charlescochran

By the way, you can easily trigger a "Buffer overflow" by adding a sleep (0.1 seconds should do it; 0.5 definitely will) to your read callback function.
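
For example:

import time

def slow_read_callback(tag):
    time.sleep(0.5)  # deliberately too slow; should provoke a "Buffer overflow"
    print(tag.epc)

reader.start_reading(slow_read_callback)  # 'reader' is an already-initialized mercury.Reader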

charlescochran avatar Jun 24 '22 20:06 charlescochran

Thanks, will give it a try soon!

natcl avatar Jun 25 '22 00:06 natcl

@natcl @gotthardp I've investigated the buffer overflow error (caused by delays in the tag read callback). Ideally, the error would result in either (a) the reader simply dropping tag reads until the buffer is no longer full (which is, I believe, the intended behavior), or (b) the reader stopping its read and restarting it once the buffer is no longer full. Unfortunately, neither of these occurs. Due to a bug in the underlying MercuryAPI, a buffer overflow leaves the reader in a broken state, recoverable only by reinitialization. Manually stopping the read after a buffer overflow and before the "Timeout" error which occurs ~11 seconds later results in a deadlock. Nathanaël, I believe this is responsible for the deadlock you are still experiencing.

After determining that the failure to properly handle buffer overflows is in the underlying MercuryAPI rather than in python-mercuryapi, I realized that python-mercuryapi is using MercuryAPI v1.31.4.35, released in Jan 2020. Since then, there have been three releases. The latest, v1.35.2.72, was released earlier this month. Its release notes mentioned something that sounded like a fix (the addition of the TMR_flush method), so I upgraded versions. In v1.35.2.72, a buffer overflow notably results in an exception ("CRC error") being repeatedly raised, and a "Timeout" error does not occur after ~11 seconds. Stopping the read, however, still results in a deadlock, and I am still unable to find a way of recovering from a buffer overflow without reinitializing the reader. The TMR_flush method does not help. A buffer overflow can potentially be handled by waiting for the "Timeout" exception to occur after ~11 seconds, stopping the read, reinitializing the reader, and starting the read again, but this is not an ideal solution.

The work-around I am adopting for my application is to try to prevent buffer overflows from occurring by no longer doing any I/O in my registered tag read callback function. Instead, I simply store the TagReadData object in a queue.Queue, which is, in turn, processed and emptied by a different thread. Nathanaël, once you verify that your deadlock is preceded by a buffer overflow, taking a similar approach may allow you to proceed. This is, of course, also not an ideal solution. I may contact Jadak and explain the bug in hopes that it may be fixed in a subsequent version.

As the buffer overflow deadlock is, in fact, unrelated to the original SEGV and deadlock I fixed, I believe we should still merge my prior commit. At this time, I will not commit the upgrade to MercuryAPI v1.35.2.72, as the upgrade did not fix the buffer overflow issue.

NOTE for anyone interested in eventually merging an upgrade to MercuryAPI v1.35.2.72: this version adds a preprocessor macro called TMR_ENABLE_CRC, which is defined to be 0 in tm_config.h. Several #if TMR_ENABLE_CRC statements are added around existing sections of code. Since TMR_ENABLE_CRC is 0 rather than 1 by default, this has the effect of changing default behavior. In fact, it prevents me, using an M6e (serial) reader, from initializing my reader in the first place (trying to do so results in a "Timeout" exception). Redefining TMR_ENABLE_CRC to be 1 (which would require a patch to make the default in python-mercuryapi) allows me to initialize a mercury.Reader object. Note, however, that after this is done, initialization still fails with a "Timeout" exception if the previous run experienced a buffer overflow error. This last behavior does not occur in v1.31.4.35.
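
For anyone who wants to copy the work-around, it looks roughly like this (names and details are illustrative and will vary with your application):

import queue
import threading

tag_queue = queue.Queue()

def read_callback(tag):
    # No I/O here: just hand the TagReadData off and return immediately,
    # so the MercuryAPI tag queue never fills up.
    tag_queue.put(tag)

def consume_tags():
    while True:
        tag = tag_queue.get()
        # Slow work (logging, network I/O, database writes, ...) happens here,
        # off the background read thread.
        print(tag.epc, tag.antenna, tag.rssi)
        tag_queue.task_done()

threading.Thread(target=consume_tags, daemon=True).start()
reader.start_reading(read_callback)  # 'reader' is an already-initialized mercury.Reader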

charlescochran avatar Jun 29 '22 21:06 charlescochran

@natcl @gotthardp Update: I submitted a bug report to Jadak. They told me that they were able to reproduce the issue and will let me know if there are any updates regarding a solution.

charlescochran avatar Jul 07 '22 12:07 charlescochran

@charlescochran thanks for the detailed testing! That's interesting: on the site where we are using the readers (around 60 of them), some devices will sometimes have timeout errors. That seems to happen only on specific devices, and it doesn't seem to be related to whether tags are present or not. It's probably not the same issue, though. I assume that in the case you describe the buffer overflow only happens when tags are being read? Let me know if it's still useful for me to test!

Thanks!

natcl avatar Jul 07 '22 13:07 natcl

Yes, I've found that timeout errors can occur for other reasons, for example during reader initialization. A buffer overflow seems to cause a timeout error (in v1.31.4.35, anyway), but other things can cause them too. I believe the timeout error itself is not what causes the deadlock. I would still love for you to check whether a buffer overflow is occurring prior to your deadlock; it is true that this should only occur when tags are being read. Just make sure exception callbacks are enabled so you will actually see the error. As I mentioned earlier, you can confirm that you will see it by reproducing it with a sleep() in your read callback function.

charlescochran avatar Jul 07 '22 13:07 charlescochran