zigpy-znp icon indicating copy to clipboard operation
zigpy-znp copied to clipboard

Coordinator freezes with SimpleLink SDK >= 6.x.x

Open dumpfheimer opened this issue 2 years ago • 65 comments

When using koenkk`s development firmware, that is built on SimpleLink SDK Version 6.10 or 6.20, the coordinator seems to freeze under certain conditions.

It seems like memory issues that causes the lock-ups, which might be triggered at high loads or simply after some time by chance.

I believe this might be an issue only since a month or two and, while I do believe the root cause is somewhere in the coordinator firmware, I think it was fairly recent changes in zigpy that started triggering the bug. This is why:

I have had the development firmware from Feb running since Feb without issues. Some time a month ago the issues started (I was very likely on zigpy dev) when I upgraded the firmware to the latest dev build from koenkk. Downgrading to Feb firmware did not fix the issue. I had to downgrade all the way to a SDK 5.x.x Version to have a stable environment again.

Have there been noticable changes in July (+/-)?

@puddly you commented on an issue I created here and mentioned RAM usage here .

Is there something that could be done within zigpy to reduce memory usage on the controller (without loss of function mentioned here )? Or do you, zigpy devs, believe this must be fixed in SimpleLink?

Thanks for your work, it's very much appreciated.

dumpfheimer avatar Aug 23 '22 07:08 dumpfheimer

Could you email me (or upload here) a complete log of it crashing? I've been running that firmware for a few days and have experienced no problems with it. zigpy-znp should recover even from complete coordinator lock-ups (i.e. where the serial port is still alive) and try to reconnect.

puddly avatar Aug 24 '22 18:08 puddly

The serial port is not alive anymore when this happens. The only thing that helps is power cycling the coordinator. Do you need debug turned on? Or should warnings be enough?

dumpfheimer avatar Aug 24 '22 19:08 dumpfheimer

I see. So the lockup is detected, but the coordinator itself will never reset and zigpy-znp endlessly tries to reconnect? Are you using a TCP coordinator or is this a USB one?

puddly avatar Aug 24 '22 19:08 puddly

It's a Cc1352P2 launchpad connected via USB.

Yes if I remember correctly it seemed like zigpy recognized the connection issue and tried to reconnect but everything seemed to result in TimeoutError s (from the top of my head)

dumpfheimer avatar Aug 24 '22 19:08 dumpfheimer

Turned on debug logging

dumpfheimer avatar Aug 24 '22 19:08 dumpfheimer

Thanks. I feel like this is still a firmware problem: reading too slowly from a serial port shouldn't cause an unrecoverable crash, even if the other side of the serial connection is doing something to cause it.

That being said, moving pyserial-asyncio to another thread like you did may be necessary. We already do this with bellows (https://github.com/zigpy/bellows/blob/3c3ee0296d35eb43d0493eb9b2160bc4484e892c/bellows/uart.py#L386).

puddly avatar Aug 24 '22 19:08 puddly

I'm with you. No matter what the controller should stay up IMHO. It just got me thinking that z2m users seem to not have the same issues.

I am having a good experience with the dedicated thread. But I still get the seemingly random lock ups (eg at night).

Thanks for the link to bellows, maybe I can get some ideas for my znp PoC.

dumpfheimer avatar Aug 24 '22 19:08 dumpfheimer

@dumpfheimer Can you replicate the crash with https://github.com/puddly/serialpy?

You'll have to slightly patch zigpy-znp to use it but I'm wondering if pyserial itself may be the cause:

diff --git a/zigpy_znp/uart.py b/zigpy_znp/uart.py
index 5571e60..eb662e1 100644
--- a/zigpy_znp/uart.py
+++ b/zigpy_znp/uart.py
@@ -21,6 +21,10 @@ with warnings.catch_warnings():
     import serial_asyncio  # noqa: E402


+import serialpy as serial
+import serialpy as serial_asyncio
+
+
 LOGGER = logging.getLogger(__name__)

puddly avatar Aug 25 '22 15:08 puddly

I think I have it running. Can I somehow verify that it's using your serialpy impl?

dumpfheimer avatar Aug 25 '22 19:08 dumpfheimer

If you add serialpy.serial: debug to your HA logging config, you will see lines like this on startup:

2022-08-25 15:38:47.774 ubuntu zigpy_znp.api DEBUG Toggling RTS/DTR pins to skip bootloader or reset chip
2022-08-25 15:38:47.775 ubuntu zigpy_znp.uart DEBUG Setting serial pin states: DTR=False, RTS=False
2022-08-25 15:38:47.775 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000002
2022-08-25 15:38:47.775 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000004
2022-08-25 15:38:47.926 ubuntu zigpy_znp.uart DEBUG Setting serial pin states: DTR=False, RTS=True
2022-08-25 15:38:47.926 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000002
2022-08-25 15:38:47.927 ubuntu serialpy.serial DEBUG Setting modem bits: 0x00000004
2022-08-25 15:38:48.079 ubuntu zigpy_znp.uart DEBUG Setting serial pin states: DTR=False, RTS=False
2022-08-25 15:38:48.079 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000002
2022-08-25 15:38:48.079 ubuntu serialpy.serial DEBUG Clearing modem bits: 0x00000004

puddly avatar Aug 25 '22 19:08 puddly

Can you make sense of this error?

2022-08-25` 22:07:00.143 DEBUG (MainThread) [zigpy_znp.uart] Connecting to /dev/serial/by-id/usb-Texas_Instruments_XDS110__03.00.00.20__Embed_with_CMSIS-DAP_L4300230-if00 at 115200 baud
2022-08-25 22:07:00.147 DEBUG (MainThread) [zigpy_znp.api] Connection to /dev/serial/by-id/usb-Texas_Instruments_XDS110__03.00.00.20__Embed_with_CMSIS-DAP_L4300230-if00 failed, cleaning up
2022-08-25 22:07:00.147 ERROR (MainThread) [zigpy.application] Couldn't start application
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 106, in startup
    await self.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
    await znp.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 688, in connect
    self._uart = await uart.connect(self._config[conf.CONF_DEVICE], self)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 186, in connect
    _, protocol = await serial_asyncio.create_serial_connection(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 67, in create_serial_connection
    transport = transport_factory(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 30, in __init__
    self._serial = Serial(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/serial.py", line 111, in __init__                                         self._file = os.fdopen(self._fileno, "rb+")
  File "/usr/lib/python3.10/os.py", line 1029, in fdopen
    return io.open(fd, mode, buffering, encoding, *args, **kwargs)
io.UnsupportedOperation: File or stream is not seekable.
2022-08-25 22:07:00.153 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback ZnpMtProtocol.connection_made(<serialpy.Ser...x7eff2c21ef00>):   File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 126, in new
    await app.startup(auto_form=auto_form)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 106, in startup
    await self.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
    await znp.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 688, in connect
    self._uart = await uart.connect(self._config[conf.CONF_DEVICE], self)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 186, in connect
    _, protocol = await serial_asyncio.create_serial_connection(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 67, in create_serial_connection
    transport = transport_factory(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 29, in __init__
    super().__init__(loop, protocol, path, waiter, extra)
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/descriptor_transport.py", line 50, in __init__
    self._loop.call_soon(self._protocol.connection_made, self)
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 66, in connection_made
    LOGGER.debug("Opened %s serial port", transport.serial.name)
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 44, in serial
    return self._serial
AttributeError: 'SerialTransport' object has no attribute '_serial'
2022-08-25 22:07:00.155 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback BaseSelectorEventLoop.add_reader(44, <bound method...7eff2c21ef00>>):   File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 126, in new
    await app.startup(auto_form=auto_form)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 106, in startup
    await self.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
    await znp.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 688, in connect
    self._uart = await uart.connect(self._config[conf.CONF_DEVICE], self)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 186, in connect
    _, protocol = await serial_asyncio.create_serial_connection(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 67, in create_serial_connection
    transport = transport_factory(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 29, in __init__
    super().__init__(loop, protocol, path, waiter, extra)
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/descriptor_transport.py", line 51, in __init__
    self._loop.call_soon(self._loop.add_reader, self._fileno, self._read_ready)
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 256, in _add_reader
    key = self._selector.get_key(fd)
  File "/usr/lib/python3.10/selectors.py", line 193, in get_key
    raise KeyError("{!r} is not registered".format(fileobj)) from None
KeyError: '44 is not registered'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 331, in add_reader
    self._add_reader(fd, callback, *args)
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 258, in _add_reader
    self._selector.register(fd, selectors.EVENT_READ,
  File "/usr/lib/python3.10/selectors.py", line 360, in register
    self._selector.register(key.fd, poller_events)
OSError: [Errno 9] Bad file descriptor
2022-08-25 22:07:00.178 ERROR (MainThread) [homeassistant.components.zha.core.gateway] Couldn't start ZNP = Texas Instruments Z-Stack ZNP protocol: CC253x, CC26x2, CC13x2 coordinator
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/components/zha/core/gateway.py", line 170, in async_initialize
    self.application_controller = await app_controller_cls.new(
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 126, in new
    await app.startup(auto_form=auto_form)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/application.py", line 106, in startup
    await self.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
    await znp.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 688, in connect
    self._uart = await uart.connect(self._config[conf.CONF_DEVICE], self)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/uart.py", line 186, in connect
    _, protocol = await serial_asyncio.create_serial_connection(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 67, in create_serial_connection
    transport = transport_factory(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/__init__.py", line 30, in __init__
    self._serial = Serial(
  File "/srv/homeassistant/lib/python3.10/site-packages/serialpy/serial.py", line 111, in __init__                                         self._file = os.fdopen(self._fileno, "rb+")
  File "/usr/lib/python3.10/os.py", line 1029, in fdopen
    return io.open(fd, mode, buffering, encoding, *args, **kwargs)
io.UnsupportedOperation: File or stream is not seekable.```

dumpfheimer avatar Aug 25 '22 20:08 dumpfheimer

Maybe buffering=0? open("/dev/tty", "r+b", buffering=0)

From https://bugs.python.org/issue20074

dumpfheimer avatar Aug 25 '22 20:08 dumpfheimer

Seems like that did the trick

dumpfheimer avatar Aug 25 '22 20:08 dumpfheimer

Nice find, thanks!

puddly avatar Aug 25 '22 20:08 puddly

Okay, got your latest serialpy (starts fine now, thanks) and flashed the 20220726 firmware again. Also, I disabled my "threaded zigpy-znp stuff" just to be sure.

dumpfheimer avatar Aug 25 '22 20:08 dumpfheimer

Just curious, was there a reason you started your own serial implementation?

dumpfheimer avatar Aug 25 '22 21:08 dumpfheimer

Pyserial has an entirely synchronous API and pyserial-asyncio as a shim doesn't actually delegate file descriptor reads to the event loop: it still uses select() internally, after the event loop notifies it of new data.

This is more of a test than anything else. Maintaining an entire serial library that is more than just POSIX-compliant is a little out of scope for zigpy but if this alternative implementation doesn't have the same bugs, it may help narrow down where to look.

puddly avatar Aug 25 '22 21:08 puddly

Very interesting! And a bit over my head :-D

My controller locked up again over night. These are the last lines:

2022-08-26 03:15:05.741 DEBUG (MainThread) [zigpy_znp.api] Sending request: SYS.Ping.Req()
2022-08-26 03:15:05.742 DEBUG (MainThread) [homeassistant.components.zha.core.channels.base] [0xE087:1:0x0b04]: failed to get attributes '['active_power', 'rms_current', 'rms_voltage']' on 'electrical_measurement' cluster:
2022-08-26 03:15:05.812 DEBUG (MainThread) [homeassistant.components.zha.core.channels.base] [0xE087:1:0x0702]: failed to get attributes '['current_summ_delivered', 'status']' on 'smartenergy_metering' cluster:
2022-08-26 03:15:10.742 ERROR (MainThread) [zigpy_znp.zigbee.application] Watchdog check failed
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 984, in request
    response = await response_future
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 709, in _watchdog_loop
    await self._znp.request(c.SYS.Ping.Req())
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 982, in request
    async with async_timeout.timeout(self._znp_config[conf.CONF_SREQ_TIMEOUT]):
  File "/srv/homeassistant/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
    self._do_exit(exc_type)
  File "/srv/homeassistant/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2022-08-26 03:15:10.743 DEBUG (MainThread) [zigpy_znp.zigbee.application] Connection lost:
2022-08-26 03:15:10.744 DEBUG (MainThread) [zigpy_znp.uart] Closing serial port
2022-08-26 03:15:10.745 DEBUG (MainThread) [zigpy_znp.zigbee.application] Restarting background reconnection task

I will now try again with the dedicated thread in zigpy znp

dumpfheimer avatar Aug 26 '22 05:08 dumpfheimer

I am not sure if restarting HA couple of times would make things better or worse, but up until now I have had no issue with serialpy + dedicated ZNP thread

dumpfheimer avatar Aug 26 '22 14:08 dumpfheimer

Still no crash 🙂

dumpfheimer avatar Aug 27 '22 07:08 dumpfheimer

Still no crash, I think you're on to something with the serialpy 👏

dumpfheimer avatar Aug 27 '22 19:08 dumpfheimer

Was it crashing before, using your serial thread but with pyserial instead?

puddly avatar Aug 27 '22 19:08 puddly

Yes, this is the first setup that seems stable with a SimpleLink SDK 6 firmware

dumpfheimer avatar Aug 27 '22 19:08 dumpfheimer

My controller froze over night 😟 will try to find something in the logs

dumpfheimer avatar Aug 28 '22 05:08 dumpfheimer

One thing that just came to my mind: I have quite a few smart plugs witch energy measurement. (TS011F)

When I started using them in December they did not Report some Attributes on their own - and I think they still don't. I believe someone built a timer in zigpy / quirks that does that now (if I remember correctly every 30s).

One thing that MIGHT make a difference is if one of these plugs is unavailable - which was not the case the night before, but was the case tonight.

Maybe route discovery has a memory leak or something similar? Maybe the timer is messing with IO?

dumpfheimer avatar Aug 28 '22 07:08 dumpfheimer

This might be a clue! I went to the device page of an unavailable device (that plug) and went to the electrical measurement cluster and requested the current power few dozen time. My controller crashed within a few hours. Until then the log was full with route discovery requests and errors

dumpfheimer avatar Aug 28 '22 09:08 dumpfheimer

Does someone know if/how the attribute polling could affect the controller/serial loop?

dumpfheimer avatar Aug 28 '22 15:08 dumpfheimer

Hmmm. Maybe try dropping the request concurrency? It should be 16 for you. Maybe try 4 or 8? https://github.com/zigpy/zigpy-znp#configuration

I think we should just try to get a proper debug report from one of the TI dev kits of the state of the microcontroller when it crashes and submit a bug report to Texas Instruments.

puddly avatar Aug 28 '22 16:08 puddly

I'll try that!

How could I do the dump? I have a headless server. Is there a open source tool in the Ubuntu repositories that could do that?

dumpfheimer avatar Aug 28 '22 16:08 dumpfheimer

Is it possible to initiate a route discovery manually? I would like to try to automatically issue a bunch of route discovery requests to try and see if that might be the issue.

dumpfheimer avatar Aug 28 '22 17:08 dumpfheimer