
Repeated core dumps from deCONZ as Home Assistant add-on

Open Bert-R opened this issue 3 years ago • 30 comments

Describe the bug

I am running the deCONZ software as an add-on on Home Assistant. Every so often, that add-on enters a period of repeated crashes (core dumps). I have logged a ticket on the add-on (home-assistant/addons#2442), but given that the add-on just wraps the deCONZ software, I've been referred to this project.

Once in a while, the deCONZ add-on enters a period where it constantly dumps core and gets restarted by the supervisor. During that period, the Zigbee devices cannot be controlled. The problem is intermittent. Over the past 90 days, the deCONZ add-on core dumped 687 times in 7 periods.

Steps to reproduce the behavior

At the moment, I do not see a pattern and have no way to trigger this behavior. I can gather extra data when needed. I'm running Prometheus Node Exporter, so I have OS-level metrics, but these do not provide any clues to me.

Expected behavior

No crashes

Screenshots

Not applicable

Environment

  • Host system: Raspberry Pi
  • Running method: Home Assistant deCONZ Add-on
  • Firmware version: 26660700
  • deCONZ version: 2.14.1
  • Device: ConBee II
  • Do you use a USB extension cable: Yes
  • Are there any other USB or serial devices connected to the host system? Yes, an Aeon Labs USB Z-Wave Plus Controller

deCONZ Logs

No logs available at the moment

Additional context

Not sure what is relevant in this case.

Bert-R avatar Apr 30 '22 09:04 Bert-R

Hi,

We don't maintain the addon. However, it shouldn't crash. Are you able to provide a core dump?

Mimiix avatar Apr 30 '22 12:04 Mimiix

I think this should be fixed with the upcoming v2.15.3 version, there were two fixes after v2.14.1 related to crashes under certain conditions.

manup avatar Apr 30 '22 14:04 manup

That would be great. I'd love to provide the core dump but have no clue how to extract that from the add-on Docker container.

Bert-R avatar Apr 30 '22 14:04 Bert-R

The Home Assistant community has released an update of the add-on based on v2.15.3 and I've upgraded to that version. I've set an alert to get notified if it crashes again, so I'll keep a close watch. Thanks!

Bert-R avatar May 03 '22 06:05 Bert-R

Unfortunately, the issue is not resolved. Yesterday between 14:10 and 14:33, the deCONZ add-on had 40 core dumps. The problem is very intermittent: 33 days passed since the previous set of crashes. What can I do to help analyze this issue?

Bert-R avatar May 23 '22 17:05 Bert-R

Today, it started crashing at 7:47. So far, it created 100 core dumps in 90 minutes, and counting.

Bert-R avatar May 27 '22 08:05 Bert-R

Same here. Maybe it is related to the internet connection? I don't understand why it works for weeks and then suddenly fails for several people around the world at the same time!

cagnulein avatar May 27 '22 08:05 cagnulein

The Phoscon server is down. It probably has to do with that. We've seen it in the past, but Manuel wasn't able to figure out what happened.

You can disable discovery with the REST API, then it should stop.

Mimiix avatar May 27 '22 08:05 Mimiix

Thanks @Mimiix, could you please point me to how I can disable this via the REST API? It could be useful for everybody, thanks!

cagnulein avatar May 27 '22 08:05 cagnulein

It got resolved now. The last crash was 10 minutes ago, so it crashed between 7:47 and 10:48 CEST.

Bert-R avatar May 27 '22 09:05 Bert-R

OK, I guess I've done it now from the settings in the Phoscon page on HASS (advanced settings, last setting).

cagnulein avatar May 27 '22 09:05 cagnulein

> It got resolved now. The last crash was 10 minutes ago, so it crashed between 7:47 and 10:48 CEST.

Yes, same for me.

cagnulein avatar May 27 '22 09:05 cagnulein

Phoscon is online again.

https://dresden-elektronik.github.io/deconz-rest-doc/endpoints/configuration/#modify-configuration

That's how to do it with the REST API.
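
A minimal sketch of what that call could look like, in Python with only the standard library. It assumes the `discovery` field described on the linked configuration page is accepted by `PUT /api/<apikey>/config`; the gateway address and API key are placeholders, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_discovery_request(host: str, api_key: str, enabled: bool) -> urllib.request.Request:
    """Build (but do not send) a PUT /api/<apikey>/config request.

    Assumption: the gateway accepts a boolean `discovery` field in the
    configuration body, as listed on the linked documentation page.
    """
    body = json.dumps({"discovery": enabled}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}/api/{api_key}/config",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def set_discovery(host: str, api_key: str, enabled: bool) -> str:
    """Send the request and return the gateway's raw response text."""
    request = build_discovery_request(host, api_key, enabled)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling `set_discovery("192.168.1.10:80", "YOUR_API_KEY", False)` with your own gateway address and a valid API key would then switch discovery off; passing `True` switches it back on.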

Mimiix avatar May 27 '22 09:05 Mimiix

As there has not been any response in 21 days, this issue has been automatically marked as stale. At OP: Please either close this issue or keep it active. It will be closed in 7 days if no further activity occurs.

github-actions[bot] avatar Jun 18 '22 02:06 github-actions[bot]

As far as I know, it's just coincidental that this issue didn't occur again in the past 21 days, so please keep this on the backlog and apply a structural fix if possible.

Bert-R avatar Jun 18 '22 11:06 Bert-R

> As far as I know, it's just coincidental that this issue didn't occur again in the past 21 days, so please keep this on the backlog and apply a structural fix if possible.

I am not able to put something on the backlog if there's no clear pointer on what goes wrong 😅.

Mimiix avatar Jun 18 '22 12:06 Mimiix

Is there anything I can do to gather more information for analysis? I've asked the maintainer of the Home Assistant deCONZ plug-in for the location of the core dump files.

I understand the crashes are caused by unavailability of the Phoscon server. Would it be possible to add some extra logging in that area, to see what happens in case of a connection failure?

Bert-R avatar Jun 18 '22 13:06 Bert-R

The coredumps would probably help. Logging, I am not sure but why not?

The odd thing is that not everyone is affected.

Mimiix avatar Jun 18 '22 13:06 Mimiix

As there has not been any response in 21 days, this issue has been automatically marked as stale. At OP: Please either close this issue or keep it active. It will be closed in 7 days if no further activity occurs.

github-actions[bot] avatar Jul 10 '22 02:07 github-actions[bot]

This issue is not solved. On July 3rd, it crashed again, this time only once.

Bert-R avatar Jul 11 '22 16:07 Bert-R

Can you share the core dumps?

Mimiix avatar Jul 11 '22 16:07 Mimiix

Same thing was happening to me with the Community Docker container. Got so bad and unreliable I had to take my stick out of use and move the few devices on it over to my other Zigbee hub.

ryancasler avatar Jul 12 '22 01:07 ryancasler

> Same thing was happening to me with the Community Docker container. Got so bad and unreliable I had to take my stick out of use and move the few devices on it over to my other Zigbee hub.

Not sure what your comment is contributing here 😅.

Mimiix avatar Jul 12 '22 06:07 Mimiix

I thought maybe someone would want to address that problem and maybe fix it.

ryancasler avatar Jul 13 '22 21:07 ryancasler

We can't without any core dumps themselves to give some pointers 😅

Mimiix avatar Jul 14 '22 04:07 Mimiix

@Mimiix the issue is just about the Phoscon servers. When they go offline, we have the issue. You can probably simulate the same thing by replacing the Phoscon server URL with a fake one, and you will see the same thing.

cagnulein avatar Jul 14 '22 06:07 cagnulein

> @Mimiix the issue is just about the Phoscon servers. When they go offline, we have the issue. You can probably simulate the same thing by replacing the Phoscon server URL with a fake one, and you will see the same thing.

Which wrong fake one? I never had this issue in my environment. Manuel can't replicate it either.

Mimiix avatar Jul 14 '22 06:07 Mimiix

I mean, since the issue occurs when the Phoscon servers are offline, simply debug it by putting in a fake one like "Phoscon.server" instead of the original URL. That will make the connection fail each time, so you should be able to see the issue every time.

cagnulein avatar Jul 14 '22 06:07 cagnulein

> I mean, since the issue occurs when the Phoscon servers are offline, simply debug it by putting in a fake one like "Phoscon.server" instead of the original URL. That will make the connection fail each time, so you should be able to see the issue every time.

This isn't causing a core dump on my side. I use a native deCONZ install. It simply can't reach the server, but that is the same when I block it on my router's firewall. I never get a crash. Additionally, you can disable the pinging of the discovery server with the REST API.

So again: we can't seem to replicate it and that's why we need the core dumps. We really need user input here in the form of a core dump, otherwise we can't solve it.

Mimiix avatar Jul 14 '22 07:07 Mimiix

As there has not been any response in 21 days, this issue has been automatically marked as stale. At OP: Please either close this issue or keep it active. It will be closed in 7 days if no further activity occurs.

github-actions[bot] avatar Aug 05 '22 02:08 github-actions[bot]