esp8266_milight_hub icon indicating copy to clipboard operation
esp8266_milight_hub copied to clipboard

Memory leak - causes settings to be randomly cleared

Open poudenes opened this issue 5 years ago • 29 comments

Describe the bug

After restore settings I see all the information. But after a while everything is gone in WebUI, strange enough... everything is still working inside HA to control the lights

Steps to reproduce

Restore settings, after while goto the WebGUI and everything is gone again.

Expected behavior

After restore all settings keep in memory.

Additional info

I don't use anymore device name because that gives some issues as well.

Firmware version

1.10.5 (nodemcuv2)

Output of http://milight-hub.local/about: When WebGUI is empty

{"firmware":"milight-hub","version":"1.10.5","ip_address":"192.168.xxx.xxx","reset_reason":"Software/System restart","variant":"nodemcuv2","free_heap":6616,"arduino_version":"2_4_2","queue_stats":{"length":0,"dropped_packets":0}}

Output of http://milight-hub.local/about: When WebGUI is normal

{"firmware":"milight-hub","version":"1.10.5","ip_address":"192.168.xxx.xxx","reset_reason":"Software/System restart","variant":"nodemcuv2","free_heap":16016,"arduino_version":"2_4_2","queue_stats":{"length":0,"dropped_packets":0}}

Output of http://milight-hub.local/settings When WebGUI is empty

{}

Output of http://milight-hub.local/settings When WebGUI is normal

{"admin_username":"","admin_password":"","ce_pin":16,"csn_pin":15,"reset_pin":0,"led_pin":-2,"radio_interface_type":"nRF24","packet_repeats":50,"http_repeat_factor":1,"auto_restart_period":0,"mqtt_server":"192.168.xxx.xxx:1883","mqtt_username":"USERNAME","mqtt_password":"PASSWORD","mqtt_topic_pattern":"milight/:device_id/:device_type/:group_id","mqtt_update_topic_pattern":"milight/updates/:hex_device_id/:device_type/:group_id","mqtt_state_topic_pattern":"milight/states/:hex_device_id/:device_type/:group_id","mqtt_client_status_topic":"milight/client_status","simple_mqtt_client_status":true,"discovery_port":48899,"listen_repeats":0,"state_flush_interval":100,"mqtt_state_rate_limit":500,"packet_repeat_throttle_sensitivity":50,"packet_repeat_throttle_threshold":500,"packet_repeat_minimum":50,"enable_automatic_mode_switching":false,"led_mode_wifi_config":"Fast toggle","led_mode_wifi_failed":"Fast blip","led_mode_operating":"Slow toggle","led_mode_packet":"Fast blip","led_mode_packet_count":4,"hostname":"MilightHub","rf24_power_level":"MAX","rf24_listen_channel":"LOW","wifi_static_ip":"192.68.xxx.xxx","wifi_static_ip_gateway":"192.168.xxx.xxx","wifi_static_ip_netmask":"255.255.255.0","packet_repeats_per_loop":50,"home_assistant_discovery_prefix":"","wifi_mode":"n","default_transition_period":500,"rf24_channels":["LOW","MID","HIGH"],"device_ids":[43681,43682,43683,43684,43690,43691,43692,43693,43694,43695,43706,43707],"gateway_configs":[],"group_state_fields":["state","status","brightness","level","hue","saturation","color","mode","kelvin","color_temp","bulb_mode","computed_color","effect","device_id","group_id","device_type"],"group_id_aliases":{}}

poudenes avatar Feb 25 '20 10:02 poudenes

Thanks for reporting and providing the info. Left a comment on #601, guessing this is related.

This is very interesting:

{"free_heap":6616}

that's critically low and very likely to cause funky problems. Validates the hypothesis that there's a leak and something screwy is happening when heap is low.

Next steps are:

  1. Find what's causing settings to clear when heap is low, add as much guarding around that as possible.
  2. Potentially add auto-restart logic if heap gets too low
  3. Find the source of the leak (my best guess is the packet queue, which is the only place i can think of where memory is regularly allocated dynamically)
  4. Figure out ways to clear up heap. 16KB steady-state is pretty crappy.

sidoh avatar Feb 26 '20 17:02 sidoh

only thing what I can think about is next: I use Circadian Light option in HA. When I have some lights on and reboot my HA because of some changes, it will send lots of data to Hub.

When HA is booted, it will change all lights to right Kelvin, Brightness. In meanwhile it also receive all kind of mqtt information because its update the settings in HA also.

I realise that this clear settings never happend before (only at the moment I tried to add for every light a device name) but leave that and everything was ok. Last moment I try the Circadian Light option for a better healthy life... I will disable that for while and do same things (reboot HA sometimes because of changes). See what happen

Sorry for my bad explanation in English.... :) Im a dutchie!

poudenes avatar Feb 27 '20 08:02 poudenes

Definitely interesting. If you're able to find a way to reliably reproduce a memory leak (just check free_heap from /about), that'd be a huge help. Perhaps restarting HA a few times?

sidoh avatar Feb 28 '20 16:02 sidoh

@sidoh When im 2 days off I will do some tests. If HA is fail then I have time to repair it before my partner gets home in a dark house :)

poudenes avatar Feb 28 '20 16:02 poudenes

Great, much appreciated.

Hah! I can relate. Can't tell you how many times I've accidentally turned the lamp in the other room purple while testing things.

If it's possible for you to share your HA configs, or instructions on how to set up Circadian Light, that might help as well.

How many milight groups to you have set up in HA?

sidoh avatar Feb 28 '20 16:02 sidoh

@sidoh Here's my configuration: https://github.com/poudenes/homeassistant Here example to add Circadian Light

###########################################################################################
#
# CIRCADIAN LIGHTING
#
# https://github.com/claytonjn/hass-circadian_lighting
#
###########################################################################################
---
platform: circadian_lighting
name: "Circadian Bathroom"
lights_ct:
  - light.bathroom_all
disable_brightness_adjust: true

I have 32 individual groups in HA and 6 group 0 in HA

Im using the Patch for correct updates using MQTT from HA to Milight Hub:

###########################################################################################
#
# PATCH WHEN GROUP 0 IS ON OR OFF THE CORRECT STATUS WILL SEND TO MILIGHT HUB FOR ALL LIGHTS INSIDE GROUP
#
###########################################################################################

- alias: "Patches - MiLight Bedroom Group 0"
  initial_state: "true"
  trigger:

    platform: mqtt
    topic: "milight/states/0xAABA/rgb_cct/1"

  action:
    - service: mqtt.publish
      data_template:
        topic: "milight/states/0xAAAA/rgb_cct/1"
        payload_template: "{{ trigger.payload }}"

    - service: mqtt.publish
      data_template:
        topic: "milight/states/0xAAAA/rgb_cct/2"
        payload_template: "{{ trigger.payload }}"

    - service: mqtt.publish
      data_template:
        topic: "milight/states/0xAAAA/rgb_cct/3"
        payload_template: "{{ trigger.payload }}"

    - service: mqtt.publish
      data_template:
        topic: "milight/states/0xAAAA/rgb_cct/4"
        payload_template: "{{ trigger.payload }}"

    - service: mqtt.publish
      data_template:
        topic: "milight/states/0xAAAB/rgb_cct/1"
        payload_template: "{{ trigger.payload }}"

    - service: mqtt.publish
      data_template:
        topic: "milight/states/0xAAAB/rgb_cct/2"
        payload_template: "{{ trigger.payload }}"

I have 7 of those patches.

So I can imagine when Circadian Light is on for "bedroom_all" (group 0) it will update all others, but also all individual lights will controlled by Circadian Light.

There is a bunch of MQTT traffic I guess

poudenes avatar Feb 28 '20 20:02 poudenes

I tried several times with a group command in HA to turn on all groups 0 to on with circadian light correction on. Free heap drops from 15608 Bytes to 12560 Bytes.

Then I hit the group command in HA to turn on and off directly all group 0 groups of Milight Hub On/off on/off 10 times. Takes some time when HA was done with this but then Free heap drops to 11356 Bytes

So I can't reproduce it with Circadian Light correction on in HA. Then this cant be the problem right?

poudenes avatar Feb 29 '20 07:02 poudenes

I left the device name empty as well. Turned on all Circadian Light and there is nothing going wrong. Still have the settings in the hub.

poudenes avatar Mar 03 '20 11:03 poudenes

settings (7nIDs).json.txt

I might have the same issue on multiple locations. To reproduce something similar: -Configure at least 16-20 device ids and device aliases -reboot

Example: My attached config will restore only if I remove at last 3 device alias from the end of file. After reboot I can add a few more with the gui but if I reboot again, the settings will not load correctly. At least the device will keep the network, so restoring a working config is possible remotely.

csigabit avatar Mar 17 '20 16:03 csigabit

@csigabit, that's probably a different issue: #569.

(Open ticket for solution is #577)

sidoh avatar Mar 22 '20 18:03 sidoh

Thanks @poudenes, appreciate you looking further into it. Apologies for the slow replies; I've not had the ability to focus on this project recently.

Anything that reproduces what looks to be a memory leak is useful, and it appears that you've done that. I'll give this a shot.

sidoh avatar Mar 22 '20 18:03 sidoh

I've got circadian_lighting setup in a test environment with a couple of lights.

I attempted to fix the biggest source of dynamic memory allocation I'm aware of (the packet queue). Changed it to a fixed-size circular buffer.

The other culprit is the transition handler, but it'll be significantly more challenging to refactor.

Will leave this running for a bit and see what the heap trend is. Starting point:

{"firmware":"milight-hub","version":"1.10.5","ip_address":"10.133.8.194","reset_reason":"Software/System restart","variant":"d1_mini","free_heap":16464,"arduino_version":"2_4_2","queue_stats":{"length":0,"dropped_packets":0}}

sidoh avatar Mar 23 '20 03:03 sidoh

Its somewhere in the transitions or the load it creates on the packet queue imho. Disabled all transistions just to try. I'm now +12 days uptime without issues (instead of 8-9 before acting up)

Packet heap is still great. reset reason is weird though.

{"firmware":"milight-hub","version":"1.10.5","ip_address":"192.168.1.24","reset_reason":"Exception","variant":"nodemcuv2","free_heap":15480,"arduino_version":"2_4_2","queue_stats":{"length":0,"dropped_packets":63}}

idfxken avatar Apr 10 '20 07:04 idfxken

Yeah, that's tracking with what I've seen too.

Transitions do dynamically allocate memory, so this is either a leak or heap fragmentation. Full disclosure, it'll probably be a little while before I have the time to look more deeply into this, but it's really helpful to know where to look.

sidoh avatar Apr 11 '20 18:04 sidoh

I think I found a pretty nasty source of memory leak, haven't been able to fix it yet. But it's in the state store lookup. Even spamming GET on /gateways triggers it. Will be investigating.

sidoh avatar Apr 27 '20 00:04 sidoh

Update (keeping notes for my own sake):

Appears to be in GroupStateCache. Still don't see where it could possibly be happening, but if I effectively turn the cache off, the problem goes away. Have verified that it's not going crazy creating nodes or GroupState objects.

sidoh avatar Apr 27 '20 17:04 sidoh

Also worth mentioning: this memory leak has been around for ages -- ever since state tracking was introduced. Only seems to appear with a large number of groups (I'm testing with 20 devices and 8 groups). Unclear if this is the problem given it's an older bug, but:

  • Newer versions have higher fixed memory usage (free heap is ~18kb vs ~25kb)
  • Newer versions support features with much higher packet throughput (async, transitions)

sidoh avatar Apr 27 '20 17:04 sidoh

Same problem! I have 12 device_ids -if i try add one more - all setting erase..(( изображение

itProfi avatar May 04 '20 18:05 itProfi

Guys, I apologize for the lack of progress on this issue. Fixing this sort of thing takes a level of time and energy that I've unfortunately not had to give this project in a while.

As a really shitty bandaid -- I expect that using the setting to have the hub restart itself every ~day would at least stop the settings from clearing themselves.

sidoh avatar Jul 01 '20 06:07 sidoh

@sidoh , I'm revisiting this issue after many months of holding off on updating and staying on 1.10.0-dev.3 (one of the latest versions that worked).

I've been running esp8266_milight_hub_nodemcuv2-1.10.7 for about a week now. Initially I was testing with flux disabled since that's where the problem seemed most prevalent. I ran this for about 3 days with heap size staying stable around 16000.

After this I enabled flux, which caused the heapsize to almost drop instantly and never recover (with only 1 light on/in use). I had a look at MQTT to see exactly what's going on. This is the command generated from flux according to the command MQTT topic: {"state": "ON", "color": {"r": 255, "g": 108, "b": 0}, "transition": 1.0}, this repeats around every 30 seconds.

I disabled flux, looked at my heap, and every time I issue the command manually via MQTT I can see the heap size drop with almost the same amount of bytes. This doesn't ALWAYS seem to happen, if I restart the Hub, issue the same command, usually it just works... Until it doesn't, and then every new command will drop the heap further.

Now here's an interesting bit I discovered. I had a look around the REST documentation and noticed the /transitions commands and the ability to delete transitions. The /transitions call in my case returned a lot of records. This when at that time there was only 1 light turned on, currently not transitioning into anything. Even more interesting, for every record the 'period' was a large number which I don't think came from hass/flux. I assume the /transitions should not return anything when no transitions are happening, correct?

Guess what happened when I started deleting these transitions... image

After clearing it's been stable with no "stuck' transitions showing when calling /transitions. I'll keep on the heap/transitions and check if I can reproduce this behavior.

If you've got any other ideas/things I could try any help is appreciated.

MunkeyBalls avatar Feb 24 '21 21:02 MunkeyBalls

Hm, interesting. Can you paste the output of /transitions? I can't think of a reason transitions would get created out of band from the API.

sidoh avatar Feb 24 '21 22:02 sidoh

Here's an example, note that the one with period 500 disappears when querying again, as I would expect. The ones with period 4294967295 seem to just remain there until I manually clear them, which bumps the free heap back up.

{
  "transitions": [
    {
      "id": 1520,
      "period": 4294967295,
      "last_sent": 0,
      "bulb": {
        "device_id": 1,
        "group_id": 1,
        "device_type": "rgb_cct"
      },
      "type": "color",
      "current_color": [
        255,
        167,
        91
      ],
      "end_color": [
        255,
        167,
        91
      ],
      "step_sizes": [
        0,
        0,
        0
      ]
    },
    {
      "id": 1526,
      "period": 4294967295,
      "last_sent": 0,
      "bulb": {
        "device_id": 1,
        "group_id": 1,
        "device_type": "rgb_cct"
      },
      "type": "color",
      "current_color": [
        255,
        167,
        91
      ],
      "end_color": [
        255,
        167,
        91
      ],
      "step_sizes": [
        0,
        0,
        0
      ]
    },
    {
      "id": 1532,
      "period": 4294967295,
      "last_sent": 0,
      "bulb": {
        "device_id": 1,
        "group_id": 1,
        "device_type": "rgb_cct"
      },
      "type": "color",
      "current_color": [
        255,
        167,
        91
      ],
      "end_color": [
        255,
        167,
        91
      ],
      "step_sizes": [
        0,
        0,
        0
      ]
    },
    {
      "id": 1549,
      "period": 500,
      "last_sent": 59762040,
      "bulb": {
        "device_id": 5,
        "group_id": 1,
        "device_type": "rgbw"
      },
      "type": "change_on_finish",
      "field": "status",
      "value": 1,
      "child": {
        "type": "field",
        "field": "level",
        "current_value": 0,
        "end_value": 0,
        "step_size": -25
      }
    }
  ]
}

MunkeyBalls avatar Feb 25 '21 12:02 MunkeyBalls

I just wrote a node-red flow that triggers when free heap gets below a certain value. It will fetch the current transitions with the help of /transitions, and then call the delete function for every item in the list. While doing this I noticed that the transitions call only returns around 4 items, even though there seem to be more since I had to run my command multiple times to get everything cleared. I noticed this as well when doing it manually, but initially thought I was just too slow.

It seems to be working great however, guess where it triggered: image

Ill do some further testing and share the flow in case someone else needs a workaround.

MunkeyBalls avatar Feb 25 '21 17:02 MunkeyBalls

That's very helpful, thanks. If we look at the color transitions, we see this:

      "current_color": [
        255,
        167,
        91
      ],
      "end_color": [
        255,
        167,
        91
      ],
      "step_sizes": [
        0,
        0,
        0
      ]

Which means it's trying to transition from a color to itself. That's why the period is enormous as well (it's the maximum value of a signed integer).

I think the fix here is to just change this:

https://github.com/sidoh/esp8266_milight_hub/blob/master/lib/Transitions/ColorTransition.cpp#L136

to

bool ColorTransition::isFinished() {
  return currentColor == endColor;
}

sidoh avatar Feb 25 '21 17:02 sidoh

Flashed the new firmware and will report back with results, thanks again for your time!

MunkeyBalls avatar Feb 25 '21 17:02 MunkeyBalls

Just to make sure I'm clear, I did not make this change yet. 😅

sidoh avatar Feb 25 '21 17:02 sidoh

Just to make sure I'm clear, I did not make this change yet. 😅

But I did ;)

MunkeyBalls avatar Feb 25 '21 17:02 MunkeyBalls

Excellent! Thanks for testing it out. Feel free to open a PR if it does the trick.

sidoh avatar Feb 25 '21 17:02 sidoh

The fix is working great!

Transitions are no longer getting stuck and heap-size is staying stable around 14000-16000

MunkeyBalls avatar Feb 26 '21 07:02 MunkeyBalls