esp8266_milight_hub
esp8266_milight_hub copied to clipboard
Memory leak - causes settings to be randomly cleared
Describe the bug
After restore settings I see all the information. But after a while everything is gone in WebUI, strange enough... everything is still working inside HA to control the lights
Steps to reproduce
Restore settings, after while goto the WebGUI and everything is gone again.
Expected behavior
After restore all settings keep in memory.
Additional info
I don't use anymore device name because that gives some issues as well.
Firmware version
1.10.5 (nodemcuv2)
Output of http://milight-hub.local/about: When WebGUI is empty
{"firmware":"milight-hub","version":"1.10.5","ip_address":"192.168.xxx.xxx","reset_reason":"Software/System restart","variant":"nodemcuv2","free_heap":6616,"arduino_version":"2_4_2","queue_stats":{"length":0,"dropped_packets":0}}
Output of http://milight-hub.local/about: When WebGUI is normal
{"firmware":"milight-hub","version":"1.10.5","ip_address":"192.168.xxx.xxx","reset_reason":"Software/System restart","variant":"nodemcuv2","free_heap":16016,"arduino_version":"2_4_2","queue_stats":{"length":0,"dropped_packets":0}}
Output of http://milight-hub.local/settings When WebGUI is empty
{}
Output of http://milight-hub.local/settings When WebGUI is normal
{"admin_username":"","admin_password":"","ce_pin":16,"csn_pin":15,"reset_pin":0,"led_pin":-2,"radio_interface_type":"nRF24","packet_repeats":50,"http_repeat_factor":1,"auto_restart_period":0,"mqtt_server":"192.168.xxx.xxx:1883","mqtt_username":"USERNAME","mqtt_password":"PASSWORD","mqtt_topic_pattern":"milight/:device_id/:device_type/:group_id","mqtt_update_topic_pattern":"milight/updates/:hex_device_id/:device_type/:group_id","mqtt_state_topic_pattern":"milight/states/:hex_device_id/:device_type/:group_id","mqtt_client_status_topic":"milight/client_status","simple_mqtt_client_status":true,"discovery_port":48899,"listen_repeats":0,"state_flush_interval":100,"mqtt_state_rate_limit":500,"packet_repeat_throttle_sensitivity":50,"packet_repeat_throttle_threshold":500,"packet_repeat_minimum":50,"enable_automatic_mode_switching":false,"led_mode_wifi_config":"Fast toggle","led_mode_wifi_failed":"Fast blip","led_mode_operating":"Slow toggle","led_mode_packet":"Fast blip","led_mode_packet_count":4,"hostname":"MilightHub","rf24_power_level":"MAX","rf24_listen_channel":"LOW","wifi_static_ip":"192.68.xxx.xxx","wifi_static_ip_gateway":"192.168.xxx.xxx","wifi_static_ip_netmask":"255.255.255.0","packet_repeats_per_loop":50,"home_assistant_discovery_prefix":"","wifi_mode":"n","default_transition_period":500,"rf24_channels":["LOW","MID","HIGH"],"device_ids":[43681,43682,43683,43684,43690,43691,43692,43693,43694,43695,43706,43707],"gateway_configs":[],"group_state_fields":["state","status","brightness","level","hue","saturation","color","mode","kelvin","color_temp","bulb_mode","computed_color","effect","device_id","group_id","device_type"],"group_id_aliases":{}}
Thanks for reporting and providing the info. Left a comment on #601, guessing this is related.
This is very interesting:
{"free_heap":6616}
that's critically low and very likely to cause funky problems. Validates the hypothesis that there's a leak and something screwy is happening when heap is low.
Next steps are:
- Find what's causing settings to clear when heap is low, add as much guarding around that as possible.
- Potentially add auto-restart logic if heap gets too low
- Find the source of the leak (my best guess is the packet queue, which is the only place i can think of where memory is regularly allocated dynamically)
- Figure out ways to clear up heap. 16KB steady-state is pretty crappy.
only thing what I can think about is next: I use Circadian Light option in HA. When I have some lights on and reboot my HA because of some changes, it will send lots of data to Hub.
When HA is booted, it will change all lights to right Kelvin, Brightness. In meanwhile it also receive all kind of mqtt information because its update the settings in HA also.
I realise that this clear settings never happend before (only at the moment I tried to add for every light a device name) but leave that and everything was ok. Last moment I try the Circadian Light option for a better healthy life... I will disable that for while and do same things (reboot HA sometimes because of changes). See what happen
Sorry for my bad explanation in English.... :) Im a dutchie!
Definitely interesting. If you're able to find a way to reliably reproduce a memory leak (just check free_heap
from /about
), that'd be a huge help. Perhaps restarting HA a few times?
@sidoh When im 2 days off I will do some tests. If HA is fail then I have time to repair it before my partner gets home in a dark house :)
Great, much appreciated.
Hah! I can relate. Can't tell you how many times I've accidentally turned the lamp in the other room purple while testing things.
If it's possible for you to share your HA configs, or instructions on how to set up Circadian Light, that might help as well.
How many milight groups to you have set up in HA?
@sidoh Here's my configuration: https://github.com/poudenes/homeassistant Here example to add Circadian Light
###########################################################################################
#
# CIRCADIAN LIGHTING
#
# https://github.com/claytonjn/hass-circadian_lighting
#
###########################################################################################
---
platform: circadian_lighting
name: "Circadian Bathroom"
lights_ct:
- light.bathroom_all
disable_brightness_adjust: true
I have 32 individual groups in HA and 6 group 0 in HA
Im using the Patch for correct updates using MQTT from HA to Milight Hub:
###########################################################################################
#
# PATCH WHEN GROUP 0 IS ON OR OFF THE CORRECT STATUS WILL SEND TO MILIGHT HUB FOR ALL LIGHTS INSIDE GROUP
#
###########################################################################################
- alias: "Patches - MiLight Bedroom Group 0"
initial_state: "true"
trigger:
platform: mqtt
topic: "milight/states/0xAABA/rgb_cct/1"
action:
- service: mqtt.publish
data_template:
topic: "milight/states/0xAAAA/rgb_cct/1"
payload_template: "{{ trigger.payload }}"
- service: mqtt.publish
data_template:
topic: "milight/states/0xAAAA/rgb_cct/2"
payload_template: "{{ trigger.payload }}"
- service: mqtt.publish
data_template:
topic: "milight/states/0xAAAA/rgb_cct/3"
payload_template: "{{ trigger.payload }}"
- service: mqtt.publish
data_template:
topic: "milight/states/0xAAAA/rgb_cct/4"
payload_template: "{{ trigger.payload }}"
- service: mqtt.publish
data_template:
topic: "milight/states/0xAAAB/rgb_cct/1"
payload_template: "{{ trigger.payload }}"
- service: mqtt.publish
data_template:
topic: "milight/states/0xAAAB/rgb_cct/2"
payload_template: "{{ trigger.payload }}"
I have 7 of those patches.
So I can imagine when Circadian Light is on for "bedroom_all" (group 0) it will update all others, but also all individual lights will controlled by Circadian Light.
There is a bunch of MQTT traffic I guess
I tried several times with a group command in HA to turn on all groups 0 to on with circadian light correction on. Free heap drops from 15608 Bytes to 12560 Bytes.
Then I hit the group command in HA to turn on and off directly all group 0 groups of Milight Hub On/off on/off 10 times. Takes some time when HA was done with this but then Free heap drops to 11356 Bytes
So I can't reproduce it with Circadian Light correction on in HA. Then this cant be the problem right?
I left the device name empty as well. Turned on all Circadian Light and there is nothing going wrong. Still have the settings in the hub.
I might have the same issue on multiple locations. To reproduce something similar: -Configure at least 16-20 device ids and device aliases -reboot
Example: My attached config will restore only if I remove at last 3 device alias from the end of file. After reboot I can add a few more with the gui but if I reboot again, the settings will not load correctly. At least the device will keep the network, so restoring a working config is possible remotely.
@csigabit, that's probably a different issue: #569.
(Open ticket for solution is #577)
Thanks @poudenes, appreciate you looking further into it. Apologies for the slow replies; I've not had the ability to focus on this project recently.
Anything that reproduces what looks to be a memory leak is useful, and it appears that you've done that. I'll give this a shot.
I've got circadian_lighting setup in a test environment with a couple of lights.
I attempted to fix the biggest source of dynamic memory allocation I'm aware of (the packet queue). Changed it to a fixed-size circular buffer.
The other culprit is the transition handler, but it'll be significantly more challenging to refactor.
Will leave this running for a bit and see what the heap trend is. Starting point:
{"firmware":"milight-hub","version":"1.10.5","ip_address":"10.133.8.194","reset_reason":"Software/System restart","variant":"d1_mini","free_heap":16464,"arduino_version":"2_4_2","queue_stats":{"length":0,"dropped_packets":0}}
Its somewhere in the transitions or the load it creates on the packet queue imho. Disabled all transistions just to try. I'm now +12 days uptime without issues (instead of 8-9 before acting up)
Packet heap is still great. reset reason is weird though.
{"firmware":"milight-hub","version":"1.10.5","ip_address":"192.168.1.24","reset_reason":"Exception","variant":"nodemcuv2","free_heap":15480,"arduino_version":"2_4_2","queue_stats":{"length":0,"dropped_packets":63}}
Yeah, that's tracking with what I've seen too.
Transitions do dynamically allocate memory, so this is either a leak or heap fragmentation. Full disclosure, it'll probably be a little while before I have the time to look more deeply into this, but it's really helpful to know where to look.
I think I found a pretty nasty source of memory leak, haven't been able to fix it yet. But it's in the state store lookup. Even spamming GET
on /gateways
triggers it. Will be investigating.
Update (keeping notes for my own sake):
Appears to be in GroupStateCache. Still don't see where it could possibly be happening, but if I effectively turn the cache off, the problem goes away. Have verified that it's not going crazy creating nodes or GroupState objects.
Also worth mentioning: this memory leak has been around for ages -- ever since state tracking was introduced. Only seems to appear with a large number of groups (I'm testing with 20 devices and 8 groups). Unclear if this is the problem given it's an older bug, but:
- Newer versions have higher fixed memory usage (free heap is ~18kb vs ~25kb)
- Newer versions support features with much higher packet throughput (async, transitions)
Same problem! I have 12 device_ids -if i try add one more - all setting erase..((
Guys, I apologize for the lack of progress on this issue. Fixing this sort of thing takes a level of time and energy that I've unfortunately not had to give this project in a while.
As a really shitty bandaid -- I expect that using the setting to have the hub restart itself every ~day would at least stop the settings from clearing themselves.
@sidoh , I'm revisiting this issue after many months of holding off on updating and staying on 1.10.0-dev.3 (one of the latest versions that worked).
I've been running esp8266_milight_hub_nodemcuv2-1.10.7 for about a week now. Initially I was testing with flux disabled since that's where the problem seemed most prevalent. I ran this for about 3 days with heap size staying stable around 16000.
After this I enabled flux, which caused the heapsize to almost drop instantly and never recover (with only 1 light on/in use). I had a look at MQTT to see exactly what's going on. This is the command generated from flux according to the command MQTT topic: {"state": "ON", "color": {"r": 255, "g": 108, "b": 0}, "transition": 1.0}, this repeats around every 30 seconds.
I disabled flux, looked at my heap, and every time I issue the command manually via MQTT I can see the heap size drop with almost the same amount of bytes. This doesn't ALWAYS seem to happen, if I restart the Hub, issue the same command, usually it just works... Until it doesn't, and then every new command will drop the heap further.
Now here's an interesting bit I discovered. I had a look around the REST documentation and noticed the /transitions commands and the ability to delete transitions. The /transitions call in my case returned a lot of records. This when at that time there was only 1 light turned on, currently not transitioning into anything. Even more interesting, for every record the 'period' was a large number which I don't think came from hass/flux. I assume the /transitions should not return anything when no transitions are happening, correct?
Guess what happened when I started deleting these transitions...
After clearing it's been stable with no "stuck' transitions showing when calling /transitions. I'll keep on the heap/transitions and check if I can reproduce this behavior.
If you've got any other ideas/things I could try any help is appreciated.
Hm, interesting. Can you paste the output of /transitions
? I can't think of a reason transitions would get created out of band from the API.
Here's an example, note that the one with period 500 disappears when querying again, as I would expect. The ones with period 4294967295 seem to just remain there until I manually clear them, which bumps the free heap back up.
{
"transitions": [
{
"id": 1520,
"period": 4294967295,
"last_sent": 0,
"bulb": {
"device_id": 1,
"group_id": 1,
"device_type": "rgb_cct"
},
"type": "color",
"current_color": [
255,
167,
91
],
"end_color": [
255,
167,
91
],
"step_sizes": [
0,
0,
0
]
},
{
"id": 1526,
"period": 4294967295,
"last_sent": 0,
"bulb": {
"device_id": 1,
"group_id": 1,
"device_type": "rgb_cct"
},
"type": "color",
"current_color": [
255,
167,
91
],
"end_color": [
255,
167,
91
],
"step_sizes": [
0,
0,
0
]
},
{
"id": 1532,
"period": 4294967295,
"last_sent": 0,
"bulb": {
"device_id": 1,
"group_id": 1,
"device_type": "rgb_cct"
},
"type": "color",
"current_color": [
255,
167,
91
],
"end_color": [
255,
167,
91
],
"step_sizes": [
0,
0,
0
]
},
{
"id": 1549,
"period": 500,
"last_sent": 59762040,
"bulb": {
"device_id": 5,
"group_id": 1,
"device_type": "rgbw"
},
"type": "change_on_finish",
"field": "status",
"value": 1,
"child": {
"type": "field",
"field": "level",
"current_value": 0,
"end_value": 0,
"step_size": -25
}
}
]
}
I just wrote a node-red flow that triggers when free heap gets below a certain value. It will fetch the current transitions with the help of /transitions, and then call the delete function for every item in the list. While doing this I noticed that the transitions call only returns around 4 items, even though there seem to be more since I had to run my command multiple times to get everything cleared. I noticed this as well when doing it manually, but initially thought I was just too slow.
It seems to be working great however, guess where it triggered:
Ill do some further testing and share the flow in case someone else needs a workaround.
That's very helpful, thanks. If we look at the color transitions, we see this:
"current_color": [
255,
167,
91
],
"end_color": [
255,
167,
91
],
"step_sizes": [
0,
0,
0
]
Which means it's trying to transition from a color to itself. That's why the period is enormous as well (it's the maximum value of a signed integer).
I think the fix here is to just change this:
https://github.com/sidoh/esp8266_milight_hub/blob/master/lib/Transitions/ColorTransition.cpp#L136
to
bool ColorTransition::isFinished() {
return currentColor == endColor;
}
Flashed the new firmware and will report back with results, thanks again for your time!
Just to make sure I'm clear, I did not make this change yet. 😅
Just to make sure I'm clear, I did not make this change yet. 😅
But I did ;)
Excellent! Thanks for testing it out. Feel free to open a PR if it does the trick.
The fix is working great!
Transitions are no longer getting stuck and heap-size is staying stable around 14000-16000