esp-matter
esp-matter copied to clipboard
light-demo has slow memory leak (CON-817)
I boot light demo let it stabilize with a wildcard report monitoring it.
Totals:
free 81128 allocated 148396 min_free 63104 largest_free_block 65536
then five minutes later
Totals:
free 81132 allocated 148392 min_free 63104 largest_free_block 65536
i took the wifi router up and down
Totals:
free 81528 allocated 148028 min_free 63104 largest_free_block 65536
then I let it sit for two days. During those two days the monitoring wild card lost contact and then reestablished it several times. The wildcard app is on my phone, so when I went out of range of the router it lost contact.
Totals:
free 76084 allocated 153076 min_free 18780 largest_free_block 65536
It is slowly allocating more and more memory so something is leaking. I went from 148396 to 153076 over the course of two days. So I have lost 4680 bytes of memory. I think this is behind the packet buffer allocation failures.
I attached the change I made to light demo and the sdkconfig. sdkconfig.txt light.diff.txt esp-matter 1cacb8b2fcd42c465e38317d3a63d0169460b397 I am using an esp32 dev board with PSRAM esp-idf e088c3766ba440e72268b458a68f27b6e7d63986
In another experiment I turned on the leak tracking support in the IDF. I could not figure out how to get useful data out of it because I don't have a good place to start/stop the data collection. leak.txt That trace is from my code which also has BT turned on. The start/stop was done about three hours apart.
You can see what the problem is. I can't tell allocations which are two hours old from transient allocations which will be less than a second old. Many of those entries are not leaks, they are transient allocations coming from other tasks in the system like wifi. Those transient allocations just happened to be active when I did STOP on the leak detector. The dump info needs to include the age of the allocation. Once you have the age, the oldest ones will be the likely leaks. A two hour age will be a leak, a 1 second age transient.
BTW, I don't think (not 100% sure) it leaks just sitting there. The leaks appear to be something either with the router disconnect/reconnect or the wild card monitor disconnect/reconnect.
I ran for two more days. It does not appear to leak just sitting idle if none of the connections are disrupted. This implies the leak has to be related to various disruptions like disconnect/reconnect events. When I first observed the leak I was starting and stopping my Android app all of the time.
This could be a potential leak source: https://github.com/espressif/esp-matter/issues/656
I have not added enough debug tracking to know.
There is definitely a slow leak in light demo code. I have increased the queue size so that I don't run out of queue space anymore.
When I first booted:
free 148716 allocated 99252 min_free 58048 largest_free_block 65536
I (14027469) app_main: Free internal 148720 biggest block 65536 total 141948
then 3.6 days later
I (318059696) app_main: Free internal 139116 biggest block 65536 total 132344
free 139124 allocated 107204 min_free 55780 largest_free_block 65536
So over the course of 3.6 days something in the light demo slowly allocated 7,952 bytes of memory. What I can't easily tell is if this is due to a few larger allocation, or many small ones. That's because there is a constant churn of things being allocated and freed. When I randomly do snapshots the snapshots contains these transient objects so the trend line has a lot of noise in it.
One solution would be to modify the the ESP heap debugging code to track the age of objects, then you could ignore all of the initial and recent allocations. What you want to say is -- dump me all objects over an hour old and less than three days old. Those will be the leaks. There are only about 600 persistent allocations involved so with PSRAM you have plenty of memory to track things in detail.
I had light example running again for 3.9 days. I need to reboot it to change config. I am running light example using common ESP32 dev boards since I have a lot of those.
Start 146,852 allocated End 153,924 allocated
7,024 lost. For sure there is noise in these numbers due to wifi activity. Previous run lost 7,952 in 3.6 days. Previous run was likely interrupted when it had some wifi activity going on. But there is a definite trend line due to leaks which keeps going up.
Using...
commit c707e6680ac39df56b50addfe6792d49a60f2f90 (origin/main, origin/HEAD, main)
Merge: bbdf797 9e53da8
Author: Shu Chen <[email protected]>
Date: Wed Oct 18 16:10:33 2023 +0800
@jonsmirl We are narrowing down the source of the leak
I am not seeing leaks in esp-matter anymore.