NimBLE-Arduino icon indicating copy to clipboard operation
NimBLE-Arduino copied to clipboard

NimBLE_Scan_Continuous.ino possible heap memory leak?

Open nravanelli opened this issue 1 year ago • 19 comments

I have been noticing a potential memory leak in the continuous scan example. I have only added the following to the end of void loop():

Serial.print("HEAP: ");
Serial.println(ESP.getFreeHeap());
delay(2000);

And I see heap memory use reduce continuously:

14:15:03.247 -> HEAP: 190336
14:15:05.233 -> HEAP: 190208
14:15:07.242 -> HEAP: 190208
14:15:09.245 -> HEAP: 190208
14:15:11.230 -> HEAP: 190208
14:15:13.242 -> HEAP: 190208
14:15:15.221 -> HEAP: 190080
14:15:17.219 -> HEAP: 190080
14:15:19.245 -> HEAP: 190080
14:15:21.244 -> HEAP: 189952
14:15:23.245 -> HEAP: 189952
14:15:25.247 -> HEAP: 189952
14:15:27.215 -> HEAP: 189952
14:15:29.229 -> HEAP: 189952
14:15:31.248 -> HEAP: 189952
14:15:33.254 -> HEAP: 189952
14:15:35.086 -> Advertised Device: Name: , Address: 34:cc:2b:5e:5b:86, manufacturer data: 4c000906036a0a640627 
14:15:35.211 -> HEAP: 189952
14:15:37.245 -> HEAP: 189952
14:15:38.702 -> Advertised Device: Name: , Address: e8:fc:60:39:21:0b, manufacturer data: 4c00121900e46fc9dc373dd41da5665fc8d8293373c06494dc5b4d0100 
14:15:39.217 -> HEAP: 189824
14:15:41.251 -> HEAP: 189824
14:15:43.235 -> HEAP: 189824

Is there something that is not being cleared properly in the example?

nravanelli avatar Sep 28 '23 18:09 nravanelli

Sorry, I'm not very experienced with BLE and have problems of my own, but here's what I've noticed: It looks like it always decreases by 72 bytes every 16~18 seconds or so whether or not a new advertised device is found. It's a pretty simple example program in my opinion. The only call I'm not familiar with is; NimBLEDevice::setScanDuplicateCacheSize(200);

I've never used this function, but it seems like it would be used to save the previously discovered mac addresses to compare against newly discovered mac addresses. Looking at the code backs that up. I see nothing obvious (which is pretty much all I would notice given my programming skills), but I'm guessing it has to do with the duplicate entries, which are not being reported due to being duplicates, are somehow still taking up heap space.

Sorry I can't help more than that. If you're up to it, here's the part of the code where the function is located. Maybe you can track it down if you really need: Starts on line 503. https://github.com/h2zero/NimBLE-Arduino/blob/release/1.4/src/NimBLEDevice.cpp

Before I go, I just noticed the comment above that function... here's the whole block from the above link:

/**
 * @brief Set the duplicate filter cache size for filtering scanned devices.
 * @param [in] cacheSize The number of advertisements filtered before the cache is reset.\n
 * Range is 10-1000, a larger value will reduce how often the same devices are reported.
 * @details Must only be called before calling NimBLEDevice::init.
 */
/*STATIC*/
void NimBLEDevice::setScanDuplicateCacheSize(uint16_t cacheSize) {
    if(initialized) {
        NIMBLE_LOGE(LOG_TAG, "Cannot change scan cache size while initialized");
        return;
    } else if(cacheSize > 1000 || cacheSize <10) {
        NIMBLE_LOGE(LOG_TAG, "Invalid scan cache size; min=10 max=1000");
        return;
    }

    m_scanDuplicateSize = cacheSize;
}

"* @param [in] cacheSize The number of advertisements filtered before the cache is reset.\n" The way this is worded makes me think each advertisement, whether it is the first or tenth time seeing it, gets added to the cache. Then after 200 duplicates, the cache gets dumped and the merry-go-round starts again. Each duplicate is recorded, so you could be seeing 100 advertisements from one nearby device taking up cache/heap but not being reported because it's not new.

If you try a smaller cache size or watch the heap longer, you might see the cache get dumped and restart. Then the heap would suddenly get smaller before looking like it is leaking again. At least that's the impression I get from that comment, and it backs up my initial guess. If it is recording duplicate advertisements to the cache instead of deleting duplicates, that would explain this behavior. Maybe a better method would involve deleting duplicates instead of just adding them to the cache if this is indeed the case. Sorry I can't help more than this, but it may be the intended functionality. We'll have to wait for word from h2zero.

Thanks, I can't say i've noticed but I will test and get back

h2zero avatar Oct 31 '23 19:10 h2zero

I've also noticed something similar. I recently migrated 5 devices to use NimBLE instead of the "native" Arduino BLE and all devices are exhibiting the same behaviour. I have put together a sample of the heap size over the last ~12 hours here.

It's worth noting that I know it's related to the NimBLE refactor because it's the only thing that was changed between these releases.

The relevant code is:

void connectToThermometer()
{
  NimBLEDevice::init("");
  NimBLEDevice::setPower(ESP_PWR_LVL_P9);

  pBLEScan = NimBLEDevice::getScan();
  pBLEScan->setAdvertisedDeviceCallbacks(new MyAdvertisedDeviceCallbacks(), false);
  pBLEScan->setInterval(97);
  pBLEScan->setWindow(37);
  pBLEScan->setMaxResults(0);
  pBLEScan->setActiveScan(true);
  startBLE();
}

void startBLE()
{
  pBLEScan->start(0, nullptr);
}

bool isBLEStopped()
{
  return pBLEScan->isScanning() == false;
}

inside the loop:

...
  if (isBLEStopped())
  {
    Serial.println('scanning failed, restarting ble');
    startBLE();
    delay(2000);
  }
...

scripter-co avatar Nov 10 '23 10:11 scripter-co

looking into this a bit further, it seems that m_advertisedDevicesVector inside of NimBLEScan.cpp slowly is increasing over time.

I've added logs into the file:

                 NIMBLE_LOGI(LOG_TAG, "New advertiser: %s", advertisedAddress.toString().c_str());
+                NIMBLE_LOGI(LOG_TAG, "Number of vectors: %d", pScan->m_scanResults.m_advertisedDevicesVector.size());
             } else if (advertisedDevice != nullptr) {
                 NIMBLE_LOGI(LOG_TAG, "Updated advertiser: %s", advertisedAddress.toString().c_str());

and i can see that over a while, this number slowly increases (using the code shown above).

I added the following:

diff --git a/NimBLE-Arduino-1.4.1/src/NimBLEScan.cpp b/NimBLE-Arduino-1.4.1/src/NimBLEScan.cpp
index d1c4879..bce89a1 100644
--- a/NimBLE-Arduino-1.4.1/src/NimBLEScan.cpp
+++ b/NimBLE-Arduino-1.4.1/src/NimBLEScan.cpp
@@ -23,7 +23,7 @@
 #include <climits>

 static const char* LOG_TAG = "NimBLEScan";
-
+static const size_t maxAdvertisedDevices = 2;

 /**
  * @brief Scan constuctor.
@@ -110,6 +110,11 @@ NimBLEScan::~NimBLEScan() {
                     return 0;
                 }

+                if (pScan->m_scanResults.m_advertisedDevicesVector.size() >= maxAdvertisedDevices) {
+                    delete pScan->m_scanResults.m_advertisedDevicesVector.front();
+                    pScan->m_scanResults.m_advertisedDevicesVector.erase(pScan->m_scanResults.m_advertisedDevicesVector.begin());
+                }
+
                 advertisedDevice = new NimBLEAdvertisedDevice();
                 advertisedDevice->setAddress(advertisedAddress);
                 advertisedDevice->setAdvType(event_type, isLegacyAdv);

to test the theory that this was the issue and it did indeed make the memory to stabilise, so everything seems to be pointing to this vector being the problem (and old entries not being deleted + erased correctly). The patch above is obviously not a permanent fix but should help to implement a long term solution.

scripter-co avatar Nov 15 '23 01:11 scripter-co

I have looked into this and the cause is due to active scanning and how it is handled. When a device is advertising with a scan response available the device is added to the devices vector and awaits the scan response data before calling the callback and is not deleted from the vector until the scan response is received. The problem occurs when the scan response is never received, which happens quite regularly. To work around this (for now) I suggest stopping and restarting the scan occasionally to clear out the vector, or do not enable active scanning. I will contemplate options for the future to resolve this internally.

h2zero avatar Nov 17 '23 15:11 h2zero

Thanks @h2zero for giving more clarity here. I need to maintain active scanning for various reasons, so would the temporary solution be to simply call pBLEScan->stop(); after each active scan? or should we restart the BLE stack completely?

In the documentation, the final pBLEScan->start() object states that when false, it clears the previous scan... is that not correct or does it simply not reuse the previous scan results?

nravanelli avatar Nov 17 '23 16:11 nravanelli

I would suggest using a timer that triggers say every 15 or 20 mins and in the callback stop the scan with pBLEScan->stop();.

In the documentation, the final pBLEScan->start() object states that when false, it clears the previous scan... is that not correct or does it simply not reuse the previous scan results?

The documentation is correct, when the is_continue parameter is false the vector will be cleared before the scan is started.

I would also add that you will need to delay for a couple milliseconds after stopping the scan before restarting for the stack to cleanup.

h2zero avatar Nov 17 '23 16:11 h2zero

I have looked into this and the cause is due to active scanning and how it is handled. When a device is advertising with a scan response available the device is added to the devices vector and awaits the scan response data before calling the callback and is not deleted from the vector until the scan response is received. The problem occurs when the scan response is never received, which happens quite regularly. To work around this (for now) I suggest stopping and restarting the scan occasionally to clear out the vector, or do not enable active scanning. I will contemplate options for the future to resolve this internally.

That was the conclusion I came to as well. Would an option to clear out items from the vector that haven't advertised in N seconds?

I would suggest using a timer that triggers say every 15 or 20 mins and in the callback stop the scan with pBLEScan->stop();.

The reason I migrated to NimBLE is because i found that starting and stopping BLE on the ESP32 eventually causes a OWDT, restart (and using HTTP + BLE at the same time wasn't possible without using NimBLE so had to stop BLE before making a HTTP request). I'd half narrowed it down to the continuous starting/stopping of BLE (although I was doing this every 1 minute instead of 15/20 as suggested in this thread).

For me, I can disable active scanning for now, see how the devices behave and report back. Obviously this doesn't help @nravanelli.

scripter-co avatar Nov 18 '23 14:11 scripter-co

I'm thinking about implementing a scan response timer that will invoke the advertised device callback and then remove it from the vector when max devices is 0 if the scan response doesn't come within 500ms.

h2zero avatar Nov 19 '23 00:11 h2zero

For me, I can disable active scanning for now, see how the devices behave and report back. Obviously this doesn't help @nravanelli.

just to follow up, as expected disabling active scanning does indeed behave normally and there's no memory leak.

scripter-co avatar Nov 23 '23 11:11 scripter-co

@scripter-co @nravanelli I have created a branch that should resolve this here: https://github.com/h2zero/NimBLE-Arduino/tree/sr-timer

Please try it out and let me know, thanks!

h2zero avatar Nov 24 '23 15:11 h2zero

Ill give it a go now, thanks!

nravanelli avatar Nov 24 '23 15:11 nravanelli

@h2zero I have been running an ESP32 with your new fixes to active scan for a little over 40 minutes and HEAP use has stayed constant at 229108 with the same example continuous scan with active scanning.

So I'd say this is resolved!

nravanelli avatar Nov 24 '23 16:11 nravanelli

@scripter-co @nravanelli I have created a branch that should resolve this here: https://github.com/h2zero/NimBLE-Arduino/tree/sr-timer

Please try it out and let me know, thanks!

looks good. question though, if max_results != 0 it seems like this memory leak would still occur?

scripter-co avatar Nov 26 '23 22:11 scripter-co

@scripter-co @nravanelli I have created a branch that should resolve this here: https://github.com/h2zero/NimBLE-Arduino/tree/sr-timer

Please try it out and let me know, thanks!

looks good. question though, if max_results != 0 it seems like this memory leak would still occur?

Indeed, the error persists when ‘max_results !=0’. In my actual code I iterate through the scan results after the scan times out and I can see the heap leak persists. When used within the example “scan continuous” example it is fixed

nravanelli avatar Nov 26 '23 22:11 nravanelli

I was making the assumption that if max_results is not 0 that the scan would stop and be restarted at some point, so not continuous. When restarting the scan the results are typically cleared at that time. If the scan is not restarted then yes the memory would still be occupied, in that case a call to clearResults should be done.

h2zero avatar Nov 26 '23 23:11 h2zero

i'm struggling to find it, could you point out the line where maxResults != 0 would cause the vector to be cleared? or are you saying that the client (consumer of NimBLE) would clear the results themselves?

scripter-co avatar Nov 26 '23 23:11 scripter-co

Not sure I follow? The process is that if scanning continuously with max results = 0 that after the callback is called the device is deleted from the vector. If max results is not 0 devices will be added to the vector until the scan stops and only deleted when the scan is started again, with the isContinue flag set to false.

h2zero avatar Nov 26 '23 23:11 h2zero