
Custom memory allocation/planning that may not be that custom

Open nicklasb opened this issue 1 year ago • 3 comments

Hi, I am currently trying to run a large model (YOLO) on the ESP32-S3 (and later the P4). Since the model (> 2 MB) doesn't fit in SRAM and TFLite then wants > 5 MB for the arena, I am relegated to PSRAM, and consequently inference takes 40 seconds. I could probably optimize the model quite a bit, but it will never fit in 520 KB (or the 768 KB of the ESP32-P4).

However, tflite-micro does not use the SRAM at all because the arena is allocated as one big blob, and the small built-in caches don't do much either. I have tried providing a MicroAllocator with separate non-persistent and persistent buffers, but the SRAM still ends up either having to be > 5 MB or going unused.

So I am now working on creating a custom MicroAllocator and MicroMemoryPlanner: basically I want to do small allocations in SRAM and large ones in PSRAM, as I think that would minimize the SPI overhead while fitting in the ~250 KB of SRAM I have available. However, there seems to be a lot going into this, and I have currently rather broken everything, so it wants 56 MB from my SRAM. :-)
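To make the routing idea concrete, here is a minimal, framework-independent sketch (all names here are hypothetical, not tflite-micro API): a size-threshold bump allocator that serves small requests from a static fast pool and spills large ones to a slow pool.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical two-pool bump allocator: requests at or below a size
// threshold go to "fast" (SRAM-like) memory, everything else to
// "slow" (PSRAM-like) memory. No alignment or freeing handled here.
class SplitAllocator {
 public:
  SplitAllocator(uint8_t* fast, size_t fast_size,
                 uint8_t* slow, size_t slow_size, size_t threshold)
      : fast_(fast), fast_size_(fast_size),
        slow_(slow), slow_size_(slow_size), threshold_(threshold) {}

  // Returns nullptr when the chosen pool is exhausted.
  uint8_t* Allocate(size_t bytes) {
    if (bytes <= threshold_ && fast_used_ + bytes <= fast_size_) {
      uint8_t* p = fast_ + fast_used_;
      fast_used_ += bytes;
      return p;
    }
    if (slow_used_ + bytes <= slow_size_) {
      uint8_t* p = slow_ + slow_used_;
      slow_used_ += bytes;
      return p;
    }
    return nullptr;
  }

  // True if the pointer landed in the fast pool.
  bool InFast(const uint8_t* p) const {
    return p >= fast_ && p < fast_ + fast_size_;
  }

 private:
  uint8_t* fast_;
  size_t fast_size_;
  size_t fast_used_ = 0;
  uint8_t* slow_;
  size_t slow_size_;
  size_t slow_used_ = 0;
  size_t threshold_;
};
```

On an ESP32 the slow pool would come from PSRAM (e.g. via `heap_caps_malloc` with the SPIRAM capability flag) and the threshold would be a tuning knob per model.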

I'll sort that out eventually. But it seems likely that others will end up needing the same functionality: when a fairly normal ESP32 can suddenly infer "real" deep learning models, a lot of use cases pop up. So I am wondering whether there is any work going on in this area, so that I am not burning my own neurons unnecessarily? :-)

nicklasb avatar Aug 29 '24 10:08 nicklasb

So I have now tried most of the things I can think of:

  1. I have created custom allocators and planners, and I keep running into the same problem: at the end of tensor allocation, TF resizes the tensor arena to its maximum size in one big chunk, in a way that I haven't been able to do anything smart about.
  2. I have also tried using a custom SingleArenaBufferAllocator, which overrides the allocations and can make smart decisions; however, again it is effectively not allowed to do anything about the final resize.
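For readers unfamiliar with why the final resize is hard to intercept, here is a sketch of the head/tail arena layout that a single-arena allocator uses (simplified and hypothetical, not the actual tflite-micro implementation): persistent allocations grow down from the tail, and the non-persistent "head" can legally be resized to claim everything the tail left over.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified head/tail arena: persistent buffers grow down from the
// end; the non-persistent head grows up from the start and may be
// resized to the entire remaining space in one step.
class HeadTailArena {
 public:
  HeadTailArena(uint8_t* buf, size_t size) : buf_(buf), size_(size) {}

  // Persistent allocation from the tail, growing downward.
  uint8_t* AllocatePersistent(size_t bytes) {
    if (head_ + bytes > size_ - tail_used_) return nullptr;
    tail_used_ += bytes;
    return buf_ + size_ - tail_used_;
  }

  // The head can be resized up to whatever the tail left over --
  // this is the "one big chunk" grab described above.
  bool ResizeHead(size_t bytes) {
    if (bytes > size_ - tail_used_) return false;
    head_ = bytes;
    return true;
  }

  size_t AvailableForHead() const { return size_ - tail_used_; }

 private:
  uint8_t* buf_;
  size_t size_;
  size_t head_ = 0;
  size_t tail_used_ = 0;
};
```

A custom allocator that wraps this layout sees only one opaque resize request, which is why splitting it across memory regions after the fact is awkward.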

I have made some gains under some circumstances, but I am encountering strange behaviours that indicate I don't have the full picture. It seems like some allocations are getting lost in some situations, for example.

For microcontrollers with very little SRAM (as discussed in #2627) that nevertheless start to pack some punch, like the ESP32-P4, we are on the edge(!) of doing real deep learning in usable time frames, and these optimizations will basically make or break lots of use cases. Thus it would be very beneficial if there were ways to instruct TFLite Micro to be more discerning in its memory allocation.

For example, if some operations are known to be very memory-intensive, they could be moved to a special high-speed arena; the allocation plans could have a field requesting that a specific buffer reside in "fast memory", or similar. There are even more classes of fast memory available now.
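As a rough illustration of what such an annotated plan could look like (entirely hypothetical, not an existing tflite-micro structure): each buffer request carries a preferred memory class, and a pre-pass demotes fast-memory requests that don't fit the fast budget before final placement.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical annotated allocation plan: each request names a
// preferred memory class; requests that overflow the fast budget are
// demoted to bulk (PSRAM-like) memory before placement.
enum class MemClass { kFast, kBulk };

struct BufferRequest {
  size_t bytes;
  MemClass preferred;
};

// Walks the plan in order (a real planner would prioritize by size or
// access frequency) and returns the fast-memory bytes actually used.
size_t FitToFastBudget(std::vector<BufferRequest>& plan,
                       size_t fast_budget) {
  size_t fast_total = 0;
  for (auto& req : plan) {
    if (req.preferred != MemClass::kFast) continue;
    if (fast_total + req.bytes <= fast_budget) {
      fast_total += req.bytes;
    } else {
      req.preferred = MemClass::kBulk;  // spill to slow memory
    }
  }
  return fast_total;
}
```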

nicklasb avatar Sep 05 '24 23:09 nicklasb

For the issue during tensor allocation where it allocates the entire remaining arena, I wonder if it would be viable not to use the arena for that.

I believe you're referring to this, right? Basically, it allocates the entire remaining portion of the arena to use as a "planner" arena for creating the actual memory plan. The lifetime of this allocation is entirely contained within AllocateTensors. Various bookkeeping data structures are allocated from it to figure out an ideal memory plan for temporary tensors & scratch buffers. The issue that I imagine you're running into is that with reduced arena sizes, you might not have enough memory for planning. Would it be possible for you to provide another, larger arena just for planning? You could potentially just inject an address/size at AllocateTensors for a planner_arena. This wouldn't need to persist beyond initialization.
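The shape of that suggestion could look something like this (a hypothetical sketch, not tflite-micro code): planner bookkeeping lives in a caller-supplied scratch region whose lifetime ends when planning finishes, so the main tensor arena only needs to hold the final placements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Final plan: one byte offset per tensor within the tensor arena.
struct Plan {
  std::vector<uint32_t> offsets;
};

// Hypothetical planner that keeps its bookkeeping in a separate
// scratch buffer instead of the tensor arena. Here the bookkeeping is
// just an offset table and the placement is naive back-to-back;
// a real planner would track lifetimes and overlap buffers.
Plan PlanWithScratch(const std::vector<uint32_t>& tensor_sizes,
                     uint8_t* scratch, size_t scratch_size) {
  const size_t needed = tensor_sizes.size() * sizeof(uint32_t);
  assert(needed <= scratch_size);  // planner arena must be big enough
  uint32_t* offsets = reinterpret_cast<uint32_t*>(scratch);
  uint32_t offset = 0;
  for (size_t i = 0; i < tensor_sizes.size(); ++i) {
    offsets[i] = offset;
    offset += tensor_sizes[i];
  }
  // Copy the finished plan out; the caller may free or reuse the
  // scratch region immediately after this returns.
  Plan plan;
  plan.offsets.assign(offsets, offsets + tensor_sizes.size());
  return plan;
}
```

On the ESP32 that scratch region could even live in PSRAM, since planning is a one-time cost at initialization rather than part of the inference hot path.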

rascani avatar Sep 23 '24 20:09 rascani

I believe you're referring to this, right? Basically, it allocates the entire remaining portion of the arena to use as a "planner" arena for creating the actual memory plan.

...

Would it be possible for you to provide another, larger arena just for planning? You could potentially just inject an address/size at AllocateTensors for a planner_arena. This wouldn't need to persist beyond initialization.

Possibly. Honestly, it would be great if TF Lite "Micro" had some more standard microcontroller-oriented settings, as memory is always very scarce. I have beaten myself up a bit too much on this angle for now; instead I am trying to see what the caching features on the P4 provide. I will probably revisit this unless I go with a more custom ESP-DL solution.

nicklasb avatar Sep 23 '24 21:09 nicklasb

"This issue is being marked as stale due to inactivity. Remove label or comment to prevent closure in 5 days."

github-actions[bot] avatar Oct 19 '24 10:10 github-actions[bot]

I am trying with ESP-DL instead. Closing.

nicklasb avatar Oct 19 '24 10:10 nicklasb