
Memory optimization for loading weights for no_std mode

Open · HerrMuellerluedenscheid opened this issue 9 months ago · 9 comments

Hey folks,

Thanks for building a genius framework! From a recent issue, you probably remember that I tried running the SqueezeNet example on an ESP32. I switched to an ESP32-S3 with 8 MB PSRAM. After failing to run it there due to allocation failures, I started a discussion in the #esp-rs:matrix.org chat, which turned out to be super fruitful (BIG shoutout!).

A few key findings and questions that I will try to summarize from the thread:

  1. Burn starts by taking the neural network, pushing it into a Vec, and then cloning it, leading to 5 MB of RAM usage for something that is already readable in flash. More specifically, the first part is the generated squeezenet1.rs (generated in target/<ARCH>/debug/build/squeezenet-burn-whatever/out/model):
impl<B: Backend> Model<B> {
    pub fn from_embedded(device: &B::Device) -> Self {
        let record = BinBytesRecorder::<HalfPrecisionSettings>::default()
            .load(EMBEDDED_STATES.to_vec(), device) // <-- here is an allocation that shouldn't be needed
            .expect("Should decode state successfully");
        Self::new(device).load_record(record)
    }
}
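The `.to_vec()` above is the copy in question. In plain Rust, one way to let an API accept either flash-resident bytes or an owned buffer, without forcing a copy in the borrowed case, is `Cow<'static, [u8]>`. A minimal sketch; `load_states` is a hypothetical function for illustration, not Burn's actual API:

```rust
use std::borrow::Cow;

// Hypothetical loader signature: accepts either borrowed static bytes
// (e.g. data living in flash) or an owned Vec, so embedded callers can
// skip the copy entirely. A real loader would deserialize here; this
// one just reports the payload length.
fn load_states(bytes: Cow<'static, [u8]>) -> usize {
    bytes.len()
}

static EMBEDDED_STATES: &[u8] = &[0u8; 16];

fn main() {
    // Borrowed: no heap allocation for the payload.
    let n = load_states(Cow::Borrowed(EMBEDDED_STATES));
    assert_eq!(n, 16);
    // Owned buffers still work through the same API.
    let n = load_states(Cow::Owned(vec![1u8, 2, 3]));
    assert_eq!(n, 3);
}
```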
  2. burn-core's Recorder::load clones the model again. Is that really necessary?
    fn load<R>(&self, args: Self::LoadArgs, device: &B::Device) -> Result<R, RecorderError>
    where
        R: Record<B>,
    {
        let item: BurnRecord<R::Item<Self::Settings>, B> =
            self.load_item(args.clone()).map_err(|err| { // <-- here
  3. Can Burn operate directly off the EMBEDDED_STATES, i.e. without copying the model to RAM?

  4. Ideally, the model should be split into read-only and read-write parts, with the read-only part used as-is, i.e. flashed un-decoded.
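For the "read-only, flashed un-decoded" idea, the underlying Rust technique is reinterpreting an aligned static byte region as a typed slice without copying. A sketch assuming native-endian f32 weights in a 4-byte-aligned region; this is not how Burn actually stores records:

```rust
/// Zero-copy view of a byte region as f32 weights. Returns None if the
/// region is not 4-byte aligned or its length is not a multiple of 4.
fn view_as_f32(bytes: &[u8]) -> Option<&[f32]> {
    // Safety: f32 has no invalid bit patterns, so reinterpreting aligned
    // bytes is sound. Note this reads native-endian values, so the file
    // format and target endianness must agree.
    let (prefix, floats, suffix) = unsafe { bytes.align_to::<f32>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(floats)
    } else {
        None
    }
}

fn main() {
    let weights: [f32; 3] = [1.0, -2.5, 3.25];
    // Bytes as they would sit in a flashed, properly aligned blob.
    let bytes: &[u8] = unsafe {
        std::slice::from_raw_parts(weights.as_ptr() as *const u8, weights.len() * 4)
    };
    // The view borrows the original region; nothing is copied to RAM.
    let view = view_as_f32(bytes).expect("aligned f32 region");
    assert_eq!(view, &weights[..]);
}
```

Crates like bytemuck wrap the same cast in a safe API; the point is that a correctly laid-out static region needs no decode step at all.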

Apparently there is room for improvement when running models in no_std, which I guess will also be beneficial when running with std. Looking forward to hearing your thoughts.

HerrMuellerluedenscheid avatar Mar 05 '25 09:03 HerrMuellerluedenscheid

Yes, I agree there is a lot of room for improvement in loading weights. We haven't focused on this initially, but I think this is the perfect time. Since you're in the weeds of it, it would be great if you tried using more efficient Rust APIs to consume the existing preallocated memory without duplicating it (even temporarily). I know there are some. We will review your PR.

antimora avatar Mar 05 '25 18:03 antimora

I have the exact same issues, and I noticed the same things on the Raspberry Pi Pico. I would be willing to tackle this issue as well with a team I'm working with. Is there an active PR for this, or should I create one?

BjornTheProgrammer avatar Mar 06 '25 19:03 BjornTheProgrammer

@BjornTheProgrammer I just pushed some experiments to https://github.com/tracel-ai/burn/pull/2881. On my ESP32-S3 it no longer panics because of allocation failures. I'm getting this error now instead: /crates/burn-core/src/record/memory.rs:39:85: called `Result::unwrap()` on an `Err` value: InvalidIntegerType { expected: U32, found: Reserved }. But I consider this already some success with regard to memory. So feel free to take a look, modify, be inspired :) Would love to have this working on some MCUs.

HerrMuellerluedenscheid avatar Mar 09 '25 21:03 HerrMuellerluedenscheid

This is what I'm seeing using BinBytesRecorder to load a ~256 KB model. I presume some of it comes from bincode deserializing.

Before loading: Stats {
    allocations: 0,
    deallocations: 0,
    reallocations: 0,
    bytes_allocated: 0,
    bytes_deallocated: 0,
    bytes_reallocated: 0,
}
Stats at 1: Stats {
    allocations: 343,
    deallocations: 246,
    reallocations: 0,
    bytes_allocated: 1072824,
    bytes_deallocated: 802391,
    bytes_reallocated: 0,
}
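Stats like the above can be collected by wrapping the system allocator in a counting `#[global_allocator]` (the output format above resembles the stats_alloc crate's `Stats`). A minimal sketch of the technique, with made-up names; a real tool would also track deallocations and reallocations:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Counters updated on every heap allocation in the program.
static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);
static BYTES_ALLOCATED: AtomicUsize = AtomicUsize::new(0);

// Allocator wrapper that counts calls, then delegates to the system
// allocator. The default alloc_zeroed/realloc impls route through
// alloc, so those are counted too.
struct Counting;

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        BYTES_ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: Counting = Counting;

fn main() {
    let before = BYTES_ALLOCATED.load(Ordering::Relaxed);
    let buf = vec![0u8; 4096]; // simulate a deserialization scratch buffer
    let after = BYTES_ALLOCATED.load(Ordering::Relaxed);
    assert!(after - before >= 4096);
    drop(buf);
    assert!(ALLOCATIONS.load(Ordering::Relaxed) >= 1);
}
```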

I've been trying to add zero-copy serialization with rkyv, but it's not that straightforward: rkyv requires the Archive annotation, and I can't figure out where in the macros the Serialize annotation is added to the modules, or how it is handled. I'd like pub trait Recorder... save_item and load_item to become fn save_item<I: Serialize + Archive>(....

ionspin avatar Mar 16 '25 19:03 ionspin

Wow! This is a great find @ionspin! I'll experiment using rkyv on my fork as well in #2892. Maybe we could approach it by first creating a new recorder.

BjornTheProgrammer avatar Mar 16 '25 20:03 BjornTheProgrammer

> Wow! This is a great find @ionspin! I'll experiment using rkyv on my fork as well in #2892. Maybe we could approach it by first creating a new recorder.

I agree, and that's exactly where I got stuck (please note that I've never used rkyv before, so I might be going the wrong way): the Recorder interface has a Serialize bound in

fn save_item<I: Serialize>(...

and similarly in its counterpart load_item, while rkyv requires Archive. I think there are quite a few places that would need an Archive annotation, so I'd like to figure those out before continuing.

Is there a way to use rkyv without Archive?

I'm almost tempted to grab the weight tensors directly from my model and serialize and deserialize them myself as a quick hack.
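That quick hack needs nothing but the standard library: dump the f32 weights as raw little-endian bytes and read them back. A sketch with hypothetical helper names, no serialization framework involved:

```rust
// Serialize a tensor's f32 weights as raw little-endian bytes.
fn encode_f32s(weights: &[f32]) -> Vec<u8> {
    weights.iter().flat_map(|w| w.to_le_bytes()).collect()
}

// Deserialize the raw bytes back into f32 values. Trailing bytes that
// don't form a full f32 are ignored by chunks_exact.
fn decode_f32s(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    let weights = vec![0.5f32, -1.25, 3.0];
    let bytes = encode_f32s(&weights);
    assert_eq!(bytes.len(), 12); // 3 values * 4 bytes each
    assert_eq!(decode_f32s(&bytes), weights);
}
```

This still copies into RAM on decode, so it sidesteps the trait-bound question rather than solving the zero-copy one.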

ionspin avatar Mar 16 '25 20:03 ionspin

Oh, and also note that I'm not claiming all of the allocations I posted come from bincode. I haven't looked closely at what happens after the bytes are deserialized by bincode, and there are probably more allocations/deallocations there.

ionspin avatar Mar 16 '25 20:03 ionspin

My PR #2892 has been merged! The memory savings weren't quite what I expected. I believe most of the optimization will probably come from the executor backend. I'm going to try to create some tooling or a process for inspecting memory usage in depth, to really discover where the best savings can come from.

BjornTheProgrammer avatar Apr 08 '25 21:04 BjornTheProgrammer

I've created a new pull request that addresses this issue. From the PR (#3615):

An example repo has been constructed and run on dev, but more cleanup and fixing of old tests need to be done first. Preliminary findings show that it can save about 40% of memory for some models. The more layers, the more savings should be expected.

input: 0 - output: [0.008308247]
max allocated with segmented Model loader: 799780
input: 0 - output: [0.008308247]
max allocated with Model: 1330576
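To illustrate why a segmented loader helps: if each layer's scratch buffer is dropped before the next layer is decoded, peak scratch memory is the largest single layer rather than the sum of all layers. A toy model of that accounting (the layer sizes are made up, not taken from the measurements above):

```rust
// All-at-once loading keeps every layer's scratch buffer live at once,
// so peak scratch memory is the sum of the layer sizes.
fn peak_scratch_all_at_once(layer_sizes: &[usize]) -> usize {
    layer_sizes.iter().sum()
}

// Segmented loading decodes one layer at a time and frees the buffer
// before the next, so the peak is only the largest layer.
fn peak_scratch_segmented(layer_sizes: &[usize]) -> usize {
    layer_sizes.iter().copied().max().unwrap_or(0)
}

fn main() {
    let layers = [100, 40, 10]; // arbitrary illustrative sizes
    assert_eq!(peak_scratch_all_at_once(&layers), 150);
    assert_eq!(peak_scratch_segmented(&layers), 100);
}
```

This also matches the observation that models with more layers should see larger savings: the sum grows with layer count while the max does not.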

BjornTheProgrammer avatar Aug 26 '25 01:08 BjornTheProgrammer

We are close to having zero-copy weight loading. It will be possible to initialize tensor data and tensors from a static region.

Here is the plan:

  1. https://github.com/tracel-ai/burn/pull/4100 (burnpack data is aligned) - done
  2. Use the burnpack store in place of records in burn-import.
  3. Implement CubeCL's Bytes allocation trait to use slices from the static region.
  4. Update burn-import's embed method to embed the burnpack file, which has aligned, offsetted tensor data (along with metadata).
  5. Update the embedded example.
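On the alignment point in step 1: `include_bytes!` gives no alignment guarantee, so a common Rust trick for an aligned static region is wrapping the bytes in an over-aligned struct. A sketch of the pattern; the names and the 64-byte alignment are illustrative, not the actual burnpack layout:

```rust
// Force the embedded payload to a chosen alignment so typed tensor
// views can point into it without copying.
#[repr(C, align(64))]
struct Aligned<const N: usize>([u8; N]);

// In a real build this would wrap the flashed file, e.g.:
//   static PACK: &Aligned<LEN> = &Aligned(*include_bytes!("model.burnpack"));
// Here a small literal stands in for the file contents.
static PACK: Aligned<8> = Aligned([1, 2, 3, 4, 5, 6, 7, 8]);

fn main() {
    // The payload starts at a 64-byte-aligned address, so any tensor
    // offset that is itself aligned stays aligned in the final binary.
    let addr = PACK.0.as_ptr() as usize;
    assert_eq!(addr % 64, 0);
    assert_eq!(PACK.0[0], 1);
}
```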

antimora avatar Dec 03 '25 14:12 antimora

Submitted a PR to migrate from record type to burnpack https://github.com/tracel-ai/burn/pull/4122

antimora avatar Dec 04 '25 14:12 antimora

Zero-copy in burn-store and CubeCL's Bytes: https://github.com/tracel-ai/burn/issues/4123

antimora avatar Dec 04 '25 16:12 antimora

Submitted a PR for CubeCL's Bytes: https://github.com/tracel-ai/cubecl/pull/1093

antimora avatar Dec 04 '25 20:12 antimora