
Metal memory leak multiplying matrices

Open ealmloff opened this issue 1 year ago • 5 comments

Running this code, which multiplies a 784x100 matrix by a 100x10 matrix, seems to leak memory. Memory usage gradually increases to more than 5 GB when running with the `metal` feature enabled in release mode on commit 2b10aaa:

use anyhow::Result;
use candle_core::{backend::BackendDevice, Device, MetalDevice, Tensor};

fn main() -> Result<()> {
    let device = Device::Metal(MetalDevice::new(0)?);
    let first = Tensor::randn(0f32, 1.0, (784, 100), &device)?;
    let second = Tensor::randn(0f32, 1.0, (100, 10), &device)?;
    loop {
        // The result tensor is dropped immediately, so memory use should stay
        // flat, but with the Metal backend it keeps growing.
        first.matmul(&second)?;
    }
}

On the CPU, memory usage stays steady at ~2 MB. This also seems to affect quantized matrix multiplication.
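
For reference, a quantized variant of the same loop would look roughly like this; it is a sketch from memory, and the exact quantized API calls (`QTensor::quantize`, `QMatMul::from_qtensor`) may differ a bit between candle versions:

use anyhow::Result;
use candle_core::{
    quantized::{GgmlDType, QMatMul, QTensor},
    Device, Module, Tensor,
};

fn main() -> Result<()> {
    let device = Device::new_metal(0)?;
    let input = Tensor::randn(0f32, 1.0, (784, 100), &device)?;
    // QMatMul multiplies by the transposed weight, so a (10, 100) weight gives
    // the same (784, 10) output as the f32 matmul above.
    let weight = Tensor::randn(0f32, 1.0, (10, 100), &device)?;
    // Assumes quantizing directly on the Metal device works in this candle
    // version; otherwise quantize on the CPU first.
    let qweight = QTensor::quantize(&weight, GgmlDType::Q4_0)?;
    let qmatmul = QMatMul::from_qtensor(qweight)?;
    loop {
        qmatmul.forward(&input)?;
    }
}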

In the full code this is minified from, I see memory usage increase from 15 GB to >100 GB when feeding the same-sized input to a BERT model many times in a row.

ealmloff avatar Jun 17 '24 18:06 ealmloff

Thanks for reporting this and providing this short repro - I can reproduce the issue on my M2. I'll have a more in-depth look (though it will most likely have to wait for a week or two).

LaurentMazare avatar Jun 22 '24 21:06 LaurentMazare

Looks like the issue is related to the autorelease pool not releasing memory fast enough. I'm not very familiar with this, but the following code seems to keep the memory usage under control (it might still be drifting, but a lot slower than before). I'll have to think a bit more about how to handle this in candle.

use anyhow::Result;
use candle_core::{backend::BackendDevice, Device, MetalDevice, Tensor};

fn main() -> Result<()> {
    let device = Device::Metal(MetalDevice::new(0)?);
    let first = Tensor::randn(0f32, 1.0, (784, 100), &device)?;
    let second = Tensor::randn(0f32, 1.0, (100, 10), &device)?;
    // Wrap each matmul in an autorelease pool so the Metal objects it creates
    // are released every iteration rather than piling up.
    for _ in 0.. {
        objc::rc::autoreleasepool(|| {
            first.matmul(&second).unwrap();
        })
    }
    Ok(())
}

[edit] A slightly better example, where we see the memory drifting and then getting back to the proper range once the autorelease pool exits:

    loop {
        println!("here");
        objc::rc::autoreleasepool(|| {
            // Memory drifts upwards during this inner loop and drops back to
            // the normal range when the autorelease pool exits.
            for _ in 0..1_000_000 {
                first.matmul(&second).unwrap();
            }
        })
    }

LaurentMazare avatar Jun 22 '24 22:06 LaurentMazare

Thanks for the workaround. I can confirm adding an autoreleasepool for batches in my BERT code does fix the memory leak. Running my workload overnight, I don't see any meaningful memory usage increase.
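
For anyone hitting the same thing, the pattern looks roughly like this (a sketch, not the actual code; `embed_batch` is a placeholder for the real BERT forward pass):

use candle_core::{Result, Tensor};
use objc::rc::autoreleasepool;

// Placeholder for the real BERT forward pass; any per-batch Metal work goes here.
fn embed_batch(batch: &Tensor) -> Result<Tensor> {
    batch.matmul(&batch.t()?)
}

// Run every batch inside its own autorelease pool so the Metal objects created
// during the forward pass are released per batch instead of accumulating.
fn embed_all(batches: &[Tensor]) -> Result<Vec<Tensor>> {
    batches
        .iter()
        .map(|batch| autoreleasepool(|| embed_batch(batch)))
        .collect()
}

fn main() -> Result<()> {
    let device = candle_core::Device::new_metal(0)?;
    let batches = (0..8)
        .map(|_| Tensor::randn(0f32, 1.0, (32, 384), &device))
        .collect::<Result<Vec<_>>>()?;
    embed_all(&batches)?;
    Ok(())
}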

ealmloff avatar Jun 23 '24 17:06 ealmloff

I am also facing multiple memory leaks on macOS only (it does not happen on Linux or Windows, or when I switch to a cloud model). In my use case I'm running Whisper, based on the examples.

[Screenshot 2024-09-01 at 15:09:36]

code here: https://github.com/mediar-ai/screenpipe/blob/main/screenpipe-audio/src/stt.rs

I'll try the autoreleasepool trick now and share whether it makes things better.

Update: at first glance it seems much better with the autoreleasepool (I just spent 2 days optimising the wrong thing 🤦‍♂️).

Update 2: I also noticed Xcode Instruments reports 100k+ leaks when I start my CLI, which seems to be related to loading Whisper, although I don't know whether it's bad usage on my part (there are 1-2 unsafe blocks that seemed necessary from the example) or an issue in candle.

[Screenshot 2024-09-01 at 16:37:15] (I can't check the details because of "call stack limited"; ignore the CoreAudio leak, that's a tiny memory leak I noticed in the `cpal` lib.)

louis030195 avatar Sep 01 '24 22:09 louis030195