candle
Metal memory leak multiplying matrices
Running this code, which multiplies a 784x100 matrix by a 100x10 matrix, seems to leak memory. The memory usage gradually increases to more than 5 GB when running with the metal feature enabled in release mode on commit 2b10aaa:
use anyhow::Result;
use candle_core::{backend::BackendDevice, Device, MetalDevice, Tensor};

fn main() -> Result<()> {
    let device = Device::Metal(MetalDevice::new(0)?);
    let first = Tensor::randn(0f32, 1.0, (784, 100), &device)?;
    let second = Tensor::randn(0f32, 1.0, (100, 10), &device)?;
    loop {
        first.matmul(&second)?;
    }
}
On the CPU, memory usage stays steady at ~2 MB. This also seems to affect quantized matrix multiplication.
In the full code this is minified from, I see memory usage increase from 15 GB to more than 100 GB when feeding the same-sized input to a BERT model many times in a row.
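For completeness, here is a minimal sketch of a quantized variant of the same repro. This is not from the original report; the QTensor::quantize, QMatMul::from_qtensor, and GgmlDType calls assume candle's current quantized API (which may differ between versions), and the shapes are chosen so the last dimension is a multiple of the quantization block size.

use anyhow::Result;
use candle_core::quantized::{GgmlDType, QMatMul, QTensor};
use candle_core::{backend::BackendDevice, Device, MetalDevice, Tensor};

fn main() -> Result<()> {
    let device = Device::Metal(MetalDevice::new(0)?);
    // Quantize a weight matrix once, then run the quantized matmul in a loop.
    let weight = Tensor::randn(0f32, 1.0, (784, 256), &device)?;
    let qweight = QTensor::quantize(&weight, GgmlDType::Q4_0)?;
    let qmatmul = QMatMul::from_qtensor(qweight)?;
    let input = Tensor::randn(0f32, 1.0, (1, 256), &device)?;
    loop {
        // Memory usage grows here as well when the metal feature is enabled.
        qmatmul.forward(&input)?;
    }
}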
Thanks for reporting this and providing this short repro - I can reproduce the issue on my M2, I'll have a more in-depth look (though it will most likely have to wait for a week or two).
Looks like the issue is related to the autorelease pool not releasing memory fast enough. I'm not very familiar with this, but the following code seems to keep the memory usage under control (it might still be drifting, but a lot slower than before). I'll have to think a bit more about how to handle this in candle.
use anyhow::Result;
use candle_core::{backend::BackendDevice, Device, MetalDevice, Tensor};

fn main() -> Result<()> {
    let device = Device::Metal(MetalDevice::new(0)?);
    let first = Tensor::randn(0f32, 1.0, (784, 100), &device)?;
    let second = Tensor::randn(0f32, 1.0, (100, 10), &device)?;
    for _i in 0.. {
        objc::rc::autoreleasepool(|| {
            first.matmul(&second).unwrap();
        })
    }
    Ok(())
}
[edit] A slightly better example, where we see the memory drifting and then getting back to the proper range once the autorelease pool exits:
loop {
    println!("here");
    objc::rc::autoreleasepool(|| {
        for _i in 0..1_000_000 {
            first.matmul(&second).unwrap();
        }
    })
}
Thanks for the workaround. I can confirm that adding an autoreleasepool around each batch in my BERT code does fix the memory leak. After running my workload overnight, I don't see any meaningful memory usage increase.
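In case it helps others hitting the same thing: the pattern is simply to scope each batch's Metal work inside objc::rc::autoreleasepool, as in the workaround above. Below is a minimal, self-contained sketch; the weight matrix is a stand-in for a real model's forward pass, and it assumes the objc crate is added as a direct dependency.

use anyhow::Result;
use candle_core::{backend::BackendDevice, Device, MetalDevice, Tensor};

fn main() -> Result<()> {
    let device = Device::Metal(MetalDevice::new(0)?);
    // Stand-in for a real model: a single weight matrix applied to each batch.
    let weight = Tensor::randn(0f32, 1.0, (100, 10), &device)?;
    for _batch in 0..10_000 {
        // Scope each batch inside an autorelease pool so the Objective-C
        // temporaries created by the Metal kernels are drained every iteration.
        objc::rc::autoreleasepool(|| -> Result<()> {
            let input = Tensor::randn(0f32, 1.0, (784, 100), &device)?;
            let output = input.matmul(&weight)?;
            // ... consume `output` here, before the pool is drained ...
            let _ = output;
            Ok(())
        })?;
    }
    Ok(())
}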
I am also facing multiple memory leaks on macOS only (they do not happen on Linux or Windows, or when I switch to a cloud model); in my use case I'm using Whisper, based on the examples.
code here: https://github.com/mediar-ai/screenpipe/blob/main/screenpipe-audio/src/stt.rs
I'll try the autorelease trick now and share whether it makes things better.
Update: at first glance it seems much better with the autoreleasepool (I just spent two days optimising the wrong thing 🤦♂️).
Update 2: I also noticed that Xcode Instruments reports 100k+ leaks when I start my CLI, which seems to be related to loading Whisper, although I don't know whether it's bad usage on my part (there are one or two unsafe blocks that seemed necessary from the example) or an issue in candle.