
Memory leak in Wgpu backend

Open joshhansen opened this issue 1 year ago • 8 comments

Describe the bug: Memory leaks when using the Wgpu backend, but not when using the NdArray backend.

To Reproduce: A minimal reproduction is available at https://github.com/joshhansen/burnleak.

Check out the repository and run cargo run --release on the master branch. Watch the memory usage; in my case, it climbs steadily upward at a rapid pace.

Then check out the master_ndarray branch and run it again. In my case, the memory usage does not climb.

Running on Wgpu but with the WgpuDevice::Cpu device slows but does not eliminate the memory leakage.
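
For reference, the reproduction amounts to roughly the following shell session (repository URL and branch names taken from the description above; the exact memory numbers will vary by machine):

git clone https://github.com/joshhansen/burnleak
cd burnleak
cargo run --release               # master branch: Wgpu backend, resident memory climbs steadily
git checkout master_ndarray
cargo run --release               # NdArray backend: memory stays flat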

Expected behavior: The program should run without a memory leak on the Wgpu backend, like it does on the NdArray backend.

Screenshots: n/a

Desktop:

  • OS: Linux Mint 21.3, kernel 6.5.0 x86_64
  • Burn: 0.13.2
  • Rust: 1.76.0
  • Nvidia driver 545.29.06-0ubuntu0.22.04.2

Additional context: Heaptrack memory leak profiles attached (heaptrack1, heaptrack2).

joshhansen avatar Jul 20 '24 04:07 joshhansen

I'm experiencing the same leak since upgrading from burn 0.12 to burn 0.13.

I tried your repro repo with burn 0.12 as a dependency and there is no aggressive leak, so it looks like a regression.


This is what I saw from the memory monitoring of my personal hobby project (memory usage graph attached).

kurtlawrence avatar Jul 20 '24 05:07 kurtlawrence

Thanks for the quick repro @kurtlawrence. I can confirm that downgrading to 0.12.1 results in much better memory usage. There are still leaks coming from wgpu, but they are smaller by an order of magnitude: 82.6 MB of leaks on 1.4 GB of heaptrack samples, versus 2.1 GB of leaks on 626 MB of heaptrack samples.
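
For anyone trying to reproduce these measurements, a heaptrack run typically looks roughly like this (binary name taken from the repro repository; the recorded data file name and extension depend on your heaptrack version):

cargo build --release
heaptrack ./target/release/burnleak       # record allocations while the leak grows
heaptrack_print heaptrack.burnleak.*.zst  # print the leak summary (or open the file in heaptrack_gui)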

joshhansen avatar Jul 20 '24 20:07 joshhansen

I've been working on the memory management strategy currently implemented in the wgpu runtime on the master branch. The current approach results in higher average memory usage, which is intentional. The new strategy is designed to be lazier in freeing unused memory and more aggressive in reusing it. For dynamic memory workloads, this can lead to performance improvements of up to 60%.

Our assumption is that for most deep learning use cases, the GPU device will be primarily dedicated to the training or inference tasks being performed. Therefore, we've prioritized better performance at the cost of higher average memory usage. While we don't believe this strategy leads to significantly higher peak memory usage or more frequent out-of-memory situations, we recognize that this could be a potential issue.

If average memory usage is a concern, we could consider adding an option for users to configure the memory management behavior in the wgpu runtime.
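
To illustrate the trade-off being described, here is a hypothetical toy version of a lazy-free, aggressive-reuse pool (not Burn's actual allocator): freed buffers are kept for reuse instead of being returned to the system, which raises average resident memory by design and can resemble a leak if the requested sizes keep changing.

// Hypothetical sketch of a "lazy free, aggressive reuse" pool; not Burn's real implementation.
use std::collections::HashMap;

#[derive(Default)]
struct ReusePool {
    // Freed buffers are kept here, bucketed by size, instead of being released.
    free: HashMap<usize, Vec<Vec<u8>>>,
}

impl ReusePool {
    // Prefer handing back a previously freed buffer of the requested size.
    fn alloc(&mut self, size: usize) -> Vec<u8> {
        self.free
            .get_mut(&size)
            .and_then(|bucket| bucket.pop())
            .unwrap_or_else(|| vec![0u8; size])
    }

    // "Freeing" only returns the buffer to the pool, so resident memory never shrinks.
    fn release(&mut self, buf: Vec<u8>) {
        self.free.entry(buf.len()).or_default().push(buf);
    }
}

With repeated allocation sizes this reuses memory well; with constantly changing sizes the pool only grows unless it is trimmed or made configurable.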

mepatrick73 avatar Jul 23 '24 15:07 mepatrick73

@kurtlawrence Is it CPU RAM leakage or GPU memory?

nathanielsimard avatar Jul 24 '24 13:07 nathanielsimard

@kurtlawrence Is it CPU RAM leakage or GPU memory?

CPU RAM leaking

kurtlawrence avatar Jul 25 '24 06:07 kurtlawrence

@mepatrick73 The issue is not the level of memory usage; it's that memory usage grows without bound. Eventually it would consume all system memory if I didn't terminate the program. I have 128 GB of system RAM and 4 GPUs with 48 GB VRAM. The task in my case is inference on a small model; each instance consumes 28 MiB of VRAM according to nvidia-smi.

joshhansen avatar Jul 26 '24 07:07 joshhansen

I am experiencing a similar issue.

The way to reproduce it is as simple as:

type MyBackend = burn::backend::wgpu::JitBackend<cubecl::wgpu::WgpuRuntime, f32, i32, u32>;

let energies = Tensor::<MyBackend, 2>::random([3000, 2], burn::tensor::Distribution::Default, &device);

loop {
    // Clone so the source tensor can be reused on the next iteration.
    let summed = energies.clone().sum_dim(0);
    let data = summed.into_data();

    println!("dbg {}", data.as_slice::<f32>().unwrap().len());
}

Reproduces on 0.16.1

0x7CFE avatar Apr 04 '25 15:04 0x7CFE

I am also facing this in my repo https://github.com/utahrobotics/gym-burn. Just run

cargo run -p proximo --release -- train

after moving all the files in model-configs/handwritten into the root directory. You will also need a database, which I can send if needed. I added dhat, but I can't tell what fixes I can make on my end. Both invocations are very similar, although dhat-heap1.json was run for a shorter amount of time; a minimal sketch of the dhat wiring is shown below the attached profiles.

dhat-heap.json dhat-heap1.json
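
For context, wiring dhat heap profiling into a Rust binary typically looks like the following (this follows the dhat crate's documented pattern and is not necessarily gym-burn's exact setup):

// Route all heap allocations through dhat's tracking allocator.
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    // Heap profiling runs until this guard is dropped, then dhat writes dhat-heap.json.
    let _profiler = dhat::Profiler::new_heap();

    // ... run the training / inference workload here ...
}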

manglemix avatar Oct 10 '25 22:10 manglemix