
How to avoid memory illegal access over time (memory leak)?

Open ShnitzelKiller opened this issue 1 year ago • 9 comments

Summary

Generating and rendering many thousands of scenes in succession eventually leads to a segfault, seemingly due to a memory leak.

System configuration

Rocky Linux release 9.1 (64-bit)

System information:

  CPU: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
  GPU: NVIDIA TITAN Xp
       NVIDIA TITAN Xp
       NVIDIA TITAN Xp
       NVIDIA TITAN Xp
  Python: 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:20:04) [GCC 11.3.0]
  NVidia driver: 515.65.01
  LLVM: -1.-1.-1

  Dr.Jit: 0.4.1
  Mitsuba: 3.2.1
     Is custom build? False
     Compiled with: GNU 10.2.1
     Variants:
        scalar_rgb
        scalar_spectral
        cuda_ad_rgb
        llvm_ad_rgb

Description

I am generating many scenes and rendering them to images in my Python code. The code works fine, but after a while I inevitably get the error:

Critical Dr.Jit compiler failure: cuda_check(): API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/ext/drjit-core/src/init.cpp:448.
Aborted (core dumped)

I assume this is due to a memory leak, since I always get through roughly the same number of scenes before the error reappears and I have to restart from scratch.

My question is: is there currently something I can do, such as following best practices, to prevent this from happening? Some way of explicitly clearing a cache that fills up over time, or some other resource that Python garbage collection does not handle in the case of Mitsuba?
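For instance, would something along these lines be expected to help? This is only a sketch of what I have in mind; I am assuming a Dr.Jit helper such as dr.flush_malloc_cache() is the relevant knob here.

import gc
import drjit as dr
import mitsuba as mi

mi.set_variant('cuda_ad_rgb')

for i in range(100000):
    scene = mi.load_dict(mi.cornell_box())  # stand-in for my generated scenes
    img = mi.render(scene)
    del scene, img

    # Periodically drop Python references and (assumed) Dr.Jit allocation caches
    if i % 500 == 0:
        gc.collect()
        dr.flush_malloc_cache()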

Steps to reproduce

In a loop, call a subroutine that loads and renders a simple scene; after thousands of iterations it will eventually hit a segfault.

ShnitzelKiller avatar Apr 28 '23 06:04 ShnitzelKiller

Hi @ShnitzelKiller,

I've encountered the same problem. Have you figured out how to tackle this?

shallowtoil avatar May 23 '23 12:05 shallowtoil

Hi!

We have encountered this ourselves in recent experiments. This is most likely a bug deep within Dr.Jit. We have found that setting the CUDA_LAUNCH_BLOCKING environment variable seems to drastically reduce the likelihood of this segfault.
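For reference, the variable can be exported in the shell or set at the top of the script, as long as it is in place before Dr.Jit initializes CUDA -- roughly:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before the first CUDA call

import mitsuba as mi
mi.set_variant('cuda_ad_rgb')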

@ShnitzelKiller @shallowtoil could you elaborate on your specific setups?

Something super basic like:

import mitsuba as mi
mi.set_variant('cuda_ad_rgb')

for i in range(10000):
    mi.render(mi.load_dict(mi.cornell_box()))

is not enough; we only seem to come across it when optimizing something.

njroussel avatar May 24 '23 11:05 njroussel

@njroussel I have distilled a small runnable example out of my codebase that should hopefully demonstrate the issue: https://github.com/ShnitzelKiller/Mitsuba-test/tree/main

ShnitzelKiller avatar May 24 '23 22:05 ShnitzelKiller

Hi @ShnitzelKiller,

I use Mitsuba for large-scale optimizations in media. I had the same problems with frequent OptiX and CUDA crashes when using the cuda_ad variants, but not with llvm_ad. What seems to help in my case is compiling Mitsuba with Clang instead of GCC (e.g., GNU 10.2.1 as in your case):

System information:

  OS: Ubuntu 20.04.6 LTS
  CPU: Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz
  GPU: Quadro RTX 8000
  Python: 3.8.10 (default, Mar 13 2023, 10:26:41) [GCC 9.4.0]
  NVidia driver: 515.105.01
  CUDA: 10.1.243
  LLVM: 12.0.0

  Dr.Jit: 0.4.2
  Mitsuba: 3.3.0
     Is custom build? True
     Compiled with: Clang 10.0.0
     Variants:
        scalar_rgb
        llvm_rgb
        cuda_ad_rgb
        llvm_ad_rgb

I see that you have a non-custom build that was not compiled with Clang (that's very unfortunate). But you can have a look here at how to compile Mitsuba with Clang, which -- as I said -- helped to get rid of all those errors after many iterations in my case.

dnakath avatar May 31 '23 14:05 dnakath

Hi there,

After one night of optimization, I have to state that using Clang only seems to mitigate the problem, but unfortunately does not entirely fix it.

dnakath avatar Jun 01 '23 14:06 dnakath

Hello @dnakath and @ShnitzelKiller,

Thank you for reporting this crash and your findings. Could you please try running your test again with the latest Mitsuba master, which includes a fix for a similar-sounding bug? (https://github.com/mitsuba-renderer/drjit-core/pull/78)

merlinND avatar Feb 02 '24 14:02 merlinND

Hi @merlinND,

Great, I think this will ease a lot of pain! I just pulled and tested the new master, and it feels like your patch works with:

  • cuda_ad_rgb
  • Ubuntu 22.04
  • NVIDIA driver 525.147.05, CUDA 12.0
  • NVIDIA GeForce RTX 4090

I did a run with 5000 iterations lasting 3.5 h, during which I would normally encounter ~125 crashes -- now it ran through. (I had implemented a dump/load state option in Adam so that I could resume optimization after crashes; otherwise llvm_ad would have been the only remaining option.)
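A simplified sketch of that dump/load idea is below; it only checkpoints the current parameter values (not Adam's internal moments), and it assumes the Dr.Jit parameter types convert to and from NumPy arrays:

import pickle
import numpy as np

def dump_state(opt, path):
    # Save the optimizer's current parameter values as plain NumPy arrays
    with open(path, 'wb') as f:
        pickle.dump({k: np.array(opt[k]) for k in opt.keys()}, f)

def load_state(opt, path):
    # Restore the saved values into an existing optimizer after a restart
    with open(path, 'rb') as f:
        data = pickle.load(f)
    for k, v in data.items():
        opt[k] = type(opt[k])(v)

Around the optimization loop I then call dump_state(opt, 'ckpt.pkl') every few iterations and load_state(opt, 'ckpt.pkl') once after a crash before continuing.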

I will do further testing on Monday and report back if something comes up.

Some remaining questions:

  • Is it possible to edit the readthedocs.io documentation through this repo? Then I could add the Ubuntu 22.04 installation commands.
  • Would you be interested in the optimizer.dump() / optimizer.load() functionality?

dnakath avatar Feb 03 '24 15:02 dnakath

After some days, the crashes still seem to be gone.

I will try to run another demanding example in the next days/weeks.

dnakath avatar Feb 08 '24 12:02 dnakath

That's great to hear, thank you @dnakath!

I don't know about optimizer.dump() / load(), but as for the docs: they are part of this repo, so you can send a pull request with suggested updates.

merlinND avatar Feb 11 '24 21:02 merlinND