mitsuba3
How to avoid illegal memory access over time (memory leak)?
Summary
Generating and rendering many thousands of scenes in succession eventually leads to a segfault, seemingly due to a memory leak.
System configuration
Rocky Linux release 9.1 (64-bit)
System information:
CPU: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
GPU: NVIDIA TITAN Xp
NVIDIA TITAN Xp
NVIDIA TITAN Xp
NVIDIA TITAN Xp
Python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
NVidia driver: 515.65.01
LLVM: -1.-1.-1
Dr.Jit: 0.4.1
Mitsuba: 3.2.1
Is custom build? False
Compiled with: GNU 10.2.1
Variants:
scalar_rgb
scalar_spectral
cuda_ad_rgb
llvm_ad_rgb
Description
I am generating many scenes and rendering them to images in my Python code. The code works fine, but after a while I inevitably get the error:
Critical Dr.Jit compiler failure: cuda_check(): API error 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/ext/drjit-core/src/init.cpp:448.
Aborted (core dumped)
I assume this is due to a memory leak, since I always get through about the same number of scenes before the error happens and I have to start over again.
My question is: is there currently something I can do, such as best practices, to prevent this from happening? Some way of explicitly clearing a cache that fills up over time, or some other thing that Python's garbage collection does not handle in the case of Mitsuba?
Steps to reproduce
In a loop, call a subroutine that loads a simple scene and renders it; after thousands of iterations it will eventually hit the segfault.
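A minimal sketch of what such a loop looks like in my case (assuming the cuda_ad_rgb variant); the gc.collect() / dr.flush_malloc_cache() calls at the end of each iteration are only a guess at the kind of explicit cache clearing I asked about above, not something I know to be the intended mechanism:
import gc
import drjit as dr
import mitsuba as mi

mi.set_variant('cuda_ad_rgb')

for i in range(100000):
    # Build a fresh scene every iteration (stand-in for my scene generator)
    scene = mi.load_dict(mi.cornell_box())
    img = mi.render(scene, spp=16)
    mi.util.write_bitmap(f'out_{i:06d}.png', img)

    # Attempted cleanup between iterations:
    del scene, img
    gc.collect()               # drop Python-side references
    dr.flush_malloc_cache()    # release Dr.Jit's cached device allocations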
Hi @ShnitzelKiller,
I've encountered the same problem. Have you figured out how to tackle this?
Hi!
We have encountered this ourselves in recent experiments. This is most likely a bug deep within Dr.Jit. We have found that setting the CUDA_LAUNCH_BLOCKING environment variable seems to drastically reduce the likelihood of this segfault happening.
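For example, one way to set it from Python before Mitsuba initializes CUDA (exporting it in the shell before launching the script works just as well):
import os
# Must be set before the first CUDA kernel launch, i.e. before Mitsuba starts rendering
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import mitsuba as mi
mi.set_variant('cuda_ad_rgb')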
@ShnitzelKiller @shallowtoil could you elaborate on your specific setups?
Something super basic like:
import mitsuba as mi
mi.set_variant('cuda_ad_rgb')
for i in range(10000):
    mi.render(mi.load_dict(mi.cornell_box()))
is not enough; we only seem to come across it when optimizing something.
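To be concrete, by "optimizing something" I mean a loop roughly along the lines of the standard inverse-rendering setup sketched below; the scene, the parameter key and the loss are only placeholders, not the exact experiments where we saw the crash:
import drjit as dr
import mitsuba as mi

mi.set_variant('cuda_ad_rgb')

scene = mi.load_dict(mi.cornell_box())
params = mi.traverse(scene)
key = 'red.reflectance.value'          # placeholder parameter to optimize

img_ref = mi.render(scene, spp=128)    # reference image

opt = mi.ad.Adam(lr=0.05)
opt[key] = mi.Color3f(0.01, 0.2, 0.9)  # perturbed initial value
params.update(opt)

for it in range(10000):
    img = mi.render(scene, params, spp=4)
    loss = dr.mean(dr.sqr(img - img_ref))  # simple L2 image loss
    dr.backward(loss)
    opt.step()
    params.update(opt)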
@njroussel I have distilled a small runnable example out of my codebase that should hopefully demonstrate the issue: https://github.com/ShnitzelKiller/Mitsuba-test/tree/main
Hi @ShnitzelKiller,
I use Mitsuba for large-scale optimizations in media. I had the same problems with frequent OptiX and CUDA crashes when using the cuda_ad variants, but not with llvm_ad. What seems to help in my case is to compile Mitsuba with Clang instead of, e.g., GNU 10.2.1 as in your case:
System information:
OS: Ubuntu 20.04.6 LTS
CPU: Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz
GPU: Quadro RTX 8000
Python: 3.8.10 (default, Mar 13 2023, 10:26:41) [GCC 9.4.0]
NVidia driver: 515.105.01
CUDA: 10.1.243
LLVM: 12.0.0
Dr.Jit: 0.4.2
Mitsuba: 3.3.0
Is custom build? True
Compiled with: Clang 10.0.0
Variants:
scalar_rgb
llvm_rgb
cuda_ad_rgb
llvm_ad_rgb
I see that you have a non-custom build that was not built with Clang (that's unfortunate). But you can have a look here at how to compile Mitsuba with Clang, which, as I said, helped to get rid of all those errors after many iterations in my case.
Hi there,
After one night of optimization, I have to state that using Clang only seems to mitigate the problem; unfortunately it does not entirely fix it.
Hello @dnakath and @ShnitzelKiller,
Thank you for reporting this crash and your findings.
Could you please try running your test again with the latest Mitsuba master, which includes a fix for a similar-sounding bug? (https://github.com/mitsuba-renderer/drjit-core/pull/78)
Hi @merlinND,
Great, I think this will ease a lot of pain! I just pulled and tested the new master, and it feels like your patch works with:
- cuda_ad_rgb
- Ubuntu 22.04
- NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0
- NVIDIA GeForce RTX 4090
I did a run with 5000 iterations lasting 3.5 h, during which I would normally have encountered roughly 125 crashes; this time it ran through. (I implemented a dump-and-load state option in Adam so that I could resume the optimization in case of crashes; otherwise llvm_ad would have been the only remaining option.)
I will do further testing on Monday and report back if something comes up.
Some remaining questions:
- is it possible to edit the readthedocs.io documentation through this repo? Then I could add the Ubuntu 22.04 installation commands
- would you be interested in the optimizer.dump() / optimizer.load() functionality? (rough sketch below)
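For reference, a rough sketch of what I mean by dump/load, using hypothetical helper names; it only checkpoints the parameter values themselves, and an exact resume would additionally need Adam's internal moment estimates, which depends on the optimizer internals:
import numpy as np

def dump_optimizer(opt, keys, path):
    # Store the current parameter values as plain NumPy arrays
    np.savez(path, **{k: np.array(opt[k]) for k in keys})

def load_optimizer(opt, keys, path):
    # Restore previously dumped values, rebuilding the original Dr.Jit types
    data = np.load(path)
    for k in keys:
        opt[k] = type(opt[k])(data[k])

In the optimization loop one would call dump_optimizer(opt, [key], 'ckpt.npz') every few iterations, and after a crash call load_optimizer(...) followed by params.update(opt) before resuming.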
After some days, the crashes still seem to be gone.
I will try to run another, more demanding example in the coming days/weeks.
That's great to hear, thank you @dnakath!
I don't know about optimizer.dump() / load(), but about the docs: they are part of this repo, so you can send a pull request with suggested updates.