Getting a strange assertion error
Hi, I'm reopening this issue from https://github.com/mitsuba-renderer/mitsuba3/issues/190 and https://github.com/mitsuba-renderer/drjit-core/issues/63.
Summary
Hello @merlinND, @wjakob, following up from https://github.com/mitsuba-renderer/drjit-core/issues/63: I am now consistently running into this issue, though at random times during execution.
The error is as follows:

```
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
```
More detail below.
System configuration
System information:

- OS: Ubuntu 22.04.3 LTS
- CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
- GPU: 8× NVIDIA TITAN RTX
- Python: 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0]
- NVidia driver: 530.30.02
- LLVM: 10.0.1
- Dr.Jit: 0.4.1
- Mitsuba: 3.2.1
- Is custom build? False
- Compiled with: GNU 10.2.1
- Variants: scalar_rgb scalar_spectral cuda_ad_rgb llvm_ad_rgb
Description
I have a particular job that completes successfully on my MacBook Pro. However, that environment is not computationally scalable enough for my tasks, and I can only achieve low-resolution results there. So I am trying to run on EC2 instances and other Linux machines with 50+ CPUs, and this is where I see the issue.
I have a conda environment where I installed mitsuba3 via pip. I tried these versions:

- mitsuba 3.3.0 with drjit 0.4.2
- mitsuba 3.2.1 with drjit 0.4.1
I have the LLVM v15 and v10 shared libraries installed (via conda install), and set the path accordingly:

```
export DRJIT_LIBLLVM_PATH= .... /miniconda3/envs/mi/lib/libLLVM-15.so
export DRJIT_LIBLLVM_PATH= .... /miniconda3/envs/mi/lib/libLLVM-10.so
```
(I also installed LLVM via apt-get, with no difference, and tried building LLVM from source, also with no difference.)
All of the above configurations result in this error:

```
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
```
It occurs at random times during execution -- sometimes after a few minutes, sometimes after an hour. There is no stack trace or any other debugging info.
Any ideas where to start?
Steps to reproduce
I can share my repo (including clear instructions to reproduce in the README) with you -- let me know if you would like access.
Hi @sagesimhon
Yes, we'd appreciate the reproducer. Tracking down such an issue is painful, and as you've seen, this bug is rather elusive. Most likely there is some underlying race condition at play here.
If you want to take a stab at it yourself, I'd recommend compiling the project yourself in Debug mode. This assertion is in our thread pool/worker management system; one of the threads should have launched some asynchronous job from drjit.
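For reference, a debug build of Mitsuba 3 (which vendors drjit-core and nanothread under ext/) might look roughly like the sketch below; the exact CMake invocation depends on your environment and the project's build documentation:

```shell
# Fetch Mitsuba 3 together with its submodules (drjit-core and nanothread
# live under ext/ and are compiled as part of the main project).
git clone --recursive https://github.com/mitsuba-renderer/mitsuba3.git
cd mitsuba3

# Configure and compile in Debug mode so that assertions such as the one
# in queue.cpp fire with debug symbols available for a backtrace.
cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build --parallel
```

Running the reproducer under a debugger (e.g. gdb) against such a build should then yield a usable stack trace when the assertion trips.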
Hi @njroussel,
Just added you to my git repo https://github.com/sagesimhon/totem_plus. To reproduce with minimal settings, follow the instructions in the "Dependencies" section of the README, then run:

```
python run_generation.py --res 256 --exp_folder 'test_minimal_run_reproducer' --n 999
```
The code is one large for loop. The assertion error comes up at random times; in my last run it appeared at iteration 947, after three minutes. Here is the last thing printed before the error:
```
Trying iter 947, file 0000947
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
Aborted (core dumped)
```
This error is seen on a Linux machine with many CPUs (>48), though it may also occur with lower CPU counts. I do not run into it on a Mac.
UPDATE: I believe I have isolated the issue; it may be coming from the mi.load_file() method, i.e. the XML loading method. My process loads many XML files, and this function may be problematic, potentially running out of threads, file pointers, or something else if they are not closed properly -- just guessing at the root cause.
I have ported all my code to use dict-based scene construction instead of XML -- and am still getting the error :( @njroussel Is there anything else you need for the reproducer?
I haven't been able to reproduce it yet... You might want to turn off parallel scene loading if you think that is the issue here (documentation).
Thanks, I will try it; hopefully it will not degrade performance too much. The problem definitely seems to come from the load_dict() method.
Hello, in my situation it seems that mi.util.write_bitmap() also causes this error. In my code, this function is called many times. Is this error also due to a lack of file handles?
Something in Dr.Jit's asynchronous job manager is going awry. The write_bitmap() function writes files asynchronously by default, but you can change that, so it might well be related.
Hi all - writing to say that I have also been experiencing this issue. I can't share my reproducer (~200 lines) publicly but am happy to email with someone on the team if that would be helpful.
I have a similar issue. I create a room with some furniture in Blender, export this room with Mitsuba to a .xml file, and load this file in Sionna with load_scene(...). Since I want to get the channel impulse response at many different locations in the room, I run a for loop over an array of receiver positions (rx_pos), and re-set the position of the receiver (rx) in the scene in each iteration as follows:
```python
for i_c, i_rx_pos in enumerate(rx_pos):
    rx.position = i_rx_pos  # set position of the receiver
    paths = scene.compute_paths(max_depth=3)
```
After a random number of iterations (typically somewhere between 1000 and 4000), the process terminates with the following output:

```
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
```
I use only CPUs (no GPU) on Ubuntu; "lsb_release -a" yields:

```
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal
```

- Python: 3.9.17 (main, Jul 5 2023, 20:41:20) [GCC 11.2.0]
- Mitsuba: 3.4.0
- Dr.Jit: 0.4.3
Hi @sagesimhon -- I just wanted to access https://github.com/sagesimhon/totem_plus but cannot. Is it a private repository?