
Getting strange assertion error

Open sagesimhon opened this issue 2 years ago • 11 comments

Hi, I'm reopening this issue from https://github.com/mitsuba-renderer/mitsuba3/issues/190 and https://github.com/mitsuba-renderer/drjit-core/issues/63.

Summary

Hello @merlinND @wjakob, following up from https://github.com/mitsuba-renderer/drjit-core/issues/63: I am now consistently running into this issue, but at random times during execution.

The error is as follows: Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1

More detail below.

System configuration

System information:

OS: Ubuntu 22.04.3 LTS
CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
GPU: NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
NVIDIA TITAN RTX
Python: 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0]
NVidia driver: 530.30.02
LLVM: 10.0.1

Dr.Jit: 0.4.1
Mitsuba: 3.2.1
Is custom build? False
Compiled with: GNU 10.2.1
Variants: scalar_rgb scalar_spectral cuda_ad_rgb llvm_ad_rgb

Description

I have a particular job that completes successfully on my MacBook Pro. However, that environment does not scale computationally for my tasks, and I can only achieve low-resolution results there. So I am trying to run on EC2 instances and other Linux machines with 50+ CPUs. This is where I see the issue.

I have a conda environment where I installed mitsuba3 using pip. I tried these versions:

mitsuba: 3.30 drjit: 0.4.2

and

mitsuba-3.2.1 drjit-0.4.1

I have LLVM v15 and v10 .so installed (via conda install), and set the path accordingly:

export DRJIT_LIBLLVM_PATH= .... /miniconda3/envs/mi/lib/libLLVM-15.so
export DRJIT_LIBLLVM_PATH= .... /miniconda3/envs/mi/lib/libLLVM-10.so

(I also installed LLVM using apt-get, no diff; I also tried to build LLVM from source, no diff)

All of the above configurations result in this error: Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1

It occurs at random times during execution -- sometimes after a few minutes, sometimes after an hour. No stack trace or any debugging info.

Any ideas where to start?

Steps to reproduce

I can share my repo (including clear instructions to reproduce in the readme) with you - let me know if you would like

sagesimhon avatar Aug 11 '23 05:08 sagesimhon

Hi @sagesimhon

Yes, we'd appreciate the reproducer. Tracking such an issue down is painful. As you've seen, this bug is a bit elusive. There most likely is some underlying race condition at play here.

If you want to give it an attempt yourself, I'd recommend compiling the project yourself and in DEBUG mode. This assertion is in our thread pool/worker management system. One of the threads should have launched some asynchronous job from drjit.
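
As a lighter first step before a full debug build, Python's standard `faulthandler` module can at least print the Python-side traceback when the process receives SIGABRT. A minimal sketch (the helper name `enable_crash_tracebacks` and the log path are made up for illustration; it only shows Python frames, not the C++ worker thread that hit the assertion):

```python
import faulthandler


def enable_crash_tracebacks(log_path):
    """Write the Python traceback of all threads to `log_path` on a fatal signal.

    Hypothetical helper for illustration. faulthandler installs handlers for
    fatal signals such as SIGSEGV and SIGABRT; a failed C++ assertion that
    calls abort() raises SIGABRT.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    # The caller must keep this file object alive for as long as the
    # handler is enabled, otherwise the file descriptor is closed.
    return log_file


# Usage, at the very top of the script that runs the long loop:
# crash_log = enable_crash_tracebacks("crash_traceback.log")
```

This will not locate the race itself, but it should show which Python call (scene loading, rendering, bitmap writing, ...) was active when the abort happened.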

njroussel avatar Aug 13 '23 00:08 njroussel

Hi @njroussel,

Just added you to my git repo https://github.com/sagesimhon/totem_plus To reproduce with minimal settings,

Follow the instructions in the "Dependencies" section of the README, then run python run_generation.py --res 256 --exp_folder 'test_minimal_run_reproducer' --n 999. The code is one large for loop. The assertion error comes up at random times; in my last run it came at iteration 947, after three minutes. Here is the last thing printed before getting the error:

Trying iter 947, file 0000947
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
Aborted (core dumped)

This error is seen on a Linux machine with many CPUs (>48), but it may also be the case with lower CPU counts. I do not run into it using a Mac.

sagesimhon avatar Aug 14 '23 22:08 sagesimhon

UPDATE: I believe I have isolated the issue; it may be coming from the mi.load_file() method, i.e. the XML loading method. My process loads many XML files, and this function may be problematic, potentially running out of threads or file pointers or something else if they are not closed properly -- just guessing at the root cause.

sagesimhon avatar Aug 15 '23 04:08 sagesimhon

I have ported all my code to use the 'dict' data instead of xml -- and am still getting the error : ( @njroussel Is there anything else you need for the reproducer?

sagesimhon avatar Aug 17 '23 06:08 sagesimhon

I haven't been able to reproduce it yet... You might want to turn off parallel scene loading if you think that is the issue here. (documentation)
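
For reference, a minimal sketch of what that could look like, assuming a Mitsuba version whose `mi.load_dict()` accepts the `parallel` keyword (`load_scene_serial` is just an illustrative helper name, and `mi` is the already-imported `mitsuba` module with a variant set):

```python
def load_scene_serial(mi, scene_dict):
    """Load a scene dictionary with parallel scene loading turned off.

    Hypothetical helper for illustration. `mi` is the imported `mitsuba`
    module; passing it in keeps this snippet importable on machines
    without Mitsuba installed.
    """
    # Assumption: this Mitsuba release supports `parallel=` on load_dict().
    return mi.load_dict(scene_dict, parallel=False)


# Usage:
# import mitsuba as mi
# mi.set_variant("llvm_ad_rgb")
# scene = load_scene_serial(mi, {"type": "scene"})
```

In those versions `mi.load_file()` should take the same `parallel` argument, if you are still loading XML files.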

njroussel avatar Aug 31 '23 13:08 njroussel

Thanks, I will try it and hopefully it will not degrade performance too much -- the problem definitely seems to come from the load_dict() method.

sagesimhon avatar Sep 04 '23 20:09 sagesimhon

Hello, in my situation, it seems that mi.util.write_bitmap() also causes this error. In my code, this function will be called many times. Is this error also due to lack of file pointers?

FYRichie avatar Sep 05 '23 17:09 FYRichie

Something with the asynchronous job manager in drjit is going awry. The write_bitmap() function writes the files asynchronously by default, you can change it. It might therefore be related.
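
A minimal sketch of the synchronous variant, assuming the `write_async` keyword of `mi.util.write_bitmap()` (`write_bitmap_blocking` is an illustrative helper name, and `mi` is the already-imported `mitsuba` module):

```python
def write_bitmap_blocking(mi, filename, image):
    """Write an image to disk and return only once the file is written.

    Hypothetical helper for illustration. `mi` is the imported `mitsuba`
    module; `image` is whatever you currently pass to write_bitmap().
    """
    # write_async=False avoids handing the write to the asynchronous
    # job manager, at the cost of blocking the calling thread.
    mi.util.write_bitmap(filename, image, write_async=False)


# Usage inside the loop:
# write_bitmap_blocking(mi, "out_0000947.png", image)
```

If the assertion no longer shows up with synchronous writes, that would point more strongly at the asynchronous job manager.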

njroussel avatar Sep 06 '23 06:09 njroussel

Hi all - writing to say that I have also been experiencing this issue. I can't share my reproducer (~200 lines) publicly but am happy to email with someone on the team if that would be helpful.

kach avatar Oct 05 '23 16:10 kach

I have a similar issue. I create a room with some furniture in Blender, export this room with Mitsuba to an .xml file, and load this file in Sionna with load_scene(...). Since I want to get the channel impulse response at many different locations in the room, I run a for loop over an array of receiver positions (rx_pos) and reset the position of the receiver (rx) in the scene in each iteration as follows:

for i_c, i_rx_pos in enumerate(rx_pos):
    rx.position = i_rx_pos  # set the position of the receiver
    paths = scene.compute_paths(max_depth=3)

After a random number of iterations (typically somewhere between 1000 and 4000), the process finishes with the following output:

Assertion failed in /project/ext/drjit-core/ext/nanothread/src/queue.cpp:354: remain == 1
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

I use only CPUs (no GPU) on Ubuntu. "lsb_release -a" yields:

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal

Python version: 3.9.17 (main, Jul 5 2023, 20:41:20) [GCC 11.2.0]
Mitsuba version: '3.4.0'
Dr. Jit version: '0.4.3'

member67 avatar Nov 24 '23 09:11 member67

Hi @sagesimhon -- I just wanted to access https://github.com/sagesimhon/totem_plus but cannot. Is it a private repository?

wjakob avatar Dec 02 '23 22:12 wjakob