
[🐛 bug report] mi.load_dict(mi.cornell_box()) will cause segmentation fault

Open Relento opened this issue 3 years ago • 10 comments

Summary

Running mi.load_dict(mi.cornell_box()) frequently leads to a segmentation fault (though sometimes it does not). I tried both a prebuilt wheel and compiling from source, with the same result.

System configuration

  OS: Ubuntu 20.04.1 LTS
  CPU: Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz
  GPU: TITAN RTX
  Python: 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
  NVidia driver: 450.66
  LLVM: 12.0.0

  Dr.Jit: 0.2.2
  Mitsuba: 3.0.2
     Is custom build? True
     Compiled with: Clang 10.0.0
     Variants:
        scalar_rgb
        llvm_ad_rgb

Description / Steps to reproduce

I tried to run the example script.

# Import the library using the alias "mi"
import mitsuba as mi
# Set the variant of the renderer
mi.set_variant('scalar_rgb')
# Load a scene
scene = mi.load_dict(mi.cornell_box())
# Render the scene
img = mi.render(scene)
# Write the rendered image to an EXR file
mi.Bitmap(img).write('cbox.exr')

This frequently leads to a segmentation fault. I have encountered three outcomes when running this script:

  1. It prints something like [1] 3181085 segmentation fault python example.py
  2. Another error:
free(): invalid pointer
[1]    3181505 abort      python example.py
  3. The script finishes without error and the image is correctly rendered.

I encounter these three outcomes randomly when running the same script multiple times.

I also ran pytest mitsuba3/src/core/tests/test_dict.py. It seems that all tests that load a scene dict have this issue (specifically test07_dict_scene, test09_dict_scene_reference, test10_dict_expand_nested_object). The other tests in test_dict.py pass.

Relento avatar Sep 21 '22 09:09 Relento

Hi @Relento ,

Could you please run this code through gdb and report the full stack trace? That would help us better understand the problem.

Speierers avatar Sep 21 '22 09:09 Speierers

Thanks for your prompt reply! This is the gdb log:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff65c6700 (LWP 3363514)]
[New Thread 0x7fffe39f2700 (LWP 3363517)]
[New Thread 0x7fffe31f1700 (LWP 3363518)]
[New Thread 0x7fffe232e700 (LWP 3363519)]
[Thread 0x7ffff65c6700 (LWP 3363514) exited]
[Detaching after fork from child process 3363520]
[New Thread 0x7ffff65c6700 (LWP 3363536)]
[New Thread 0x7fffae5d1700 (LWP 3363537)]
[New Thread 0x7fffaddd0700 (LWP 3363538)]
[New Thread 0x7fffad5cf700 (LWP 3363539)]
[New Thread 0x7fffacdce700 (LWP 3363540)]
[New Thread 0x7fff9ffff700 (LWP 3363541)]
[New Thread 0x7fff9f7fe700 (LWP 3363542)]
[New Thread 0x7fff9effd700 (LWP 3363543)]
[New Thread 0x7fff9e7fc700 (LWP 3363544)]
[New Thread 0x7fff9dffb700 (LWP 3363545)]
[New Thread 0x7fff9d7fa700 (LWP 3363546)]
[New Thread 0x7fff9cff9700 (LWP 3363547)]
[New Thread 0x7fff83fff700 (LWP 3363548)]
[New Thread 0x7fff837fe700 (LWP 3363549)]
[New Thread 0x7fff82ffd700 (LWP 3363550)]
[New Thread 0x7fff827fc700 (LWP 3363551)]
[New Thread 0x7fff81ffb700 (LWP 3363552)]
[New Thread 0x7fff817fa700 (LWP 3363553)]
[New Thread 0x7fff80ff9700 (LWP 3363554)]
[New Thread 0x7fff5ffff700 (LWP 3363555)]
[New Thread 0x7fff5f7fe700 (LWP 3363556)]
[New Thread 0x7fff5effd700 (LWP 3363557)]
[New Thread 0x7fff5e7fc700 (LWP 3363558)]
[New Thread 0x7fff5dffb700 (LWP 3363559)]
[New Thread 0x7fff5d7fa700 (LWP 3363560)]
[New Thread 0x7fff5cff9700 (LWP 3363561)]
[New Thread 0x7fff3ffff700 (LWP 3363562)]
[New Thread 0x7fff3f7fe700 (LWP 3363563)]
[New Thread 0x7fff3effd700 (LWP 3363564)]
[New Thread 0x7fff3e7fc700 (LWP 3363565)]
[New Thread 0x7fff3dffb700 (LWP 3363566)]
[New Thread 0x7fff3d7fa700 (LWP 3363567)]
[New Thread 0x7fff3cff9700 (LWP 3363568)]
[New Thread 0x7fff1ffff700 (LWP 3363569)]
[New Thread 0x7fff1f7fe700 (LWP 3363570)]
[New Thread 0x7fff1effd700 (LWP 3363571)]
[New Thread 0x7fff1e7fc700 (LWP 3363572)]
[New Thread 0x7fff1dffb700 (LWP 3363573)]
[New Thread 0x7fff1d7fa700 (LWP 3363574)]
[New Thread 0x7fff1cff9700 (LWP 3363575)]
[New Thread 0x7fff03fff700 (LWP 3363576)]
[New Thread 0x7fff037fe700 (LWP 3363577)]
[New Thread 0x7fff02ffd700 (LWP 3363578)]
[New Thread 0x7fff027fc700 (LWP 3363579)]
[New Thread 0x7fff01ffb700 (LWP 3363580)]
[New Thread 0x7fff017fa700 (LWP 3363581)]
[New Thread 0x7fff00ff9700 (LWP 3363582)]
[New Thread 0x7ffedffff700 (LWP 3363583)]
[New Thread 0x7ffedf7fe700 (LWP 3363584)]
[New Thread 0x7ffedeffd700 (LWP 3363585)]
[New Thread 0x7ffede7fc700 (LWP 3363586)]
[New Thread 0x7ffeddffb700 (LWP 3363587)]
[New Thread 0x7ffedd7fa700 (LWP 3363588)]
[New Thread 0x7ffedcff9700 (LWP 3363589)]
[New Thread 0x7ffec3fff700 (LWP 3363590)]
[New Thread 0x7ffec37fe700 (LWP 3363591)]
[New Thread 0x7ffec2ffd700 (LWP 3363592)]
[New Thread 0x7ffec27fc700 (LWP 3363593)]
[New Thread 0x7ffec1ffb700 (LWP 3363594)]
[New Thread 0x7ffec17fa700 (LWP 3363595)]
[New Thread 0x7ffec0ff9700 (LWP 3363596)]
[New Thread 0x7ffe9ffff700 (LWP 3363597)]
[New Thread 0x7ffe9f7fe700 (LWP 3363598)]
[New Thread 0x7ffe9effd700 (LWP 3363599)]
[New Thread 0x7ffe9e7fc700 (LWP 3363600)]
[New Thread 0x7ffe9dffb700 (LWP 3363601)]
[New Thread 0x7ffe9d7fa700 (LWP 3363602)]
[New Thread 0x7ffe9cff9700 (LWP 3363603)]
[New Thread 0x7ffe7ffff700 (LWP 3363604)]
[New Thread 0x7ffe7f7fe700 (LWP 3363605)]
[New Thread 0x7ffe7effd700 (LWP 3363606)]
[New Thread 0x7ffe7e7fc700 (LWP 3363607)]

Thread 7 "DrJit worker 2" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffae5d1700 (LWP 3363537)]
0x00007fffe42a91ec in embree::TaskScheduler::steal_from_other_threads(embree::TaskScheduler::Thread&) () from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so

Relento avatar Sep 21 '22 18:09 Relento

I tried on another environment with an RTX 3090. The scalar_rgb variant produces the same segmentation fault, and the gdb trace appears to be the same. However, when I switched to cuda_ad_rgb, it worked without any error.

This is the environment information:

System information:

  OS: Ubuntu 20.04.2 LTS
  CPU: AMD EPYC 7402 24-Core Processor
  GPU: NVIDIA GeForce RTX 3090
  Python: 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
  NVidia driver: 515.43.04
  LLVM: 12.0.0

  Dr.Jit: 0.2.2
  Mitsuba: 3.0.2
     Is custom build? True
     Compiled with: Clang 10.0.0
     Variants:
        scalar_rgb
        scalar_spectral
        cuda_ad_rgb
        llvm_ad_rgb

Relento avatar Sep 21 '22 18:09 Relento

I was expecting the full backtrace from GDB. For this, could you please compile in Debug mode, run your script under gdb, and then enter the backtrace command once it crashes?
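
In case it is useful, a typical sequence for this looks roughly as follows (standard CMake and gdb usage; the build directory and script name are assumptions about your setup):

# Reconfigure and rebuild in Debug mode
cd mitsuba3/build
cmake -DCMAKE_BUILD_TYPE=Debug ..
ninja

# Run the failing script under gdb, then print the backtrace after the crash
gdb --args python example.py
(gdb) run
(gdb) backtrace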

Also I would be curious to know whether this issue occurs with the llvm_ad_rgb variant as well.

Finally, you could play with the parallel argument of mi.load_dict() to see whether that makes a difference.

Let me know about your findings, it would be great to get this bug fixed! ⚔️

Speierers avatar Sep 22 '22 07:09 Speierers

This is the backtrace for scalar_rgb:

#0  0x00007fffea2a91ec in embree::TaskScheduler::steal_from_other_threads(embree::TaskScheduler::Thread&) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#1  0x00007fffea2a89af in embree::TaskScheduler::thread_loop(unsigned long) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#2  0x00007fffea2a900a in embree::TaskScheduler::join() ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#3  0x00007fffe9a76d19 in embree::Scene::commit(bool) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#4  0x00007fffe9a421c1 in rtcJoinCommitScene ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#5  0x00007ffff7fb01f0 in pool_execute_task(Pool*, bool (*)(void*), void*) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libnanothread.so
#6  0x00007ffff7fb03d3 in Worker::run() ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libnanothread.so
#7  0x00007ffff7fb0478 in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (Worker::*)(), Worker*> >(void*) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libnanothread.so
#8  0x00007ffff7f90609 in start_thread ()

And this is the backtrace for llvm_ad_rgb:

#0  0x00007fffea2a91ec in embree::TaskScheduler::steal_from_other_threads(embree::TaskScheduler::Thread&) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#1  0x00007fffea2a89af in embree::TaskScheduler::thread_loop(unsigned long) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#2  0x00007fffea2a900a in embree::TaskScheduler::join() ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#3  0x00007fffe9a76d19 in embree::Scene::commit(bool) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#4  0x00007fffe9a421c1 in rtcJoinCommitScene ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#5  0x00007ffff7fb01f0 in pool_execute_task(Pool*, bool (*)(void*), void*) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libnanothread.so
#6  0x00007ffff7fb03d3 in Worker::run() ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libnanothread.so
#7  0x00007ffff7fb0478 in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (Worker::*)(), Worker*> >(void*) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libnanothread.so
#8  0x00007ffff7f90609 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#9  0x00007ffff7eb7103 in clone () from /lib/x86_64-linux-gnu/libc.so.6

There is no parallel argument for mi.load_dict(). I tried playing with the parallel argument of the mi.load_file function, but the issue persists regardless of its value.
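
For completeness, this is roughly what I tried (the scene path is a placeholder; the crash occurred with both parallel=True and parallel=False):

import mitsuba as mi
mi.set_variant('scalar_rgb')

# Load the scene from an XML file with parallel scene loading disabled
scene = mi.load_file('path/to/cbox.xml', parallel=False)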

Relento avatar Sep 23 '22 02:09 Relento

I encountered this issue by calling render twice (on the same scene with different parameters) and then writing both images out to file via util.convert_to_bitmap. When restricted to one core via taskset 1, this crashes ~10% of the time. The issue seems to go away if you intersperse the calls to render with the file saving.

kach avatar Oct 05 '23 16:10 kach

Hi @kach

Could you share some pseudocode just to make sure I understand correctly? I'll try to see if I can replicate this bug.

njroussel avatar Oct 06 '23 07:10 njroussel

Something like this:

scene = make_scene()
params = mi.traverse(scene)

install_param_settings_A(params); params.update()
img_A = mi.render(scene, params)

install_param_settings_B(params); params.update()
img_B = mi.render(scene, params)

plg.imsave('A.png', mi.util.convert_to_bitmap(img_A))
plg.imsave('B.png', mi.util.convert_to_bitmap(img_B))
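
And the variant that seems to avoid the crash, with the file saving interspersed between the two renders (same hypothetical helpers as above):

install_param_settings_A(params); params.update()
img_A = mi.render(scene, params)
plg.imsave('A.png', mi.util.convert_to_bitmap(img_A))

install_param_settings_B(params); params.update()
img_B = mi.render(scene, params)
plg.imsave('B.png', mi.util.convert_to_bitmap(img_B))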

kach avatar Oct 10 '23 00:10 kach

I'm unable to reproduce this.

Please open a new issue with all the required information. I'm not sure what plg is or does, but we include utilities to write images to a file:
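
For example, something along these lines (a minimal sketch: mi.Bitmap(...).write() is the call used in the reproduction script above, and mi.util.write_bitmap is assumed to be available in your Mitsuba 3 version):

import mitsuba as mi
mi.set_variant('scalar_rgb')

scene = mi.load_dict(mi.cornell_box())
img = mi.render(scene)

# Write the raw rendering to an EXR file via the Bitmap class
mi.Bitmap(img).write('render.exr')

# Or use the convenience helper, which picks the format from the file extension
mi.util.write_bitmap('render.png', img)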

njroussel avatar Oct 10 '23 06:10 njroussel

Oh, I'm sorry, I meant to type plt as in Matplotlib. But that's good to know about the built-in utilities to write images to files — thanks! :)

I spent some time trying to get a minimal crashing program, but no luck… I will open a new issue if I come up with something useful for you.

Thanks again!

kach avatar Oct 11 '23 03:10 kach