
[🐛 bug report] mi.load_dict(mi.cornell_box()) will cause segmentation fault

Open Relento opened this issue 3 years ago • 10 comments

Summary

Running mi.load_dict(mi.cornell_box()) frequently leads to a segmentation fault (though sometimes it does not). I tried both a prebuilt wheel and compiling from source, with the same result.

System configuration

  OS: Ubuntu 20.04.1 LTS
  CPU: Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz
  GPU: TITAN RTX
  Python: 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
  NVidia driver: 450.66
  LLVM: 12.0.0

  Dr.Jit: 0.2.2
  Mitsuba: 3.0.2
     Is custom build? True
     Compiled with: Clang 10.0.0
     Variants:
        scalar_rgb
        llvm_ad_rgb

Description / Steps to reproduce

I tried to run the example script.

# Import the library using the alias "mi"
import mitsuba as mi
# Set the variant of the renderer
mi.set_variant('scalar_rgb')
# Load a scene
scene = mi.load_dict(mi.cornell_box())
# Render the scene
img = mi.render(scene)
# Write the rendered image to an EXR file
mi.Bitmap(img).write('cbox.exr')

This frequently leads to a segmentation fault. I have encountered three outcomes when running this script:

  1. It prints something like [1] 3181085 segmentation fault python example.py
  2. Another error:
free(): invalid pointer
[1]    3181505 abort      python example.py
  3. The script finishes without error and the image is correctly rendered.

I encounter these three outcomes randomly when running the same script multiple times.

I also ran pytest mitsuba3/src/core/tests/test_dict.py. It seems that all tests that load a scene dict have this issue (specifically test07_dict_scene, test09_dict_scene_reference, test10_dict_expand_nested_object). The other tests in test_dict.py pass.

Relento avatar Sep 21 '22 09:09 Relento

Hi @Relento ,

Could you please run this code through gdb and report the full stack trace? That would help us better understand the problem.

Speierers avatar Sep 21 '22 09:09 Speierers

Thanks for your prompt reply! This is the gdb log:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff65c6700 (LWP 3363514)]
[New Thread 0x7fffe39f2700 (LWP 3363517)]
[New Thread 0x7fffe31f1700 (LWP 3363518)]
[New Thread 0x7fffe232e700 (LWP 3363519)]
[Thread 0x7ffff65c6700 (LWP 3363514) exited]
[Detaching after fork from child process 3363520]
[New Thread 0x7ffff65c6700 (LWP 3363536)]
[New Thread 0x7fffae5d1700 (LWP 3363537)]
[New Thread 0x7fffaddd0700 (LWP 3363538)]
[New Thread 0x7fffad5cf700 (LWP 3363539)]
[New Thread 0x7fffacdce700 (LWP 3363540)]
[New Thread 0x7fff9ffff700 (LWP 3363541)]
[New Thread 0x7fff9f7fe700 (LWP 3363542)]
[New Thread 0x7fff9effd700 (LWP 3363543)]
[New Thread 0x7fff9e7fc700 (LWP 3363544)]
[New Thread 0x7fff9dffb700 (LWP 3363545)]
[New Thread 0x7fff9d7fa700 (LWP 3363546)]
[New Thread 0x7fff9cff9700 (LWP 3363547)]
[New Thread 0x7fff83fff700 (LWP 3363548)]
[New Thread 0x7fff837fe700 (LWP 3363549)]
[New Thread 0x7fff82ffd700 (LWP 3363550)]
[New Thread 0x7fff827fc700 (LWP 3363551)]
[New Thread 0x7fff81ffb700 (LWP 3363552)]
[New Thread 0x7fff817fa700 (LWP 3363553)]
[New Thread 0x7fff80ff9700 (LWP 3363554)]
[New Thread 0x7fff5ffff700 (LWP 3363555)]
[New Thread 0x7fff5f7fe700 (LWP 3363556)]
[New Thread 0x7fff5effd700 (LWP 3363557)]
[New Thread 0x7fff5e7fc700 (LWP 3363558)]
[New Thread 0x7fff5dffb700 (LWP 3363559)]
[New Thread 0x7fff5d7fa700 (LWP 3363560)]
[New Thread 0x7fff5cff9700 (LWP 3363561)]
[New Thread 0x7fff3ffff700 (LWP 3363562)]
[New Thread 0x7fff3f7fe700 (LWP 3363563)]
[New Thread 0x7fff3effd700 (LWP 3363564)]
[New Thread 0x7fff3e7fc700 (LWP 3363565)]
[New Thread 0x7fff3dffb700 (LWP 3363566)]
[New Thread 0x7fff3d7fa700 (LWP 3363567)]
[New Thread 0x7fff3cff9700 (LWP 3363568)]
[New Thread 0x7fff1ffff700 (LWP 3363569)]
[New Thread 0x7fff1f7fe700 (LWP 3363570)]
[New Thread 0x7fff1effd700 (LWP 3363571)]
[New Thread 0x7fff1e7fc700 (LWP 3363572)]
[New Thread 0x7fff1dffb700 (LWP 3363573)]
[New Thread 0x7fff1d7fa700 (LWP 3363574)]
[New Thread 0x7fff1cff9700 (LWP 3363575)]
[New Thread 0x7fff03fff700 (LWP 3363576)]
[New Thread 0x7fff037fe700 (LWP 3363577)]
[New Thread 0x7fff02ffd700 (LWP 3363578)]
[New Thread 0x7fff027fc700 (LWP 3363579)]
[New Thread 0x7fff01ffb700 (LWP 3363580)]
[New Thread 0x7fff017fa700 (LWP 3363581)]
[New Thread 0x7fff00ff9700 (LWP 3363582)]
[New Thread 0x7ffedffff700 (LWP 3363583)]
[New Thread 0x7ffedf7fe700 (LWP 3363584)]
[New Thread 0x7ffedeffd700 (LWP 3363585)]
[New Thread 0x7ffede7fc700 (LWP 3363586)]
[New Thread 0x7ffeddffb700 (LWP 3363587)]
[New Thread 0x7ffedd7fa700 (LWP 3363588)]
[New Thread 0x7ffedcff9700 (LWP 3363589)]
[New Thread 0x7ffec3fff700 (LWP 3363590)]
[New Thread 0x7ffec37fe700 (LWP 3363591)]
[New Thread 0x7ffec2ffd700 (LWP 3363592)]
[New Thread 0x7ffec27fc700 (LWP 3363593)]
[New Thread 0x7ffec1ffb700 (LWP 3363594)]
[New Thread 0x7ffec17fa700 (LWP 3363595)]
[New Thread 0x7ffec0ff9700 (LWP 3363596)]
[New Thread 0x7ffe9ffff700 (LWP 3363597)]
[New Thread 0x7ffe9f7fe700 (LWP 3363598)]
[New Thread 0x7ffe9effd700 (LWP 3363599)]
[New Thread 0x7ffe9e7fc700 (LWP 3363600)]
[New Thread 0x7ffe9dffb700 (LWP 3363601)]
[New Thread 0x7ffe9d7fa700 (LWP 3363602)]
[New Thread 0x7ffe9cff9700 (LWP 3363603)]
[New Thread 0x7ffe7ffff700 (LWP 3363604)]
[New Thread 0x7ffe7f7fe700 (LWP 3363605)]
[New Thread 0x7ffe7effd700 (LWP 3363606)]
[New Thread 0x7ffe7e7fc700 (LWP 3363607)]

Thread 7 "DrJit worker 2" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffae5d1700 (LWP 3363537)]
0x00007fffe42a91ec in embree::TaskScheduler::steal_from_other_threads(embree::TaskScheduler::Thread&) () from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so

Relento avatar Sep 21 '22 18:09 Relento

I tried on another environment with an RTX 3090. The scalar_rgb variant produces the same segmentation fault, and the gdb trace appears to be the same. However, when I switched to cuda_ad_rgb, it worked without any error.

This is the environment information:

System information:

  OS: Ubuntu 20.04.2 LTS
  CPU: AMD EPYC 7402 24-Core Processor
  GPU: NVIDIA GeForce RTX 3090
  Python: 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
  NVidia driver: 515.43.04
  LLVM: 12.0.0

  Dr.Jit: 0.2.2
  Mitsuba: 3.0.2
     Is custom build? True
     Compiled with: Clang 10.0.0
     Variants:
        scalar_rgb
        scalar_spectral
        cuda_ad_rgb
        llvm_ad_rgb

Relento avatar Sep 21 '22 18:09 Relento

I was expecting the full backtrace from GDB. For this, could you please compile in Debug mode, run your script under gdb, and then enter the backtrace command once it crashes?
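
In case it is useful, a typical sequence for this looks roughly as follows (standard CMake and gdb usage; the build directory and script name are assumptions about your setup):

# Reconfigure and rebuild in Debug mode
cd mitsuba3/build
cmake -DCMAKE_BUILD_TYPE=Debug ..
ninja

# Run the failing script under gdb, then print the backtrace after the crash
gdb --args python example.py
(gdb) run
(gdb) backtrace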

Also I would be curious to know whether this issue occurs with the llvm_ad_rgb variant as well.

Finally, you could play with the parallel argument of mi.load_dict() to see whether that makes a difference.

Let me know about your findings, it would be great to get this bug fixed! ⚔️

Speierers avatar Sep 22 '22 07:09 Speierers

This is the backtrace for scalar_rgb:

#0  0x00007fffea2a91ec in embree::TaskScheduler::steal_from_other_threads(embree::TaskScheduler::Thread&) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#1  0x00007fffea2a89af in embree::TaskScheduler::thread_loop(unsigned long) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#2  0x00007fffea2a900a in embree::TaskScheduler::join() ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#3  0x00007fffe9a76d19 in embree::Scene::commit(bool) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#4  0x00007fffe9a421c1 in rtcJoinCommitScene ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#5  0x00007ffff7fb01f0 in pool_execute_task(Pool*, bool (*)(void*), void*) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libnanothread.so
#6  0x00007ffff7fb03d3 in Worker::run() ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libnanothread.so
#7  0x00007ffff7fb0478 in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (Worker::*)(), Worker*> >(void*) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libnanothread.so
#8  0x00007ffff7f90609 in start_thread ()

And this is the backtrace for llvm_ad_rgb:

#0  0x00007fffea2a91ec in embree::TaskScheduler::steal_from_other_threads(embree::TaskScheduler::Thread&) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#1  0x00007fffea2a89af in embree::TaskScheduler::thread_loop(unsigned long) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#2  0x00007fffea2a900a in embree::TaskScheduler::join() ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#3  0x00007fffe9a76d19 in embree::Scene::commit(bool) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#4  0x00007fffe9a421c1 in rtcJoinCommitScene ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libembree3.so
#5  0x00007ffff7fb01f0 in pool_execute_task(Pool*, bool (*)(void*), void*) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libnanothread.so
#6  0x00007ffff7fb03d3 in Worker::run() ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libnanothread.so
#7  0x00007ffff7fb0478 in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (Worker::*)(), Worker*> >(void*) ()
   from /sailhome/rcwang/data/labs/mitsuba/mitsuba3/build/libnanothread.so
#8  0x00007ffff7f90609 in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#9  0x00007ffff7eb7103 in clone () from /lib/x86_64-linux-gnu/libc.so.6

There is no parallel argument for mi.load_dict(). I tried playing with the parallel argument of the mi.load_file function, but the issue persists regardless of its value.
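
For completeness, this is roughly what I tried (the scene path is a placeholder; the crash occurred with both parallel=True and parallel=False):

import mitsuba as mi
mi.set_variant('scalar_rgb')

# Load the scene from an XML file with parallel scene loading disabled
scene = mi.load_file('path/to/cbox.xml', parallel=False)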

Relento avatar Sep 23 '22 02:09 Relento

I encountered this issue by calling render twice (on the same scene with different parameters) and then writing both images out to file via util.convert_to_bitmap. When restricted to one core via taskset 1, this crashes ~10% of the time. The issue seems to go away if you intersperse the calls to render with the file saving.

kach avatar Oct 05 '23 16:10 kach

Hi @kach

Could you share some pseudocode just to make sure I understand correctly? I'll try to see if I can replicate this bug.

njroussel avatar Oct 06 '23 07:10 njroussel

Something like this:

scene = make_scene()
params = mi.traverse(scene)

install_param_settings_A(params); params.update()
img_A = mi.render(scene, params)

install_param_settings_B(params); params.update()
img_B = mi.render(scene, params)

plg.imsave('A.png', mi.util.convert_to_bitmap(img_A))
plg.imsave('B.png', mi.util.convert_to_bitmap(img_B))
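
And the variant that seems to avoid the crash, with the file saving interspersed between the two renders (same hypothetical helpers as above):

install_param_settings_A(params); params.update()
img_A = mi.render(scene, params)
plg.imsave('A.png', mi.util.convert_to_bitmap(img_A))

install_param_settings_B(params); params.update()
img_B = mi.render(scene, params)
plg.imsave('B.png', mi.util.convert_to_bitmap(img_B))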

kach avatar Oct 10 '23 00:10 kach

I'm unable to reproduce this.

Please open a new issue with all the required information. I'm not sure what plg is or does, but we include utilities to write images to a file:
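
For example, something along these lines (a minimal sketch: mi.Bitmap(...).write() is the call used in the reproduction script above, and mi.util.write_bitmap is assumed to be available in your Mitsuba 3 version):

import mitsuba as mi
mi.set_variant('scalar_rgb')

scene = mi.load_dict(mi.cornell_box())
img = mi.render(scene)

# Write the raw rendering to an EXR file via the Bitmap class
mi.Bitmap(img).write('render.exr')

# Or use the convenience helper, which picks the format from the file extension
mi.util.write_bitmap('render.png', img)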

njroussel avatar Oct 10 '23 06:10 njroussel

Oh, I'm sorry, I meant to type plt as in Matplotlib. But that's good to know about the built-in utilities to write images to files — thanks! :)

I spent some time trying to get a minimal crashing program, but no luck… I will open a new issue if I come up with something useful for you.

Thanks again!

kach avatar Oct 11 '23 03:10 kach