mitsuba2
CUDA Memory Alloc Error
I am trying to render the cbox-spectral scene on the GPU, and I am getting a cuda_malloc error. The debug output shows that 6.5664 GiB of device memory is freed after the allocation error, which is a huge amount of memory for such a small scene.
The GPU configuration is as follows:
"gpu_spectral": {
"float": "CUDAArray<float>",
"spectrum": "Spectrum<Float, 16>"
},
Here is the output at debug level 2:
2020-07-07 11:02:30 INFO main [mitsuba.cpp:206] Mitsuba version 2.1.0 (build-experiment[d7b5b97e], Linux, 64bit, 12 threads, 16-wide SIMD)
2020-07-07 11:02:30 INFO main [mitsuba.cpp:207] Copyright 2020, Realistic Graphics Lab, EPFL
2020-07-07 11:02:30 INFO main [mitsuba.cpp:208] Enabled processor features: cuda avx512f avx512cd avx512dq avx512vl avx512bw avx2 avx fma f16c sse4.2 x86_64
2020-07-07 11:02:30 WARN main [mitsuba.cpp:211] Renderer is compiled in debug mode, performance will be considerably reduced.
2020-07-07 11:02:30 INFO main [xml.cpp:1182] Loading XML file "scenes/cbox/cbox-spectral.xml" ..
2020-07-07 11:02:30 INFO main [xml.cpp:1183] Using variant "gpu_spectral"
2020-07-07 11:02:30 INFO main [xml.cpp:354] "scenes/cbox/cbox-spectral.xml": in-memory version upgrade (v2.0.0 -> v2.1.0) ..
2020-07-07 11:02:30 INFO main [xml.cpp:639] Loading included XML file "scenes/cbox/fragments/base.xml" ..
2020-07-07 11:02:30 INFO main [xml.cpp:354] "scenes/cbox/fragments/base.xml": in-memory version upgrade (v2.0.0 -> v2.1.0) ..
2020-07-07 11:02:30 INFO main [xml.cpp:639] Loading included XML file "scenes/cbox/fragments/bsdfs-spectral.xml" ..
2020-07-07 11:02:30 INFO main [xml.cpp:354] "scenes/cbox/fragments/bsdfs-spectral.xml": in-memory version upgrade (v2.0.0 -> v2.1.0) ..
2020-07-07 11:02:30 INFO main [PluginManager] Loading plugin "plugins/regular.so" ..
2020-07-07 11:02:30 INFO main [xml.cpp:639] Loading included XML file "scenes/cbox/fragments/shapes.xml" ..
2020-07-07 11:02:30 INFO main [xml.cpp:354] "scenes/cbox/fragments/shapes.xml": in-memory version upgrade (v2.0.0 -> v2.1.0) ..
2020-07-07 11:02:30 INFO main [PluginManager] Loading plugin "plugins/path.so" ..
2020-07-07 11:02:30 INFO main [PluginManager] Loading plugin "plugins/independent.so" ..
2020-07-07 11:02:30 INFO main [PluginManager] Loading plugin "plugins/box.so" ..
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 187 us, ptx compilation: 511 us, 14 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 87 us, ptx compilation: 127 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 57 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 129 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 48 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 118 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 121 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 120 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 118 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 120 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 48 us, ptx compilation: 120 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 117 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 47 us, ptx compilation: 117 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 118 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 118 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 45 us, ptx compilation: 116 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 47 us, ptx compilation: 116 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 116 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 45 us, ptx compilation: 117 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 116 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 117 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 116 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 60 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 118 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 117 us, 15 registers
2020-07-07 11:02:30 INFO main [PluginManager] Loading plugin "plugins/hdrfilm.so" ..
2020-07-07 11:02:30 INFO main [PluginManager] Loading plugin "plugins/perspective.so" ..
2020-07-07 11:02:30 INFO main [PluginManager] Loading plugin "plugins/diffuse.so" ..
2020-07-07 11:02:30 INFO main [PluginManager] Loading plugin "plugins/area.so" ..
2020-07-07 11:02:30 INFO main [PluginManager] Loading plugin "plugins/d65.so" ..
2020-07-07 11:02:30 INFO main [PluginManager] Loading plugin "plugins/obj.so" ..
2020-07-07 11:02:30 DEBUG main [OBJMesh] Loading mesh from "cbox_luminaire.obj" ..
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_luminaire.obj": read 2 faces, 4 vertices (120 B in 0ms)
cuda_eval(): launching kernel (n=2, in=5, out=24, ops=358)
cuda_jit_run(): cache hit, jit: 866 us, ptx compilation: 196 us, 40 registers
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_luminaire.obj": computed vertex normals (took 2ms)
cuda_eval(): begin parallel group
cuda_eval(): launching kernel (n=4, in=4, out=3, ops=31)
cuda_jit_run(): cache hit, jit: 91 us, ptx compilation: 130 us, 19 registers
cuda_eval(): launching kernel (n=2, in=2, out=2, ops=111)
cuda_jit_run(): cache hit, jit: 260 us, ptx compilation: 141 us, 22 registers
cuda_eval(): launching kernel (n=1, in=0, out=1, ops=1)
cuda_jit_run(): cache hit, jit: 28 us, ptx compilation: 114 us, 7 registers
cuda_eval(): end parallel group
2020-07-07 11:02:30 DEBUG main [OBJMesh] Loading mesh from "cbox_floor.obj" ..
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_floor.obj": read 2 faces, 4 vertices (120 B in 0ms)
cuda_eval(): launching kernel (n=2, in=5, out=24, ops=358)
cuda_jit_run(): cache hit, jit: 767 us
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_floor.obj": computed vertex normals (took 2ms)
2020-07-07 11:02:30 DEBUG main [OBJMesh] Loading mesh from "cbox_ceiling.obj" ..
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_ceiling.obj": read 4 faces, 8 vertices (240 B in 0ms)
cuda_eval(): launching kernel (n=4, in=9, out=27, ops=389)
cuda_jit_run(): cache hit, jit: 887 us, ptx compilation: 193 us, 40 registers
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_ceiling.obj": computed vertex normals (took 2ms)
2020-07-07 11:02:30 DEBUG main [OBJMesh] Loading mesh from "cbox_back.obj" ..
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_back.obj": read 2 faces, 4 vertices (120 B in 0ms)
cuda_eval(): begin parallel group
cuda_eval(): launching kernel (n=8, in=4, out=3, ops=31)
cuda_jit_run(): cache hit, jit: 100 us
cuda_eval(): launching kernel (n=2, in=5, out=24, ops=358)
cuda_jit_run(): cache hit, jit: 742 us
cuda_eval(): end parallel group
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_back.obj": computed vertex normals (took 2ms)
2020-07-07 11:02:30 DEBUG main [OBJMesh] Loading mesh from "cbox_greenwall.obj" ..
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_greenwall.obj": read 2 faces, 4 vertices (120 B in 0ms)
cuda_eval(): begin parallel group
cuda_eval(): launching kernel (n=4, in=4, out=3, ops=31)
cuda_jit_run(): cache hit, jit: 91 us
cuda_eval(): launching kernel (n=2, in=5, out=24, ops=358)
cuda_jit_run(): cache hit, jit: 752 us
cuda_eval(): end parallel group
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_greenwall.obj": computed vertex normals (took 2ms)
2020-07-07 11:02:30 DEBUG main [OBJMesh] Loading mesh from "cbox_redwall.obj" ..
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_redwall.obj": read 2 faces, 4 vertices (120 B in 0ms)
cuda_eval(): begin parallel group
cuda_eval(): launching kernel (n=4, in=4, out=3, ops=31)
cuda_jit_run(): cache hit, jit: 90 us
cuda_eval(): launching kernel (n=2, in=5, out=24, ops=358)
cuda_jit_run(): cache hit, jit: 721 us
cuda_eval(): end parallel group
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_redwall.obj": computed vertex normals (took 2ms)
2020-07-07 11:02:30 DEBUG main [OBJMesh] Loading mesh from "cbox_smallbox.obj" ..
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_smallbox.obj": read 12 faces, 24 vertices (720 B in 0ms)
2020-07-07 11:02:30 DEBUG main [OBJMesh] Loading mesh from "cbox_largebox.obj" ..
2020-07-07 11:02:30 DEBUG main [OBJMesh] "cbox_largebox.obj": read 12 faces, 24 vertices (720 B in 0ms)
2020-07-07 11:02:30 INFO main [Scene] Building scene in OptiX ..
cuda_eval(): launching kernel (n=4, in=4, out=3, ops=31)
cuda_jit_run(): cache hit, jit: 122 us
cuda_eval(): begin parallel group
Caught a critical exception: cuda_malloc(): out of memory!
cuda_shutdown()
Additionally, here are the last few lines of output at debug level 4:
...
cuda_eval(): allocated variable 4452 -> 0x6c9028000000 (67108864 bytes)
cuda_sync().
cuda_malloc_trim(): freed 49 arrays (2.0052 MiB device memory, 8 B unified memory, and 880 B host memory).
Caught a critical exception: cuda_malloc(): out of memory!
cuda_shutdown()
cuda_sync().
cuda_malloc_trim(): freed 168 arrays (6.5664 GiB device memory, 7.6719 KiB unified memory, and 9.2578 KiB host memory).
cuda_sync().
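To put the figures in this debug output on a common scale, here is a small Python sketch. It only converts the two numbers printed above; reading the 67108864-byte allocation as an array of 4-byte floats is an assumption and is not confirmed by the log.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Figures copied from the debug level 4 output above.
single_alloc_bytes = 67108864   # "allocated variable 4452 -> ... (67108864 bytes)"
freed_device_gib = 6.5664       # last cuda_malloc_trim() line

single_alloc_mib = single_alloc_bytes / MIB

# Assumption: the buffer stores 4-byte (single precision) floats.
float32_entries = single_alloc_bytes // 4

# How many buffers of that size add up to the freed device memory.
equivalent_allocs = freed_device_gib * GIB / single_alloc_bytes

print(f"one allocation: {single_alloc_mib:.0f} MiB ({float32_entries} float32 values)")
print(f"freed device memory is about {equivalent_allocs:.0f} allocations of that size")
```

Under that assumption, the memory released at shutdown corresponds to roughly a hundred buffers the size of that single allocation.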
Here are my GPU specs from nvidia-smi:
Tue Jul 7 19:57:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44 Driver Version: 440.44 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P4000 Off | 00000000:91:00.0 On | N/A |
| 50% 49C P0 29W / 105W | 1564MiB / 8111MiB | 13% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 625 G /usr/lib/Xorg 898MiB |
| 0 1281 G /usr/bin/kwin_x11 108MiB |
| 0 1283 G /usr/bin/plasmashell 95MiB |
| 0 1663 G ...uest-channel-token=17038705297215317792 455MiB |
+-----------------------------------------------------------------------------+
Hi @acdupont,
Your GPU is simply running out of memory here. Using 16 wavelengths in Spectrum<Float, 16> will definitely increase the memory usage of the renderer.
You can also try lowering the number of samples per pixel or the resolution of the render. Alternatively, you can split the rendering process into multiple passes using the samples_per_pass property of the integrator.
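As a rough illustration of where that property would go, the Python sketch below builds an integrator element for a scene file. The property name samples_per_pass comes from this thread and the "path" integrator from the plugin list in the log above; the helper name integrator_fragment and the use of an `<integer>` tag for the property are assumptions, so check them against the Mitsuba 2 documentation.

```python
import xml.etree.ElementTree as ET


def integrator_fragment(samples_per_pass, integrator_type="path"):
    """Build an <integrator> element with a samples_per_pass property.

    Hypothetical helper: the <integer> tag for this property is an assumption.
    """
    integrator = ET.Element("integrator", {"type": integrator_type})
    ET.SubElement(
        integrator,
        "integer",
        {"name": "samples_per_pass", "value": str(samples_per_pass)},
    )
    return ET.tostring(integrator, encoding="unicode")


fragment = integrator_fragment(16)
print(fragment)
```

The printed element would replace the existing integrator declaration in the scene XML; the total number of samples per pixel is still set on the sampler.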
Nice, I didn't know about the samples_per_pass setting. I successfully rendered on the GPU with the following settings:
256x256 resolution
256 samples per pixel
16 samples_per_pass
But this is a low resolution and a low sample count. I experimented with other settings, such as a 1024x1024 image or 512 samples per pixel, and I get a GPU allocation error. I then looked at how much memory the CPU version uses with the following settings: 1024x1024 resolution, 256 samples per pixel.
I observe about 22M of memory usage in my system monitor.
I tried this run on my GPU using 8 samples_per_pass, but got the memory allocation error. I have 8G of memory on my GPU, so why is so much more memory being used on the GPU? If 22M of memory is used when running on the CPU, is it expected that running on the GPU will take 8G+ of memory?
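One way to reason about these numbers is to count how many samples each pass has to hold in memory at once. The Python sketch below does that arithmetic under an assumption that this thread does not confirm: that the GPU variant keeps one array entry per pixel and per sample of a pass, stored as 4-byte floats. The function names are hypothetical.

```python
BYTES_PER_FLOAT = 4  # assumption: single precision storage


def entries_per_pass(width, height, samples_per_pass):
    """Number of samples processed together in one pass (assumed model)."""
    return width * height * samples_per_pass


def array_mib(entries, channels=1):
    """Size in MiB of one array with `channels` floats per entry."""
    return entries * channels * BYTES_PER_FLOAT / 1024 ** 2


# Settings mentioned in this thread: (width, height, samples per pass).
configs = {
    "256x256, 16 per pass": (256, 256, 16),
    "256x256, all 256 samples in one pass": (256, 256, 256),
    "1024x1024, 8 per pass": (1024, 1024, 8),
}

for label, (w, h, s) in configs.items():
    n = entries_per_pass(w, h, s)
    print(
        f"{label}: {n} entries, "
        f"{array_mib(n):.0f} MiB per float array, "
        f"{array_mib(n, 16):.0f} MiB per 16-wavelength array"
    )
```

Under this model, the 1024x1024 run with 8 samples per pass holds eight times as many entries per pass as the 256x256 run with 16 samples per pass that succeeded. A single pass of 256x256 at 256 samples gives 16777216 entries, which at 4 bytes each equals the 67108864-byte allocation in the debug output. Every temporary array of the renderer scales with that entry count, which may be why the total is so far above the CPU figure.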
Hi, I have a laptop with an i9-8950HK CPU (6 cores, 12 threads) and a 1080 GPU. When I tried gpu_rgb mode on cbox.xml, I found it was only about 2x faster than scalar_rgb mode (no Embree). Is that normal? I expected it to be 20-30x faster.
We are currently refactoring the whole codebase on top of a full rewrite of the Enoki library. This will very likely improve both performance and memory consumption for the gpu_* modes.
It is still going to take us a few weeks before we can release it. Stay tuned 😉
Hi, I have a laptop with an i9-8950HK CPU (6 cores, 12 threads) and a 1080 GPU. When I tried gpu_rgb mode on cbox.xml, I found it was only about 2x faster than scalar_rgb mode (no Embree). Is that normal? I expected it to be 20-30x faster.
I have a CPU with 6 cores and 12 threads and a GTX 1080, too. But my gpu_rgb version is even slower than scalar_rgb on cbox.xml, and it won't let me render at 256 resolution / 256 spp without samples_per_pass.