CUDA Memory Alloc Error

Open acdupont opened this issue 4 years ago • 5 comments

I am trying to render the cbox-spectral scene on the GPU, and I am getting a cuda_malloc error. Debug printouts show that 6.5664 GiB of device memory is freed after the allocation error, which is a huge amount of memory for such a small scene.

The GPU variant configuration is as follows:

    "gpu_spectral": {
        "float": "CUDAArray<float>",
        "spectrum": "Spectrum<Float, 16>"
    },

Here is output from debug level 2:

2020-07-07 11:02:30 INFO  main  [mitsuba.cpp:206] Mitsuba version 2.1.0 (build-experiment[d7b5b97e], Linux, 64bit, 12 threads, 16-wide SIMD)
2020-07-07 11:02:30 INFO  main  [mitsuba.cpp:207] Copyright 2020, Realistic Graphics Lab, EPFL
2020-07-07 11:02:30 INFO  main  [mitsuba.cpp:208] Enabled processor features: cuda avx512f avx512cd avx512dq avx512vl avx512bw avx2 avx fma f16c sse4.2 x86_64
2020-07-07 11:02:30 WARN  main  [mitsuba.cpp:211] Renderer is compiled in debug mode, performance will be considerably reduced.
2020-07-07 11:02:30 INFO  main  [xml.cpp:1182] Loading XML file "scenes/cbox/cbox-spectral.xml" ..
2020-07-07 11:02:30 INFO  main  [xml.cpp:1183] Using variant "gpu_spectral"
2020-07-07 11:02:30 INFO  main  [xml.cpp:354] "scenes/cbox/cbox-spectral.xml": in-memory version upgrade (v2.0.0 -> v2.1.0) ..
2020-07-07 11:02:30 INFO  main  [xml.cpp:639] Loading included XML file "scenes/cbox/fragments/base.xml" ..
2020-07-07 11:02:30 INFO  main  [xml.cpp:354] "scenes/cbox/fragments/base.xml": in-memory version upgrade (v2.0.0 -> v2.1.0) ..
2020-07-07 11:02:30 INFO  main  [xml.cpp:639] Loading included XML file "scenes/cbox/fragments/bsdfs-spectral.xml" ..
2020-07-07 11:02:30 INFO  main  [xml.cpp:354] "scenes/cbox/fragments/bsdfs-spectral.xml": in-memory version upgrade (v2.0.0 -> v2.1.0) ..
2020-07-07 11:02:30 INFO  main  [PluginManager] Loading plugin "plugins/regular.so" ..
2020-07-07 11:02:30 INFO  main  [xml.cpp:639] Loading included XML file "scenes/cbox/fragments/shapes.xml" ..
2020-07-07 11:02:30 INFO  main  [xml.cpp:354] "scenes/cbox/fragments/shapes.xml": in-memory version upgrade (v2.0.0 -> v2.1.0) ..
2020-07-07 11:02:30 INFO  main  [PluginManager] Loading plugin "plugins/path.so" ..
2020-07-07 11:02:30 INFO  main  [PluginManager] Loading plugin "plugins/independent.so" ..
2020-07-07 11:02:30 INFO  main  [PluginManager] Loading plugin "plugins/box.so" ..
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 187 us, ptx compilation: 511 us, 14 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 87 us, ptx compilation: 127 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 57 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 129 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 48 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 118 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 121 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 120 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 118 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 120 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 48 us, ptx compilation: 120 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 117 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 47 us, ptx compilation: 117 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 118 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 118 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 45 us, ptx compilation: 116 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 47 us, ptx compilation: 116 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 116 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 45 us, ptx compilation: 117 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 116 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 117 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 116 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 60 us, ptx compilation: 119 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 118 us, 15 registers
cuda_eval(): launching kernel (n=1, in=0, out=3, ops=8)
cuda_jit_run(): cache hit, jit: 46 us, ptx compilation: 117 us, 15 registers
2020-07-07 11:02:30 INFO  main  [PluginManager] Loading plugin "plugins/hdrfilm.so" ..
2020-07-07 11:02:30 INFO  main  [PluginManager] Loading plugin "plugins/perspective.so" ..
2020-07-07 11:02:30 INFO  main  [PluginManager] Loading plugin "plugins/diffuse.so" ..
2020-07-07 11:02:30 INFO  main  [PluginManager] Loading plugin "plugins/area.so" ..
2020-07-07 11:02:30 INFO  main  [PluginManager] Loading plugin "plugins/d65.so" ..
2020-07-07 11:02:30 INFO  main  [PluginManager] Loading plugin "plugins/obj.so" ..
2020-07-07 11:02:30 DEBUG main  [OBJMesh] Loading mesh from "cbox_luminaire.obj" ..
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_luminaire.obj": read 2 faces, 4 vertices (120 B in 0ms)
cuda_eval(): launching kernel (n=2, in=5, out=24, ops=358)
cuda_jit_run(): cache hit, jit: 866 us, ptx compilation: 196 us, 40 registers
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_luminaire.obj": computed vertex normals (took 2ms)
cuda_eval(): begin parallel group
cuda_eval(): launching kernel (n=4, in=4, out=3, ops=31)
cuda_jit_run(): cache hit, jit: 91 us, ptx compilation: 130 us, 19 registers
cuda_eval(): launching kernel (n=2, in=2, out=2, ops=111)
cuda_jit_run(): cache hit, jit: 260 us, ptx compilation: 141 us, 22 registers
cuda_eval(): launching kernel (n=1, in=0, out=1, ops=1)
cuda_jit_run(): cache hit, jit: 28 us, ptx compilation: 114 us, 7 registers
cuda_eval(): end parallel group
2020-07-07 11:02:30 DEBUG main  [OBJMesh] Loading mesh from "cbox_floor.obj" ..
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_floor.obj": read 2 faces, 4 vertices (120 B in 0ms)
cuda_eval(): launching kernel (n=2, in=5, out=24, ops=358)
cuda_jit_run(): cache hit, jit: 767 us
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_floor.obj": computed vertex normals (took 2ms)
2020-07-07 11:02:30 DEBUG main  [OBJMesh] Loading mesh from "cbox_ceiling.obj" ..
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_ceiling.obj": read 4 faces, 8 vertices (240 B in 0ms)
cuda_eval(): launching kernel (n=4, in=9, out=27, ops=389)
cuda_jit_run(): cache hit, jit: 887 us, ptx compilation: 193 us, 40 registers
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_ceiling.obj": computed vertex normals (took 2ms)
2020-07-07 11:02:30 DEBUG main  [OBJMesh] Loading mesh from "cbox_back.obj" ..
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_back.obj": read 2 faces, 4 vertices (120 B in 0ms)
cuda_eval(): begin parallel group
cuda_eval(): launching kernel (n=8, in=4, out=3, ops=31)
cuda_jit_run(): cache hit, jit: 100 us
cuda_eval(): launching kernel (n=2, in=5, out=24, ops=358)
cuda_jit_run(): cache hit, jit: 742 us
cuda_eval(): end parallel group
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_back.obj": computed vertex normals (took 2ms)
2020-07-07 11:02:30 DEBUG main  [OBJMesh] Loading mesh from "cbox_greenwall.obj" ..
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_greenwall.obj": read 2 faces, 4 vertices (120 B in 0ms)
cuda_eval(): begin parallel group
cuda_eval(): launching kernel (n=4, in=4, out=3, ops=31)
cuda_jit_run(): cache hit, jit: 91 us
cuda_eval(): launching kernel (n=2, in=5, out=24, ops=358)
cuda_jit_run(): cache hit, jit: 752 us
cuda_eval(): end parallel group
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_greenwall.obj": computed vertex normals (took 2ms)
2020-07-07 11:02:30 DEBUG main  [OBJMesh] Loading mesh from "cbox_redwall.obj" ..
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_redwall.obj": read 2 faces, 4 vertices (120 B in 0ms)
cuda_eval(): begin parallel group
cuda_eval(): launching kernel (n=4, in=4, out=3, ops=31)
cuda_jit_run(): cache hit, jit: 90 us
cuda_eval(): launching kernel (n=2, in=5, out=24, ops=358)
cuda_jit_run(): cache hit, jit: 721 us
cuda_eval(): end parallel group
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_redwall.obj": computed vertex normals (took 2ms)
2020-07-07 11:02:30 DEBUG main  [OBJMesh] Loading mesh from "cbox_smallbox.obj" ..
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_smallbox.obj": read 12 faces, 24 vertices (720 B in 0ms)
2020-07-07 11:02:30 DEBUG main  [OBJMesh] Loading mesh from "cbox_largebox.obj" ..
2020-07-07 11:02:30 DEBUG main  [OBJMesh] "cbox_largebox.obj": read 12 faces, 24 vertices (720 B in 0ms)
2020-07-07 11:02:30 INFO  main  [Scene] Building scene in OptiX ..
cuda_eval(): launching kernel (n=4, in=4, out=3, ops=31)
cuda_jit_run(): cache hit, jit: 122 us
cuda_eval(): begin parallel group
Caught a critical exception: cuda_malloc(): out of memory!
cuda_shutdown()

Additionally, here are the last few lines of the output at debug level 4:

...
cuda_eval(): allocated variable 4452 -> 0x6c9028000000 (67108864 bytes)
cuda_sync().
cuda_malloc_trim(): freed 49 arrays (2.0052 MiB device memory, 8 B unified memory, and 880 B host memory).
Caught a critical exception: cuda_malloc(): out of memory!
cuda_shutdown()
cuda_sync().
cuda_malloc_trim(): freed 168 arrays (6.5664 GiB device memory, 7.6719 KiB unified memory, and 9.2578 KiB host memory).
cuda_sync().

Here are my GPU specs from nvidia-smi:

Tue Jul  7 19:57:57 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P4000        Off  | 00000000:91:00.0  On |                  N/A |
| 50%   49C    P0    29W / 105W |   1564MiB /  8111MiB |     13%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       625      G   /usr/lib/Xorg                                898MiB |
|    0      1281      G   /usr/bin/kwin_x11                            108MiB |
|    0      1283      G   /usr/bin/plasmashell                          95MiB |
|    0      1663      G   ...uest-channel-token=17038705297215317792   455MiB |
+-----------------------------------------------------------------------------+

acdupont avatar Jul 08 '20 00:07 acdupont

Hi @acdupont,

Your GPU is simply running out of memory here. Using 16 wavelengths in Spectrum<Float, 16> will definitely increase the memory usage of the renderer.

You can also try lowering the number of samples per pixel or the resolution of the render. Alternatively, you can split the rendering process into multiple passes using the samples_per_pass property of the integrator.
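
As a rough sketch of the multi-pass idea, the snippet below builds an integrator fragment that sets samples_per_pass and counts how many passes a render would be split into. The `path` integrator type matches the plugin loaded in the log above. The helper names (`integrator_fragment`, `num_passes`) are made up for illustration, and the requirement that the total sample count be a multiple of samples_per_pass is an assumption, not something stated in this thread.

```python
import xml.etree.ElementTree as ET


def integrator_fragment(samples_per_pass: int, integrator_type: str = "path") -> str:
    """Return a scene-XML fragment for an integrator with samples_per_pass set.

    Hypothetical helper: it only produces text to paste into a scene file.
    """
    integrator = ET.Element("integrator", {"type": integrator_type})
    ET.SubElement(
        integrator,
        "integer",
        {"name": "samples_per_pass", "value": str(samples_per_pass)},
    )
    return ET.tostring(integrator, encoding="unicode")


def num_passes(spp: int, samples_per_pass: int) -> int:
    """Number of passes needed to accumulate spp samples per pixel.

    Assumption: spp has to be a multiple of samples_per_pass.
    """
    if samples_per_pass <= 0 or spp % samples_per_pass != 0:
        raise ValueError("spp should be a positive multiple of samples_per_pass")
    return spp // samples_per_pass


if __name__ == "__main__":
    print(integrator_fragment(16))
    print("passes:", num_passes(256, 16))
```

With 256 samples per pixel and 16 samples per pass, the render would be split into 16 passes, so each pass only has to hold a sixteenth of the samples in memory at once.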

Speierers avatar Jul 14 '20 06:07 Speierers

Nice, I didn't know about the samples_per_pass setting. I successfully rendered on the GPU with the following settings: 256x256 resolution, 256 samples per pixel, and 16 samples_per_pass.

But that is a low resolution and a low sample count. I experimented with other settings, such as a 1024x1024 image or 512 samples per pixel, and got a GPU allocation error. I then looked at how much memory the CPU render used with the following settings: 1024x1024 resolution and 256 samples per pixel.

I observe about 22 MB of memory usage in my system monitor.

I tried the same run on my GPU using 8 samples_per_pass, but got the memory allocation error. I have 8 GB of memory on my GPU, so why is so much more memory being used there? If 22 MB is used when running on the CPU, is it expected that running on the GPU will take more than 8 GB?
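
A back-of-the-envelope estimate may help frame the question. The sketch below assumes that the gpu_* variants process every sample of a pass at once, so that each intermediate value becomes one array with width x height x samples_per_pass entries, and that each entry is a 4-byte float as in CUDAArray<float>. Both are assumptions about the renderer's internals, and the function names are hypothetical.

```python
BYTES_PER_FLOAT = 4  # single precision, as in CUDAArray<float>


def wavefront_size(width: int, height: int, samples_per_pass: int) -> int:
    """Number of samples assumed to be in flight during one pass."""
    return width * height * samples_per_pass


def array_bytes(width: int, height: int, samples_per_pass: int) -> int:
    """Bytes for ONE float array covering the whole wavefront."""
    return wavefront_size(width, height, samples_per_pass) * BYTES_PER_FLOAT


def to_mib(num_bytes: int) -> float:
    return num_bytes / 2**20


if __name__ == "__main__":
    runs = {
        "256x256, 16 per pass (worked)": (256, 256, 16),
        "1024x1024, 8 per pass (failed)": (1024, 1024, 8),
        "256x256, 256 in one pass": (256, 256, 256),
    }
    for label, (w, h, s) in runs.items():
        size = array_bytes(w, h, s)
        print(f"{label}: {wavefront_size(w, h, s)} samples, "
              f"{size} bytes ({to_mib(size):.0f} MiB) per float array")
```

Under these assumptions the failing 1024x1024 run with 8 samples_per_pass has a wavefront 8 times larger than the working 256x256 run with 16 samples_per_pass. A single float array for 256x256 pixels at 256 samples comes to 67108864 bytes, the same size as the allocation reported in the debug level 4 output, although whether that allocation really is a full-wavefront array is a guess. A renderer that keeps many such arrays alive, with 16 of them per spectral quantity in Spectrum<Float, 16>, could plausibly reach several GiB, whereas a CPU render only needs to hold a few samples at a time.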

acdupont avatar Jul 14 '20 17:07 acdupont

Hi, I have a laptop with an i9-8950HK CPU (6 cores, 12 threads) and a 1080 GPU. When I tried gpu_rgb mode on cbox.xml, I found it was only about 2x faster than scalar_rgb mode (no Embree). Is that normal? I expected it to be 20-30x faster.

wangchi87 avatar Jul 29 '20 08:07 wangchi87

We are currently in the process of refactoring the whole codebase on top of a full rewrite of the enoki library. This will very likely improve both performance and memory consumption for gpu_* modes.

It is still going to take us a few weeks before we can release it. Stay tuned 😉

Speierers avatar Aug 10 '20 08:08 Speierers

> Hi, I have a laptop with an i9-8950HK CPU (6 cores, 12 threads) and a 1080 GPU. When I tried gpu_rgb mode on cbox.xml, I found it was only about 2x faster than scalar_rgb mode (no Embree). Is that normal? I expected it to be 20-30x faster.

I have a CPU with 6 cores and 12 threads and a GTX 1080, too. But my gpu_rgb version is even slower than scalar_rgb on cbox.xml, and it won't let me do 256 res / 256 spp without setting samples_per_pass.

RiverIntheSky avatar Aug 20 '20 21:08 RiverIntheSky