
Stabilize the GPU memory usage

Open DSaurus opened this issue 1 year ago • 1 comments

GPU memory usage during training with high-resolution settings is currently unstable, which leads to OOM errors. The main cause is the grid_prune function in Nerfacc, which produces a different number of sample points in each iteration. To stabilize GPU memory usage, I propose the following methods:

  • Limiting the maximum number of sampling points that require gradient computation.
  • Dividing the entire rendering image into multiple blocks to stabilize the peak memory usage.
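The second idea can be illustrated with a minimal sketch of the image partitioning alone. This is not the patch-renderer implementation: the helper name `split_into_blocks` is hypothetical, and treating `block_nums` as `[rows, cols]` is an assumption.

```python
def split_into_blocks(height, width, block_nums):
    """Return (y0, y1, x0, x1) pixel bounds for each block of a rows x cols grid.

    Hypothetical helper for illustration; block_nums is assumed to be [rows, cols].
    """
    rows, cols = block_nums
    # Integer boundaries that still cover the whole image when the
    # resolution is not evenly divisible by the block counts.
    ys = [height * i // rows for i in range(rows + 1)]
    xs = [width * j // cols for j in range(cols + 1)]
    return [
        (ys[i], ys[i + 1], xs[j], xs[j + 1])
        for i in range(rows)
        for j in range(cols)
    ]


blocks = split_into_blocks(512, 512, [3, 3])
print(len(blocks))  # 9
print(blocks[0])    # (0, 170, 0, 170)
```

Rendering one block at a time means only that block's rays are held in memory at once, which is what bounds the peak.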

Based on my experiment with NeRF rendering at 512x512, using 6,000,000 points for gradient calculation consumes about 21 GB of memory, while NeRF integration requires about 16 GB. Therefore, my configuration is as follows:

renderer_type: "patch-renderer"
renderer:
  mode: "interval"
  block_nums: [3,3]
  base_renderer_type: "nerf-volume-renderer"
  base_renderer:
    radius: ${system.geometry.radius}
    num_samples_per_ray: 512
    train_max_nums: 6000000

With this configuration, memory usage stabilizes at ~23 GB (21 + 16/(3*3) ≈ 22.8).
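The arithmetic behind that estimate can be written out as a small sketch. The function name `estimate_peak_memory_gb` is hypothetical, and the model (gradient memory stays fixed while integration memory is divided evenly across blocks) simply restates the formula above; it is not a measured result.

```python
def estimate_peak_memory_gb(grad_gb, integration_gb, block_nums):
    """Rough peak estimate: fixed gradient cost plus per-block integration cost.

    Hypothetical helper; assumes integration memory divides evenly across blocks.
    """
    rows, cols = block_nums
    return grad_gb + integration_gb / (rows * cols)


peak = estimate_peak_memory_gb(21.0, 16.0, [3, 3])
print(f"{peak:.1f} GB")  # 22.8 GB
```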

[Image: memory]

DSaurus avatar Jun 23 '23 17:06 DSaurus

Here are the results I obtained:

https://github.com/threestudio-project/threestudio/assets/24589363/eb4ab49e-d07b-460e-8b33-c0e3450edf41

DSaurus avatar Jun 23 '23 18:06 DSaurus

I see the current "interval" strategy for the nerf-renderer uses at most B // 2 samples for training. Do you think it could be better that we just randomly select train_max_nums samples?

thuliu-yt16 avatar Jul 18 '23 14:07 thuliu-yt16

> I see the current "interval" strategy for the nerf-renderer uses at most B // 2 samples for training. Do you think it could be better that we just randomly select train_max_nums samples?

I have also implemented a variant that randomly selects a contiguous interval containing train_max_nums samples. It yields better results than the current implementation, and I plan to update it later.
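A minimal sketch of the interval selection idea, for illustration only and not the code referred to above. The function name `pick_gradient_interval` is hypothetical, and it assumes samples are indexed 0..B-1 with gradients kept only for the chosen range.

```python
import random


def pick_gradient_interval(num_samples, train_max_nums, rng=random):
    """Pick a random contiguous index range of at most train_max_nums samples.

    Hypothetical helper: samples in [start, end) would keep gradients,
    the rest would be evaluated without gradient tracking.
    """
    if num_samples <= train_max_nums:
        # Everything fits within the budget, so keep all samples.
        return 0, num_samples
    # Choose a start so that the full interval stays inside [0, num_samples).
    start = rng.randrange(num_samples - train_max_nums + 1)
    return start, start + train_max_nums


start, end = pick_gradient_interval(10_000_000, 6_000_000)
print(end - start)  # 6000000
```

Because the interval length is fixed at train_max_nums whenever the budget is exceeded, the number of samples that require gradients has a hard upper bound regardless of how many points survive pruning.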

DSaurus avatar Jul 18 '23 15:07 DSaurus

This will be implemented in extensions.

DSaurus avatar Dec 02 '23 06:12 DSaurus