kaolin-wisp
Larger RAM usage with the new config system
I tried to run main_nerf.py on the main branch, but it stopped abruptly, printing only the single word Killed. From what I found online, this is presumably caused by a RAM shortage. I checked the memory usage, and it hit the limit right before the app stopped. Do you have any idea how to deal with this issue?
I followed all the installation procedures, including requirements_app.txt. main_nerf.py in the stable branch works without any problems. So, if the config system is the only major change between the main and stable branches, the issue should be caused by the new config system. I suppose you can reproduce the larger RAM usage in your environment.
I installed pyopengl_accelerate separately because a message saying the module was missing appeared the first time I ran the stable main_nerf.py. Apart from that, the conda env should be clean for running wisp apps.
I know the easiest solution is increasing RAM. But the stable config system works fine even with limited RAM. It would be great if I could also use the new one on the same machine since it looks much cleaner.
Thanks in advance!
Machine spec
- OS: Ubuntu 22.04 on WSL2 on Windows 11 22H2
- RAM: 16 GB (approx. 8 GB for WSL2)
- GPU: RTX 4070 Ti
- CUDA: 11.7
- Torch: 1.13.1
- Kaolin: 0.13.0
Reproduction steps
- Install Kaolin Wisp with requirements_app.txt
- pip install pyopengl_accelerate
- python app/nerf/main_nerf.py --dataset-path /path/to/lego/ --config app/nerf/configs/nerf_hash.yaml
Hi @barikata1984! Sorry for the delayed reply here. I suspect this is due to a configuration change (we set "high quality" as the new default): https://github.com/NVIDIAGameWorks/kaolin-wisp/commit/99639ae60de4d1c6f4f721e3b6d1004e258afa5b#diff-0e84d1aed551f592a75f92bacc6eed1545bdaeb03042d1fb2f6aa17343e5db8bR46
Can you try with a reduced samples-per-ray count?
python app/nerf/main_nerf.py --dataset-path /path/to/lego/ --config app/nerf/configs/nerf_hash.yaml --tracer.num_steps 512
I've also tracked all config updates here: https://kaolin-wisp.readthedocs.io/en/latest/pages/config_system.html#converting-older-configs-up-to-wisp-v1-0-2
Hi @orperel,
Thanks for your response.
I reduced the samples-per-ray count from 512 down to 16, halving it each time, but the process was still killed.
It looks like something goes wrong when running train_dataset = instantiate(cfg.dataset, transform=dataset_transform) in main_nerf.py.
To confirm this, I added the following print statements (the lines marked with +) to app/nerf/main_nerf.py:
+ print("Instantiating dataset_transform")
dataset_transform = instantiate(cfg.dataset_transform) # SampleRays creates batches of rays from the dataset
+ print("Instantiating train_dataset")
train_dataset = instantiate(cfg.dataset, transform=dataset_transform) # A Multiview dataset
and to wisp/config/utils.py:
+ print("================= Flag 0 =================")
instance = instantiate(config, **overriden_args)
+ print("================= Flag 1 =================")
The output is:
$ python app/nerf/main_nerf.py --dataset-path /path/to/lego/ --config app/nerf/configs/nerf_hash.yaml --tracer.num_steps 16
constructor: OctreeAS.make_dense
level: 7
constructor: HashGrid.from_geometric
feature_dim: 2
num_lods: 16
multiscale_type: cat
feature_std: 0.01
feature_bias: 0.0
codebook_bitwidth: 19
min_grid_res: 16
max_grid_res: 2048
constructor: NeuralRadianceField
pos_embedder: none
view_embedder: positional
pos_multires: 10
view_multires: 4
position_input: False
activation_type: relu
layer_type: linear
hidden_dim: 64
num_layers: 1
prune_density_decay: 0.6
prune_min_density: 2.956033378250884
constructor: PackedRFTracer
raymarch_type: ray
num_steps: 16
step_size: 1.0
bg_color: black
constructor: NeRFSyntheticDataset
dataset_path: ../nerf_data/lego/
split: train
bg_color: white
mip: 0
dataset_num_workers: -1
transform: None
constructor: SampleRays
num_samples: 4096
constructor: RMSprop
lr: 0.001
alpha: 0.99
eps: 1e-08
weight_decay: 0.0
momentum: 0.0
batch_size: 1
num_workers: 0
exp_name: nerf-hash
mode: train
max_epochs: 100
save_every: -1
save_as_new: False
model_format: full
render_every: -1
valid_every: -1
enable_amp: True
profile_nvtx: True
grid_lr_weight: 100.0
prune_every: 100
random_lod: False
rgb_lambda: 1.0
constructor: _Tensorboard
log_dir: _results/logs/runs
constructor: _WandB
project: wisp-nerf
entity: None
run_name: None
job_type: train
sync_tensorboard: True
constructor: OfflineRenderer
render_res: (1024, 1024)
render_batch: 10000
shading_mode: rb
matcap_path: ./data/matcap/Pearl.png
shadow: False
ao: False
perf: False
camera_origin: (-3.0, 0.65, -3.0)
camera_lookat: (0.0, 0.0, 0.0)
camera_fov: 30.0
camera_clamp: (0.0, 10.0)
viz360_num_angles: 20
viz360_radius: 3.0
viz360_render_all_lods: False
enable_tensorboard: True
enable_wandb: False
log_dir: _results/logs/runs
log_level: 20
pretrained: None
device: cuda
interactive: True
Instantiating dataset_transform
================= Flag 1 =================
================= Flag 2 =================
Instantiating train_dataset
================= Flag 1 =================
loading data: 100%|████████████████████████████████████████████████████████| 100/100 [00:03<00:00, 30.43it/s]
/home/atsushi/miniconda3/envs/wisp/lib/python3.9/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
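One way to narrow down which step exhausts memory is to print the process's peak resident memory next to flags like the ones above. This is a minimal sketch, assuming Linux or WSL2, where `ru_maxrss` is reported in kilobytes. `peak_rss_mib` and `log_peak_rss` are hypothetical helper names and are not part of wisp.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak_rss(label):
    """Print a labeled peak-RSS reading and return the value in MiB."""
    value = peak_rss_mib()
    print(f"[{label}] peak RSS: {value:.1f} MiB", flush=True)
    return value


if __name__ == "__main__":
    log_peak_rss("before allocation")
    # Stand-in for a large dataset load (hypothetical 64 MiB buffer).
    buffer = b"\x01" * (64 * 1024 * 1024)
    log_peak_rss("after allocation")
```

Calling `log_peak_rss` before and after each `instantiate(...)` call in app/nerf/main_nerf.py would show how much each step adds, up to the last reading printed before the process is killed.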
Do you have any other ideas for resolving this issue?
Hi @barikata1984 thanks for this bug report.
I ran some memory profiling, and the main branch does indeed use upwards of 14 GB of resident memory at peak, which really shouldn't be the case.
I dug into the issue a bit and I fixed some benign issues in: https://github.com/NVIDIAGameWorks/kaolin-wisp/pull/164
With that fix, resident memory is down to about 8 GB according to my profiling (a reduction of roughly 6 GB). If you want further savings, I would pass in --valid-every -1 to disable validation, since the validation dataset takes around 3 GB of memory.
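For a rough sense of why a single dataset split can occupy gigabytes of host memory, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration rather than a statement about how wisp stores its data: 100 images per split (the number shown by the loading bar in the log above), a resolution of 800x800, and 9 float32 values cached per pixel (ray origin, ray direction, RGB). `estimate_ray_buffer_gib` is a hypothetical helper.

```python
def estimate_ray_buffer_gib(num_images, height, width,
                            floats_per_pixel=9, bytes_per_float=4):
    """Estimate the size in GiB of per-pixel ray data cached for one split.

    floats_per_pixel=9 assumes origin (3) + direction (3) + RGB (3);
    this is an illustrative guess, not wisp's actual memory layout.
    """
    total_bytes = (num_images * height * width
                   * floats_per_pixel * bytes_per_float)
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumed values: 100 images at 800x800 per split.
    gib = estimate_ray_buffer_gib(num_images=100, height=800, width=800)
    print(f"estimated buffer size: {gib:.2f} GiB")
```

Under these assumptions the estimate comes to a little over 2 GiB per split, the same order of magnitude as the figure from profiling, so dropping the validation split is a plausible way to save a few gigabytes.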
Let me know if this works for you!
Hi @tovacinni, thanks a lot for the solution! As you suggested, --valid-every -1 worked, while runs with validation enabled were still killed due to the RAM shortage. I will try again on a different PC with sufficient RAM.
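Before launching on another machine, it may help to check how much memory the kernel reports as available. This is a minimal sketch, assuming a Linux or WSL2 system that exposes /proc/meminfo. `parse_meminfo_kib`, `available_gib`, and the 8 GiB threshold (taken from the profiling figure above) are illustrative choices and are not part of wisp.

```python
def parse_meminfo_kib(text):
    """Parse /proc/meminfo-style text into a dict of field name -> KiB."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def available_gib(meminfo_path="/proc/meminfo"):
    """Return the kernel's MemAvailable estimate in GiB."""
    with open(meminfo_path) as f:
        fields = parse_meminfo_kib(f.read())
    # Values are in KiB, so dividing by 2**20 converts to GiB.
    return fields["MemAvailable"] / 2**20


if __name__ == "__main__":
    required_gib = 8.0  # assumed requirement, based on the profiling above
    free = available_gib()
    status = "enough" if free >= required_gib else "NOT enough"
    print(f"available: {free:.1f} GiB, required: {required_gib:.1f} GiB "
          f"({status})")
```

Note that under WSL2 the value reflects the memory assigned to the WSL2 virtual machine, not the total RAM of the Windows host.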