
Requires 800 gigabytes of video memory

Open · Mrguanglei opened this issue 5 months ago · 2 comments

I finished rendering, and when I was ready to train the NeRF I used only 20 sets of data, yet it turned out to need an enormous amount of GPU memory. What happened? I need your help.

(instantmesh1) mrguanglei@guanglei:~/3D/InstantMesh$ python train.py --base configs/instant-nerf-large-train.yaml --gpus 0 --num_nodes 1
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
Seed set to 42
Running on GPUs 0
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
Some weights of ViTModel were not initialized from the model checkpoint at facebook/dino-vitb16 and are newly initialized: ['encoder.layer.10.adaLN_modulation.1.weight', 'encoder.layer.9.adaLN_modulation.1.bias', 'encoder.layer.5.adaLN_modulation.1.weight', 'encoder.layer.2.adaLN_modulation.1.weight', 'encoder.layer.3.adaLN_modulation.1.bias', 'encoder.layer.10.adaLN_modulation.1.bias', 'encoder.layer.2.adaLN_modulation.1.bias', 'encoder.layer.11.adaLN_modulation.1.weight', 'encoder.layer.0.adaLN_modulation.1.weight', 'encoder.layer.11.adaLN_modulation.1.bias', 'encoder.layer.6.adaLN_modulation.1.weight', 'encoder.layer.7.adaLN_modulation.1.bias', 'encoder.layer.5.adaLN_modulation.1.bias', 'encoder.layer.7.adaLN_modulation.1.weight', 'encoder.layer.6.adaLN_modulation.1.bias', 'encoder.layer.0.adaLN_modulation.1.bias', 'encoder.layer.1.adaLN_modulation.1.bias', 'encoder.layer.3.adaLN_modulation.1.weight', 'encoder.layer.9.adaLN_modulation.1.weight', 'encoder.layer.8.adaLN_modulation.1.bias', 'encoder.layer.8.adaLN_modulation.1.weight', 'encoder.layer.4.adaLN_modulation.1.weight', 'encoder.layer.1.adaLN_modulation.1.weight', 'encoder.layer.4.adaLN_modulation.1.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=VGG16_Weights.IMAGENET1K_V1. You can also use weights=VGG16_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
============= length of dataset 12 =============
============= length of dataset 11 =============
accumulate_grad_batches = 1
++++ NOT USING LR SCALING ++++
Setting learning rate to 4.00e-04
[rank: 0] Seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

You are using a CUDA device ('NVIDIA GeForce RTX 4060 Ti') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
============= length of dataset 12 =============
============= length of dataset 11 =============
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Project config
model:
  base_learning_rate: 0.0004
  target: src.model.MVRecon
  params:
    input_size: 320
    render_size: 192
    lrm_generator_config:
      target: src.models.lrm.InstantNeRF
      params:
        encoder_feat_dim: 768
        encoder_freeze: false
        encoder_model_name: facebook/dino-vitb16
        transformer_dim: 512
        transformer_layers: 8
        transformer_heads: 8
        triplane_low_res: 32
        triplane_high_res: 64
        triplane_dim: 80
        rendering_samples_per_ray: 128
data:
  target: src.data.objaverse.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 4
    train:
      target: src.data.objaverse.ObjaverseData
      params:
        root_dir: /home/mrguanglei/3D/InstantMesh/data
        meta_fname: valid_paths.json
        input_image_dir: rendering_random_32views
        target_image_dir: rendering_random_32views
        input_view_num: 6
        target_view_num: 4
        total_view_n: 32
        fov: 50
        camera_rotation: true
        validation: false
    validation:
      target: src.data.objaverse.ValidationData
      params:
        root_dir: /home/mrguanglei/3D/InstantMesh/data/vaild
        input_view_num: 6
        input_image_size: 320
        fov: 30
lightning:
  modelcheckpoint:
    params:
      every_n_train_steps: 1000
      save_top_k: -1
      save_last: true
  callbacks: {}
  trainer:
    benchmark: true
    max_epochs: -1
    gradient_clip_val: 1.0
    val_check_interval: 1000
    num_sanity_val_steps: 0
    accumulate_grad_batches: 1
    check_val_every_n_epoch: null
    accelerator: gpu
    devices: 1

  | Name          | Type                                  | Params
0 | lrm_generator | InstantNeRF                           | 152 M
1 | lpips         | LearnedPerceptualImagePatchSimilarity | 14.7 M

152 M     Trainable params
14.7 M    Non-trainable params
166 M     Total params
667.701   Total estimated model params size (MB)
Epoch 0: |          | 0/? [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/mrguanglei/3D/InstantMesh/train.py", line 284, in <module>
[rank0]:     trainer.fit(model, data)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
[rank0]:     results = self._run_stage()
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
[rank0]:     self.fit_loop.run()
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
[rank0]:     self.advance()
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
[rank0]:     self.epoch_loop.run(self._data_fetcher)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
[rank0]:     self.advance(data_fetcher)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 240, in advance
[rank0]:     batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 187, in run
[rank0]:     self._optimizer_step(batch_idx, closure)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 265, in _optimizer_step
[rank0]:     call._call_lightning_module_hook(
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1282, in optimizer_step
[rank0]:     optimizer.step(closure=optimizer_closure)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 151, in step
[rank0]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 264, in optimizer_step
[rank0]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in optimizer_step
[rank0]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision.py", line 117, in optimizer_step
[rank0]:     return optimizer.step(closure=closure, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank0]:     return wrapped(*args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank0]:     ret = func(self, *args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/optim/adamw.py", line 165, in step
[rank0]:     loss = closure()
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision.py", line 104, in _wrap_closure
[rank0]:     closure_result = closure()
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 140, in __call__
[rank0]:     self._result = self.closure(*args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 126, in closure
[rank0]:     step_output = self._step_fn()
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 315, in _training_step
[rank0]:     training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 381, in training_step
[rank0]:     return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 633, in __call__
[rank0]:     wrapper_output = wrapper_module(*args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 626, in wrapped_forward
[rank0]:     out = method(*_args, **_kwargs)
[rank0]:   File "/home/mrguanglei/3D/InstantMesh/src/model.py", line 196, in training_step
[rank0]:     lrm_generator_input, render_gt = self.prepare_batch_data(batch)
[rank0]:   File "/home/mrguanglei/3D/InstantMesh/src/model.py", line 84, in prepare_batch_data
[rank0]:     target_depths = v2.functional.resize(
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/transforms/v2/functional/_geometry.py", line 189, in resize
[rank0]:     return kernel(inpt, size=size, interpolation=interpolation, max_size=max_size, antialias=antialias)
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/transforms/v2/functional/_geometry.py", line 254, in resize_image
[rank0]:     image = interpolate(
[rank0]:   File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/functional.py", line 4028, in interpolate
[rank0]:     return torch._C._nn.upsample_nearest2d(input, output_size, scale_factors)
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 844.10 GiB. GPU
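
The 844.10 GiB allocation is requested inside the resize of target_depths in prepare_batch_data, so before concluding that training really needs that much VRAM it is worth checking what tensor actually reaches that call. Below is a minimal diagnostic sketch, not code from the InstantMesh repo: the helper name and the example depth-batch shape are made up for illustration, and only render_size = 192 is taken from the config above.

import math

import torch
from torchvision.transforms import v2


def estimate_resize_output_bytes(t: torch.Tensor, size: tuple[int, int]) -> int:
    # The resized output keeps every leading (batch/channel) dimension and
    # only replaces the last two spatial dimensions with `size`.
    leading = math.prod(t.shape[:-2])
    return leading * size[0] * size[1] * t.element_size()


render_size = 192  # from instant-nerf-large-train.yaml above

# Hypothetical depth batch: batch_size=1, target_view_num=4, 1 channel, 512x512
# renders. The real shape depends on the rendering script, so adjust this to
# match what prepare_batch_data actually receives.
target_depths = torch.rand(1 * 4, 1, 512, 512)

needed = estimate_resize_output_bytes(target_depths, (render_size, render_size))
print(f"shape={tuple(target_depths.shape)}, dtype={target_depths.dtype}, "
      f"resize would allocate ~{needed / 2**30:.4f} GiB")
# A 4-view float32 depth batch at 192x192 is well under 1 MiB. If this estimate
# comes out in the hundreds of GiB instead, the tensor reaching resize has an
# unexpected shape (e.g. views or pixels folded into the leading dimensions),
# which points at the data pipeline rather than a genuine 800 GiB VRAM need.

resized = v2.functional.resize(target_depths, [render_size, render_size])
print(tuple(resized.shape))  # expected: (4, 1, 192, 192)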

Mrguanglei · Sep 03 '24 05:09