Training does not use the Nvidia GPU, is it normal?

Open mpizenberg opened this issue 2 years ago • 2 comments

When running ns-train nerfacto --data data/nerfstudio/poster, only the CPU of my laptop is getting some load. According to the task manager, the load of my Nvidia GPU is 0%. As a result, the training is quite slow, and I’m actually surprised that it works at all, considering CUDA was supposed to be required. Am I missing something? What info can I provide to help figure this one out?

OS: Windows 10
GPU: Nvidia RTX 3050 laptop
CUDA: version returned by nvcc is cuda_11.8.r11.8/compiler.31833905_0

Installation was done in a conda env as per the docs. Nerfstudio was installed with pip (nerfstudio==0.1.10).

mpizenberg avatar Nov 21 '22 16:11 mpizenberg

It is not normal for the GPU to not work. I'm actually surprised that it is working at all given that TinyCudaNN requires a GPU. Can you post the full logs before training starts?

tancik avatar Nov 21 '22 16:11 tancik
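
Alongside the logs, it can help to confirm whether PyTorch itself sees a CUDA device inside the conda env. The following is a minimal sketch, assuming PyTorch is the backend installed with nerfstudio; the `cuda_report` helper name is hypothetical and not part of nerfstudio.

```python
import importlib.util


def cuda_report() -> dict:
    """Collect basic facts about PyTorch's view of CUDA (hypothetical helper)."""
    report = {
        "torch_installed": False,
        "cuda_build": None,       # CUDA version PyTorch was built against
        "cuda_available": False,  # whether PyTorch can actually use a GPU
        "devices": [],            # names of visible CUDA devices
    }
    if importlib.util.find_spec("torch") is None:
        return report
    try:
        import torch
    except ImportError:
        # torch is present on disk but fails to import (broken install)
        return report

    report["torch_installed"] = True
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    print(cuda_report())
```

If `cuda_available` is true and the laptop GPU appears under `devices`, the install itself is probably fine and the question becomes how the load is being measured.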

Sure

image

Logs before training starts
❯ ns-train nerfacto --data data/nerfstudio/poster
[17:52:51] Using --data alias for --data.pipeline.datamanager.dataparser.data                               train.py:223
──────────────────────────────────────────────────────── Config ────────────────────────────────────────────────────────
Config(
    output_dir=WindowsPath('outputs'),
    method_name='nerfacto',
    experiment_name=None,
    timestamp='2022-11-21_175251',
    machine=MachineConfig(seed=42, num_gpus=1, num_machines=1, machine_rank=0, dist_url='auto'),
    logging=LoggingConfig(
        relative_log_dir=WindowsPath('.'),
        steps_per_log=10,
        max_buffer_size=20,
        local_writer=LocalWriterConfig(
            _target=<class 'nerfstudio.utils.writer.LocalWriter'>,
            enable=True,
            stats_to_track=(
                <EventName.ITER_TRAIN_TIME: 'Train Iter (time)'>,
                <EventName.TRAIN_RAYS_PER_SEC: 'Train Rays / Sec'>,
                <EventName.CURR_TEST_PSNR: 'Test PSNR'>,
                <EventName.VIS_RAYS_PER_SEC: 'Vis Rays / Sec'>,
                <EventName.TEST_RAYS_PER_SEC: 'Test Rays / Sec'>
            ),
            max_log_size=10
        ),
        enable_profiler=True
    ),
    viewer=ViewerConfig(
        relative_log_filename='viewer_log_filename.txt',
        start_train=True,
        zmq_port=None,
        launch_bridge_server=True,
        websocket_port=7007,
        ip_address='127.0.0.1',
        num_rays_per_chunk=32768,
        max_num_display_images=512,
        quit_on_train_completion=False
    ),
    trainer=TrainerConfig(
        steps_per_save=2000,
        steps_per_eval_batch=500,
        steps_per_eval_image=500,
        steps_per_eval_all_images=25000,
        max_num_iterations=30000,
        mixed_precision=True,
        relative_model_dir=WindowsPath('nerfstudio_models'),
        save_only_latest_checkpoint=True,
        load_dir=None,
        load_step=None,
        load_config=None
    ),
    pipeline=VanillaPipelineConfig(
        _target=<class 'nerfstudio.pipelines.base_pipeline.VanillaPipeline'>,
        datamanager=VanillaDataManagerConfig(
            _target=<class 'nerfstudio.data.datamanagers.base_datamanager.VanillaDataManager'>,
            dataparser=NerfstudioDataParserConfig(
                _target=<class 'nerfstudio.data.dataparsers.nerfstudio_dataparser.Nerfstudio'>,
                data=WindowsPath('data/nerfstudio/poster'),
                scale_factor=1.0,
                downscale_factor=None,
                scene_scale=1.0,
                orientation_method='up',
                center_poses=True,
                auto_scale_poses=True,
                train_split_percentage=0.9
            ),
            train_num_rays_per_batch=4096,
            train_num_images_to_sample_from=-1,
            train_num_times_to_repeat_images=-1,
            eval_num_rays_per_batch=4096,
            eval_num_images_to_sample_from=-1,
            eval_num_times_to_repeat_images=-1,
            eval_image_indices=(0,),
            camera_optimizer=CameraOptimizerConfig(
                _target=<class 'nerfstudio.cameras.camera_optimizers.CameraOptimizer'>,
                mode='SO3xR3',
                position_noise_std=0.0,
                orientation_noise_std=0.0,
                optimizer=AdamOptimizerConfig(
                    _target=<class 'torch.optim.adam.Adam'>,
                    lr=0.0006,
                    eps=1e-08,
                    weight_decay=0.01
                ),
                scheduler=SchedulerConfig(
                    _target=<class 'nerfstudio.engine.schedulers.ExponentialDecaySchedule'>,
                    lr_final=5e-06,
                    max_steps=10000
                ),
                param_group='camera_opt'
            )
        ),
        model=NerfactoModelConfig(
            _target=<class 'nerfstudio.models.nerfacto.NerfactoModel'>,
            enable_collider=True,
            collider_params={'near_plane': 2.0, 'far_plane': 6.0},
            loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0},
            eval_num_rays_per_chunk=32768,
            near_plane=0.05,
            far_plane=1000.0,
            background_color='last_sample',
            num_proposal_samples_per_ray=(256, 96),
            num_nerf_samples_per_ray=48,
            proposal_update_every=5,
            proposal_warmup=5000,
            num_proposal_iterations=2,
            use_same_proposal_network=False,
            proposal_net_args_list=[
                {'hidden_dim': 16, 'log2_hashmap_size': 17, 'num_levels': 5, 'max_res': 64},
                {'hidden_dim': 16, 'log2_hashmap_size': 17, 'num_levels': 5, 'max_res': 256}
            ],
            interlevel_loss_mult=1.0,
            distortion_loss_mult=0.002,
            use_proposal_weight_anneal=True,
            use_average_appearance_embedding=True,
            proposal_weights_anneal_slope=10.0,
            proposal_weights_anneal_max_num_iters=1000,
            use_single_jitter=True
        )
    ),
    optimizers={
        'proposal_networks': {
            'optimizer': AdamOptimizerConfig(
                _target=<class 'torch.optim.adam.Adam'>,
                lr=0.01,
                eps=1e-15,
                weight_decay=0
            ),
            'scheduler': None
        },
        'fields': {
            'optimizer': AdamOptimizerConfig(
                _target=<class 'torch.optim.adam.Adam'>,
                lr=0.01,
                eps=1e-15,
                weight_decay=0
            ),
            'scheduler': None
        }
    },
    vis='viewer',
    data=WindowsPath('data/nerfstudio/poster')
)
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[17:52:51] Saving config to: outputs\data\nerfstudio\poster\nerfacto\2022-11-21_175251\config.yml     base_config.py:274
[17:52:51] Saving checkpoints to:                                                                          trainer.py:90
           outputs\data\nerfstudio\poster\nerfacto\2022-11-21_175251\nerfstudio_models
Using ZMQ port: 51327

========================================================================================================================
[Public] Open the viewer at https://viewer.nerf.studio/versions/22-11-10-0/?websocket_url=ws://localhost:7007
========================================================================================================================

Sending ping to the viewer Bridge Server...
Successfully connected.
Sending ping to the viewer Bridge Server...
Successfully connected.
[WARNING] Not running eval iterations since only viewer is enabled. Use `--vis wandb` or `--vis tensorboard` to run with
eval instead.
disabled tensorboard/wandb event writers
[17:52:52] Auto image downscale factor of 2                                                 nerfstudio_dataparser.py:202
           Skipping 0 files in dataset split train.                                          nerfstudio_dataparser.py:91
           Auto image downscale factor of 2                                                 nerfstudio_dataparser.py:202
           Skipping 0 files in dataset split val.                                            nerfstudio_dataparser.py:91
Setting up training dataset...
Caching all 204 images.
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Setting up evaluation dataset...
Caching all 22 images.
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
No checkpoints to load, training from scratch
[17:53:18] Printing max of 10 lines. Set flag  `--logging.local-writer.max-log-size=0` to disable line     writer.py:388
           wrapping.
testeing testing!

mpizenberg avatar Nov 21 '22 16:11 mpizenberg

Are you sure your GPU isn't being used? Your training time per iteration is ~90ms; for reference I'm using an NVIDIA Tesla (low-end) and it takes ~250ms

Jordan-Pierce avatar Dec 01 '22 23:12 Jordan-Pierce

Are you sure your GPU isn't being used? Your training time per iteration is ~90ms; for reference I'm using an NVIDIA Titan 100 (low-end) and it takes ~250ms

So it could be Windows Task Manager not reporting GPU usage correctly? I'll double-check when I get the time, today or next week.

mpizenberg avatar Dec 02 '22 09:12 mpizenberg

vram

DiHubKi avatar Dec 05 '22 05:12 DiHubKi

@Jordan-Pierce it seems you are right: it is an issue with Windows Task Manager reporting. I just checked with nvidia-smi during training, and the GPU is at 70% load, not 0%.

image

mpizenberg avatar Dec 05 '22 09:12 mpizenberg
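
For anyone hitting the same confusion: Task Manager's default GPU graphs typically show engines such as 3D and Copy rather than CUDA compute, so a training run can look idle there. The nvidia-smi check described above can be scripted as below. This is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names are hypothetical, and the numbers in the usage note are made-up sample values, not output from this thread.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; with "nounits", memory is reported in MiB.
QUERY = "utilization.gpu,memory.used,memory.total"


def _to_int(field):
    """Convert a CSV field to int, or None for values like '[N/A]'."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_stats(csv_text):
    """Parse 'util, used, total' CSV lines into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip lines that do not match the expected shape
        util, used, total = (_to_int(p) for p in parts)
        stats.append(
            {"util_percent": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def query_gpu_stats():
    """Run nvidia-smi once; return parsed stats, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    print(query_gpu_stats())
```

Run it in a second terminal while `ns-train` is active. As an illustration of the parsing only, `parse_gpu_stats("70, 3500, 4096")` yields a single entry with `util_percent` equal to 70.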