Docker image: ns-process-data images error

Open • ahfabi opened this issue 1 year ago • 1 comments

Describe the bug

ns-process-data images no longer works in the Docker image ghcr.io/nerfstudio-project/nerfstudio:latest. It creates /processed/001/colmap, /processed/001/images, /processed/001/images_2, /processed/001/images_4 and /processed/001/images_8 in the target directory, but then there is an error, and a subsequent ns-train nerfacto --data /workspace/processed/001/ fails.

I have no name!@224b575af291://$ ns-process-data images --data /workspace/input/ --output-dir /workspace/processed/001/
Matplotlib created a temporary cache directory at /tmp/matplotlib-ox6qfncq because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[12:14:17] 🎉 Done copying images with prefix 'frame_'.                                        process_data_utils.py:340
           🎉 Done extracting COLMAP features.                                                       colmap_utils.py:137
Traceback (most recent call last):
  File "/usr/lib/python3.10/pathlib.py", line 1175, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/.local/share/nerfstudio'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/pathlib.py", line 1175, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/.local/share'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/ns-process-data", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/process_data.py", line 551, in entrypoint
    tyro.cli(Commands).main()
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/process_data/images_to_nerfstudio_dataset.py", line 114, in main
    self._run_colmap()
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/process_data/colmap_converter_to_nerfstudio_dataset.py", line 214, in _run_colmap
    colmap_utils.run_colmap(
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/process_data/colmap_utils.py", line 146, in run_colmap
    vocab_tree_filename = get_vocab_tree()
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/process_data/colmap_utils.py", line 77, in get_vocab_tree
    vocab_tree_filename.parent.mkdir(parents=True, exist_ok=True)
  File "/usr/lib/python3.10/pathlib.py", line 1179, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/usr/lib/python3.10/pathlib.py", line 1179, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/usr/lib/python3.10/pathlib.py", line 1175, in mkdir
    self._accessor.mkdir(self, mode)
PermissionError: [Errno 13] Permission denied: '/.local'
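
The failing path, /.local/share/nerfstudio, is what a home-relative data directory resolves to when HOME is /. The sketch below illustrates that with a hypothetical helper, home_relative_dirs. It is not nerfstudio's actual code, and it assumes the tools derive these paths from the home directory, as the /.config/matplotlib path in the Matplotlib warning suggests.

```python
from pathlib import Path


def home_relative_dirs(env):
    """Return the directories a tool would use if it builds them from HOME.

    `env` is a mapping such as os.environ. The layout (~/.local/share,
    ~/.config, ~/.cache) is the common default, assumed here for illustration.
    Falling back to "/" when HOME is unset is also an assumption of this sketch.
    """
    home = Path(env.get("HOME", "/"))
    return {
        "data": home / ".local" / "share" / "nerfstudio",
        "config": home / ".config" / "matplotlib",
        "cache": home / ".cache",
    }


if __name__ == "__main__":
    # With HOME=/ every directory lands directly under the filesystem root.
    for name, path in home_relative_dirs({"HOME": "/"}).items():
        print(f"{name}: {path}")
    # With a writable HOME the same logic yields usable paths.
    for name, path in home_relative_dirs({"HOME": "/workspace"}).items():
        print(f"{name}: {path}")
```

Directories directly under / are normally writable only by root, which would explain the PermissionError for a non-root UID.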

To Reproduce

Steps to reproduce the behavior: run ns-process-data images --data /workspace/input/ --output-dir /workspace/processed/001/

Expected behavior

Processing completes without error.

ahfabi, Sep 29 '24 14:09

Looks like the vocab tree is being built in the current working directory. Can you mount a random temp dir to docker and cd to the mounted path before running the command?

jkulhanek, Oct 23 '24 19:10

Don't know the real fix, but for anybody stuck: you can bypass this by running as root. Start the container with the -u 0 option.

NWalker4483, Dec 06 '24 21:12

What worked for me was to set the HOME env var to somewhere with write access.
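
One way to try this from inside the container is to point HOME at a directory the current user can write to before launching the tool. The helper below, env_with_writable_home, is hypothetical and only a sketch. MPLCONFIGDIR is the variable named in the Matplotlib warning in the log; the directory passed in is whatever writable location is available (a mounted path under /workspace would be one candidate, which is an assumption).

```python
import os
import subprocess
import sys
import tempfile


def env_with_writable_home(home_dir, base_env=None):
    """Return a copy of the environment with HOME set to a writable directory."""
    env = dict(os.environ if base_env is None else base_env)
    os.makedirs(home_dir, exist_ok=True)
    if not os.access(home_dir, os.W_OK | os.X_OK):
        raise PermissionError(f"{home_dir} is not writable by uid {os.getuid()}")
    env["HOME"] = home_dir
    # The Matplotlib warning in the log asks for this variable explicitly.
    mpl_dir = os.path.join(home_dir, ".config", "matplotlib")
    os.makedirs(mpl_dir, exist_ok=True)
    env["MPLCONFIGDIR"] = mpl_dir
    return env


if __name__ == "__main__":
    # Demonstrated with a temporary directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        env = env_with_writable_home(tmp)
        # A child process started with this environment sees the new HOME.
        result = subprocess.run(
            [sys.executable, "-c", "import os; print(os.environ['HOME'])"],
            env=env, capture_output=True, text=True, check=True,
        )
        print(result.stdout.strip())
```

The same environment could be passed to a subprocess call that launches ns-process-data or ns-train, so that home-relative paths resolve under the writable directory.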

EStorm21, Feb 05 '25 00:02

Sorry for the late reply, totally missed it and gave up.

@jkulhanek Not sure if I understand that: I created a directory /tmp/nerfstudiotmp on my host, ran nerfstudio with

sudo docker run --gpus all -u $(id -u) -v /home/ahfabi/Documents/nerfstudio/:/workspace/ -v /home/ahfabi/.cache/:/home/user/.cache/ -p 7007:7007 --rm -it --shm-size=12gb --mount type=tmpfs,destination=/tmp/nerfstudiotmp ghcr.io/nerfstudio-project/nerfstudio:latest

Inside of docker, I opened the directory /tmp/nerfstudiotmp and ran the ns-train command there. This was the output:

/tmp/nerfstudiotmp$ ns-train nerfacto --data /workspace/processed/test
Matplotlib created a temporary cache directory at /tmp/matplotlib-axljp85h because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[18:21:54] Using --data alias for --data.pipeline.datamanager.data                                          train.py:230
──────────────────────────────────────────────────────── Config ────────────────────────────────────────────────────────
TrainerConfig(
    _target=<class 'nerfstudio.engine.trainer.Trainer'>,
    output_dir=PosixPath('outputs'),
    method_name='nerfacto',
    experiment_name=None,
    project_name='nerfstudio-project',
    timestamp='2025-02-16_182154',
    machine=MachineConfig(seed=42, num_devices=1, num_machines=1, machine_rank=0, dist_url='auto', device_type='cuda'),
    logging=LoggingConfig(
        relative_log_dir=PosixPath('.'),
        steps_per_log=10,
        max_buffer_size=20,
        local_writer=LocalWriterConfig(
            _target=<class 'nerfstudio.utils.writer.LocalWriter'>,
            enable=True,
            stats_to_track=(
                <EventName.ITER_TRAIN_TIME: 'Train Iter (time)'>,
                <EventName.TRAIN_RAYS_PER_SEC: 'Train Rays / Sec'>,
                <EventName.CURR_TEST_PSNR: 'Test PSNR'>,
                <EventName.VIS_RAYS_PER_SEC: 'Vis Rays / Sec'>,
                <EventName.TEST_RAYS_PER_SEC: 'Test Rays / Sec'>,
                <EventName.ETA: 'ETA (time)'>
            ),
            max_log_size=10
        ),
        profiler='basic'
    ),
    viewer=ViewerConfig(
        relative_log_filename='viewer_log_filename.txt',
        websocket_port=None,
        websocket_port_default=7007,
        websocket_host='0.0.0.0',
        num_rays_per_chunk=32768,
        max_num_display_images=512,
        quit_on_train_completion=False,
        image_format='jpeg',
        jpeg_quality=75,
        make_share_url=False,
        camera_frustum_scale=0.1,
        default_composite_depth=True
    ),
    pipeline=VanillaPipelineConfig(
        _target=<class 'nerfstudio.pipelines.base_pipeline.VanillaPipeline'>,
        datamanager=ParallelDataManagerConfig(
            _target=<class 'nerfstudio.data.datamanagers.parallel_datamanager.ParallelDataManager'>,
            data=PosixPath('/workspace/processed/test'),
            masks_on_gpu=False,
            images_on_gpu=False,
            dataparser=NerfstudioDataParserConfig(
                _target=<class 'nerfstudio.data.dataparsers.nerfstudio_dataparser.Nerfstudio'>,
                data=PosixPath('.'),
                scale_factor=1.0,
                downscale_factor=None,
                scene_scale=1.0,
                orientation_method='up',
                center_method='poses',
                auto_scale_poses=True,
                eval_mode='fraction',
                train_split_fraction=0.9,
                eval_interval=8,
                depth_unit_scale_factor=0.001,
                mask_color=None,
                load_3D_points=False
            ),
            train_num_rays_per_batch=4096,
            train_num_images_to_sample_from=-1,
            train_num_times_to_repeat_images=-1,
            eval_num_rays_per_batch=4096,
            eval_num_images_to_sample_from=-1,
            eval_num_times_to_repeat_images=-1,
            eval_image_indices=(0,),
            collate_fn=<function nerfstudio_collate at 0x7b7950dfe200>,
            camera_res_scale_factor=1.0,
            patch_size=1,
            camera_optimizer=None,
            pixel_sampler=PixelSamplerConfig(
                _target=<class 'nerfstudio.data.pixel_samplers.PixelSampler'>,
                num_rays_per_batch=4096,
                keep_full_image=False,
                is_equirectangular=False,
                ignore_mask=False,
                fisheye_crop_radius=None,
                rejection_sample_mask=True,
                max_num_iterations=100
            ),
            num_processes=1,
            queue_size=2,
            max_thread_workers=None
        ),
        model=NerfactoModelConfig(
            _target=<class 'nerfstudio.models.nerfacto.NerfactoModel'>,
            enable_collider=True,
            collider_params={'near_plane': 2.0, 'far_plane': 6.0},
            loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0},
            eval_num_rays_per_chunk=32768,
            prompt=None,
            near_plane=0.05,
            far_plane=1000.0,
            background_color='last_sample',
            hidden_dim=64,
            hidden_dim_color=64,
            hidden_dim_transient=64,
            num_levels=16,
            base_res=16,
            max_res=2048,
            log2_hashmap_size=19,
            features_per_level=2,
            num_proposal_samples_per_ray=(256, 96),
            num_nerf_samples_per_ray=48,
            proposal_update_every=5,
            proposal_warmup=5000,
            num_proposal_iterations=2,
            use_same_proposal_network=False,
            proposal_net_args_list=[
                {'hidden_dim': 16, 'log2_hashmap_size': 17, 'num_levels': 5, 'max_res': 128, 'use_linear': False},
                {'hidden_dim': 16, 'log2_hashmap_size': 17, 'num_levels': 5, 'max_res': 256, 'use_linear': False}
            ],
            proposal_initial_sampler='piecewise',
            interlevel_loss_mult=1.0,
            distortion_loss_mult=0.002,
            orientation_loss_mult=0.0001,
            pred_normal_loss_mult=0.001,
            use_proposal_weight_anneal=True,
            use_appearance_embedding=True,
            use_average_appearance_embedding=True,
            proposal_weights_anneal_slope=10.0,
            proposal_weights_anneal_max_num_iters=1000,
            use_single_jitter=True,
            predict_normals=False,
            disable_scene_contraction=False,
            use_gradient_scaling=False,
            implementation='tcnn',
            appearance_embed_dim=32,
            average_init_density=0.01,
            camera_optimizer=CameraOptimizerConfig(
                _target=<class 'nerfstudio.cameras.camera_optimizers.CameraOptimizer'>,
                mode='SO3xR3',
                trans_l2_penalty=0.01,
                rot_l2_penalty=0.001,
                optimizer=None,
                scheduler=None
            )
        )
    ),
    optimizers={
        'proposal_networks': {
            'optimizer': AdamOptimizerConfig(
                _target=<class 'torch.optim.adam.Adam'>,
                lr=0.01,
                eps=1e-15,
                max_norm=None,
                weight_decay=0
            ),
            'scheduler': ExponentialDecaySchedulerConfig(
                _target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
                lr_pre_warmup=1e-08,
                lr_final=0.0001,
                warmup_steps=0,
                max_steps=200000,
                ramp='cosine'
            )
        },
        'fields': {
            'optimizer': AdamOptimizerConfig(
                _target=<class 'torch.optim.adam.Adam'>,
                lr=0.01,
                eps=1e-15,
                max_norm=None,
                weight_decay=0
            ),
            'scheduler': ExponentialDecaySchedulerConfig(
                _target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
                lr_pre_warmup=1e-08,
                lr_final=0.0001,
                warmup_steps=0,
                max_steps=200000,
                ramp='cosine'
            )
        },
        'camera_opt': {
            'optimizer': AdamOptimizerConfig(
                _target=<class 'torch.optim.adam.Adam'>,
                lr=0.001,
                eps=1e-15,
                max_norm=None,
                weight_decay=0
            ),
            'scheduler': ExponentialDecaySchedulerConfig(
                _target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
                lr_pre_warmup=1e-08,
                lr_final=0.0001,
                warmup_steps=0,
                max_steps=5000,
                ramp='cosine'
            )
        }
    },
    vis='viewer',
    data=PosixPath('/workspace/processed/test'),
    prompt=None,
    relative_model_dir=PosixPath('nerfstudio_models'),
    load_scheduler=True,
    steps_per_save=2000,
    steps_per_eval_batch=500,
    steps_per_eval_image=500,
    steps_per_eval_all_images=25000,
    max_num_iterations=30000,
    mixed_precision=True,
    use_grad_scaler=False,
    save_only_latest_checkpoint=True,
    load_dir=None,
    load_step=None,
    load_config=None,
    load_checkpoint=None,
    log_gradients=False,
    gradient_accumulation_steps={},
    start_paused=False
)
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
           Saving config to: outputs/test/nerfacto/2025-02-16_182154/config.yml                 experiment_config.py:136
           Saving checkpoints to: outputs/test/nerfacto/2025-02-16_182154/nerfstudio_models               trainer.py:142
           Auto image downscale factor of 2                                                 nerfstudio_dataparser.py:484
Started threads
Setting up evaluation dataset...
Caching all 17 images.
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Traceback (most recent call last):
  File "/usr/local/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 262, in entrypoint
    main(
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 247, in main
    launch(
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 189, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/scripts/train.py", line 99, in train_loop
    trainer.setup()
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/engine/trainer.py", line 158, in setup
    self.pipeline = self.config.pipeline.setup(
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/configs/base_config.py", line 53, in setup
    return self._target(self, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/pipelines/base_pipeline.py", line 270, in __init__
    self._model = config.model.setup(
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/configs/base_config.py", line 53, in setup
    return self._target(self, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/models/base_model.py", line 85, in __init__
    self.populate_modules()  # populate the modules
  File "/usr/local/lib/python3.10/dist-packages/nerfstudio/models/nerfacto.py", line 252, in populate_modules
    self.lpips = LearnedPerceptualImagePatchSimilarity(normalize=True)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/image/lpip.py", line 121, in __init__
    self.net = _NoTrainLpips(net=net_type)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/functional/image/lpips.py", line 305, in __init__
    self.net = net_type(pretrained=not self.pnet_rand, requires_grad=self.pnet_tune)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/functional/image/lpips.py", line 110, in __init__
    alexnet_pretrained_features = _get_net("alexnet", pretrained)
  File "/usr/local/lib/python3.10/dist-packages/torchmetrics/functional/image/lpips.py", line 57, in _get_net
    pretrained_features = getattr(tv, net)(weights=getattr(tv, _weight_map[net]).IMAGENET1K_V1).features
  File "/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py", line 142, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py", line 228, in inner_wrapper
    return builder(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchvision/models/alexnet.py", line 117, in alexnet
    model.load_state_dict(weights.get_state_dict(progress=progress, check_hash=True))
  File "/usr/local/lib/python3.10/dist-packages/torchvision/models/_api.py", line 90, in get_state_dict
    return load_state_dict_from_url(self.url, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/hub.py", line 746, in load_state_dict_from_url
    os.makedirs(model_dir)
  File "/usr/lib/python3.10/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.10/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.10/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/.cache'

Why can ns-process-data create files and directories but ns-train can't? Especially since it worked in the past. The workspace is in my host's home directory and docker has to be run as root anyway.
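
Both tools run as the same user; what differs in the two tracebacks is where they try to write. The output directory is under the mounted /workspace, while the failing calls try to create /.local and /.cache directly under the root directory. The sketch below uses a hypothetical helper, can_create, to check whether a path could be created by looking at its nearest existing ancestor. It is a diagnostic aid under those assumptions, not part of nerfstudio.

```python
import os
import tempfile
from pathlib import Path


def nearest_existing_ancestor(path):
    """Walk upwards from `path` until something that exists is found."""
    current = Path(os.path.abspath(path))
    while not current.exists():
        current = current.parent  # terminates at "/", which always exists
    return current


def can_create(path):
    """Return True if the current user could create `path` and its parents.

    Creating a directory needs write and search permission on the nearest
    ancestor that already exists; that is the point where makedirs fails.
    """
    ancestor = nearest_existing_ancestor(path)
    return ancestor.is_dir() and os.access(ancestor, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # Paths taken from the tracebacks and commands in this thread.
    for candidate in ("/.cache/torch", "/.local/share/nerfstudio", "/workspace/processed"):
        print(f"{candidate}: creatable={can_create(candidate)}")
    # A path under a fresh temporary directory, for comparison.
    with tempfile.TemporaryDirectory() as tmp:
        print(f"{tmp}/a/b: creatable={can_create(os.path.join(tmp, 'a', 'b'))}")
```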

ahfabi, Feb 16 '25 19:02

@ahfabi Hi, I got the exact same issue and resolved it as follows:

1. When starting a container, mount /.cache/ and /.local/ from directories on the host.

# The two -v lines that mount /.cache/ and /.local/ are the important ones.
docker run --gpus all \
            -u $(id -u) \
            -v /home/Codes/nerfstudio/data:/workspace/ \
            -v /home/Codes/nerfstudio/.cache/:/.cache/ \
            -v /home/Codes/nerfstudio/.local/:/.local/ \
            -p 7007:7007 \
            --rm \
            -it \
            --shm-size=12gb \
            nerfstudio

2. In the host's terminal (not in the Docker container), change the file mode like this:

chmod -R 777 /home/Codes/nerfstudio/.cache/
chmod -R 777 /home/Codes/nerfstudio/.local/

mikigom, Mar 04 '25 08:03

@mikigom Thanks for the suggestion! Since I am inexperienced with Docker, is it normal that you have to circumvent security with 777 or sudo? I was assuming an error in my command or some misconfiguration.

ahfabi, Mar 05 '25 03:03

@ahfabi As far as I know, generally speaking, you do not have to give everything 777 permissions or run Docker as root in most cases. By default, many Docker images are designed to run as root inside the container. However, the tutorial suggests using the option -u $(id -u) when running the container, meaning the container runs with the host user's UID. This mismatch in UID/GID can cause permission issues when the container tries to write to directories (e.g. /.cache/, /.local/) that are only writable by root in the container filesystem.

In short, the more general solution is to match the UID and GID on the host and in the container more precisely. For example, use -u 1000:1000 if your host user is uid=1000 and gid=1000.
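
To see whether such a mismatch is in play, the numeric IDs of the running process can be compared with the owner of the directory it needs to write to. ownership_report below is a hypothetical helper for that check, written as a minimal sketch; it only reads metadata and changes nothing.

```python
import os
import tempfile


def ownership_report(path):
    """Compare the current process identity with the owner of `path`."""
    st = os.stat(path)
    return {
        "process_uid": os.getuid(),
        "process_gid": os.getgid(),
        "owner_uid": st.st_uid,
        "owner_gid": st.st_gid,
        "uid_matches": st.st_uid == os.getuid(),
        # os.access also accounts for group and "other" permission bits.
        "writable": os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    # "/" is where /.cache and /.local would have to be created.
    print("/", ownership_report("/"))
    # A directory created by this process is owned by its own UID.
    with tempfile.TemporaryDirectory() as tmp:
        print(tmp, ownership_report(tmp))
```

Run inside the container, a report for / with uid_matches and writable both False would be consistent with the PermissionError shown in the tracebacks.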

Related Link: https://stackoverflow.com/questions/51596279/docker-permission-denied-in-container

mikigom, Mar 05 '25 08:03

So everybody should have encountered this issue. Then why is the docker command from the tutorial not adjusted so it just works? I tried it with -u $(id -u):$(id -g), but the PermissionError is still there, and I also get: groups: cannot find name for group ID
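
The groups message concerns name lookup rather than write permission: the numeric GID has no entry in the container's group database, in the same way that the "I have no name!" prompt earlier in this thread reflects a UID without a passwd entry. A minimal sketch of that lookup, using Python's standard pwd and grp modules (identity_names is a hypothetical helper):

```python
import grp
import os
import pwd


def identity_names():
    """Look up names for the current UID and GID; None means no entry exists."""
    uid, gid = os.getuid(), os.getgid()
    try:
        user = pwd.getpwuid(uid).pw_name
    except KeyError:
        user = None  # no passwd entry for this UID
    try:
        group = grp.getgrgid(gid).gr_name
    except KeyError:
        group = None  # no group entry for this GID
    return {"uid": uid, "user": user, "gid": gid, "group": group}


if __name__ == "__main__":
    print(identity_names())
```

A None for user or group here does not by itself block file access, since permission checks use the numeric IDs.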

"that are only writable by root in the container filesystem."

It also happens with the outputs directory - I assume that should be in the working directory and writable by the normal host user.

The weird thing is that I remember it working without these things, why was that?

ahfabi, Mar 07 '25 01:03