BrokenPipeError occurs with wandb option

Open ZirongChan opened this issue 2 years ago • 5 comments

Thx for the great work.

I was running the toy_experiment with the lego data, except that I ran COLMAP and then the data generation scripts on a dataset which I downloaded a long time ago for NeRF. So the data I used might be different from the one provided in this repo.

The toy experiment runs well, although the background remains. The issue occurs when I use the --wandb option. The data loaded, and the communication with the wandb website was fine too. Everything went fine until the 4th epoch of training, when an error was raised: "BrokenPipeError: [Errno 32] Broken pipe". It traced back to torch.distributed.elastic.multiprocessing.errors.ChildFailedError.
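The timing of the failure can be checked against the settings in the log posted later in this thread (Train dataset length: 61, batch_size: 2, wandb_scalar_iter: 100, one GPU). The sketch below is only an arithmetic check: `epoch_of_iteration` is a hypothetical helper, and whether the data loader drops the last incomplete batch is an assumption, so both cases are computed.

```python
import math


def epoch_of_iteration(iteration: int, num_samples: int, batch_size: int,
                       drop_last: bool = False) -> int:
    """Return the 1-based epoch that contains the given 1-based iteration."""
    if drop_last:
        iters_per_epoch = num_samples // batch_size
    else:
        iters_per_epoch = math.ceil(num_samples / batch_size)
    return math.ceil(iteration / iters_per_epoch)


# Values copied from the posted log; drop_last is an assumption either way.
TRAIN_SAMPLES = 61
BATCH_SIZE = 2
WANDB_SCALAR_ITER = 100

for drop_last in (False, True):
    epoch = epoch_of_iteration(WANDB_SCALAR_ITER, TRAIN_SAMPLES, BATCH_SIZE, drop_last)
    print(f"drop_last={drop_last}: iteration {WANDB_SCALAR_ITER} is in epoch {epoch}")
```

In both cases iteration 100 lands in epoch 4, which is consistent with the crash happening at the first training-scalar call to wandb.log rather than at some arbitrary point in training.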

Does anyone have the same issue?
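As a stopgap while the root cause is investigated, the logging call can be guarded so that a broken pipe does not end the training run. This is a minimal sketch, not the repository's code: `safe_log` is a hypothetical helper, and the two stand-in loggers exist only to make the example runnable without wandb installed. In the trainer, the first argument would presumably be `wandb.log`, which the traceback shows being called as `wandb.log(scalars, step=...)`.

```python
import logging
from typing import Any, Callable, Mapping

logger = logging.getLogger(__name__)


def safe_log(log_fn: Callable[..., Any], scalars: Mapping[str, float], step: int) -> bool:
    """Call log_fn(scalars, step=step); return False instead of raising on a broken pipe."""
    try:
        log_fn(dict(scalars), step=step)
        return True
    except (BrokenPipeError, ConnectionResetError) as exc:
        # Training continues, but this batch of metrics is lost.
        logger.warning("Skipping metrics at step %d: %s", step, exc)
        return False


# Stand-ins for a working and a failing logging backend (hypothetical).
received = []


def healthy_logger(data, step):
    received.append((step, data))


def broken_logger(data, step):
    raise BrokenPipeError(32, "Broken pipe")


print(safe_log(healthy_logger, {"loss": 0.5}, step=100))
print(safe_log(broken_logger, {"loss": 0.5}, step=200))
```

This only keeps the training loop alive. If the socket to the wandb background process is really gone, later calls will most likely keep failing, so the underlying cause still needs to be found.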

Aug 28 '23 10:08 ZirongChan

Hi @ZirongChan could you post the full error log? Thanks!

Aug 28 '23 16:08 chenhsuanlin

thx for your reply, @chenhsuanlin

Of course, the following is the log output in terminal:

torchrun --nproc_per_node=1 train.py --logdir=logs/nerf_synthesis/lego_wandb --config=projects/neuralangelo/configs/custom/lego.yaml --show_pbar --wandb
Training with 1 GPUs.
Using random seed 0
Make folder logs/nerf_synthesis/lego_wandb

checkpoint:
- save_epoch: 9999999999
- save_iter: 20000
- save_latest_iter: 9999999999
- save_period: 9999999999
- strict_resume: True
cudnn:
- benchmark: True
- deterministic: False
data:
- name: dummy
- num_images: None
- num_workers: 4
- preload: True
- readjust:
  - center: [0.0, 0.0, 0.0]
  - scale: 1.0
- root: ./dataset/nerf_synthesis/lego
- train:
  - batch_size: 2
  - image_size: [800, 800]
  - subset: None
- type: projects.neuralangelo.data
- use_multi_epoch_loader: True
- val:
  - batch_size: 2
  - image_size: [300, 300]
  - max_viz_samples: 16
  - subset: 4
image_save_iter: 9999999999
inference_args:
local_rank: 0
logdir: logs/nerf_synthesis/lego_wandb
logging_iter: 9999999999999
max_epoch: 9999999999
max_iter: 500000
metrics_epoch: None
metrics_iter: None
model:
- appear_embed:
  - dim: 8
  - enabled: False
- background:
  - enabled: True
  - encoding:
    - levels: 10
    - type: fourier
  - encoding_view:
    - levels: 3
    - type: spherical
  - mlp:
    - activ: relu
    - activ_density: softplus
    - activ_density_params:
    - activ_params:
    - hidden_dim: 256
    - hidden_dim_rgb: 128
    - num_layers: 8
    - num_layers_rgb: 2
    - skip: [4]
    - skip_rgb: []
  - view_dep: True
  - white: False
- object:
  - rgb:
    - encoding_view:
      - levels: 3
      - type: spherical
    - mlp:
      - activ: relu_
      - activ_params:
      - hidden_dim: 256
      - num_layers: 4
      - skip: []
      - weight_norm: True
    - mode: idr
  - s_var:
    - anneal_end: 0.1
    - init_val: 3.0
  - sdf:
    - encoding:
      - coarse2fine:
        - enabled: True
        - init_active_level: 4
        - step: 5000
      - hashgrid:
        - dict_size: 22
        - dim: 8
        - max_logres: 11
        - min_logres: 5
        - range: [-2, 2]
      - levels: 16
      - type: hashgrid
    - gradient:
      - mode: numerical
      - taps: 4
    - mlp:
      - activ: softplus
      - activ_params:
        - beta: 100
      - geometric_init: True
      - hidden_dim: 256
      - inside_out: False
      - num_layers: 1
      - out_bias: 0.5
      - skip: []
      - weight_norm: True
- render:
  - num_sample_hierarchy: 4
  - num_samples:
    - background: 32
    - coarse: 64
    - fine: 16
  - rand_rays: 512
  - stratified: True
- type: projects.neuralangelo.model
nvtx_profile: False
optim:
- fused_opt: False
- params:
  - lr: 0.001
  - weight_decay: 0.01
- sched:
  - gamma: 10.0
  - iteration_mode: True
  - step_size: 9999999999
  - two_steps: [300000, 400000]
  - type: two_steps_with_warmup
  - warm_up_end: 5000
- type: AdamW
pretrained_weight: None
source_filename: projects/neuralangelo/configs/custom/lego.yaml
speed_benchmark: False
test_data:
- name: dummy
- num_workers: 0
- test:
  - batch_size: 1
  - is_lmdb: False
  - roots: None
- type: imaginaire.datasets.images
timeout_period: 9999999
trainer:
- amp_config:
  - backoff_factor: 0.5
  - enabled: False
  - growth_factor: 2.0
  - growth_interval: 2000
  - init_scale: 65536.0
- ddp_config:
  - find_unused_parameters: False
  - static_graph: True
- depth_vis_scale: 0.5
- ema_config:
  - beta: 0.9999
  - enabled: False
  - load_ema_checkpoint: False
  - start_iteration: 0
- grad_accum_iter: 1
- image_to_tensorboard: False
- init:
  - gain: None
  - type: none
- loss_weight:
  - curvature: 0.0005
  - eikonal: 0.1
  - render: 1.0
- type: projects.neuralangelo.trainer
validation_iter: 5000
wandb_image_iter: 10000
wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
model parameter count: 366,702,732
Initialize model weights using type: none, gain: None
Using random seed 0
Allow TensorFloat32 operations on supported devices
Train dataset length: 61
Val dataset length: 4
Training from scratch.
Initialize wandb
wandb: Currently logged in as: chzirong. Use wandb login --relogin to force relogin
wandb: wandb version 0.15.9 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.8
wandb: Run data is saved locally in logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run lego_wandb
wandb: ⭐ View project at https://wandb.ai/chzirong/default
wandb: 🚀 View run at https://wandb.ai/chzirong/default/runs/ox3ubipj
Evaluating with 4 samples.
Epoch: 1, total time: 3.859701.
Epoch: 2, total time: 3.737751.
Epoch: 3, total time: 3.760936.
Traceback (most recent call last):
  File "/zhanghuaimin01/Workspace/neuralangelo/train.py", line 104, in <module>
    main()
  File "/zhanghuaimin01/Workspace/neuralangelo/train.py", line 93, in main
    trainer.train(cfg,
  File "/zhanghuaimin01/Workspace/neuralangelo/projects/neuralangelo/trainer.py", line 107, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/zhanghuaimin01/Workspace/neuralangelo/projects/nerf/trainers/base.py", line 115, in train
    super().train(cfg, data_loader, single_gpu, profile, show_pbar)
  File "/zhanghuaimin01/Workspace/neuralangelo/imaginaire/trainers/base.py", line 512, in train
    self.end_of_iteration(data, current_epoch, current_iteration)
  File "/zhanghuaimin01/Workspace/neuralangelo/imaginaire/trainers/base.py", line 319, in end_of_iteration
    self._end_of_iteration(data, current_epoch, current_iteration)
  File "/zhanghuaimin01/Workspace/neuralangelo/projects/nerf/trainers/base.py", line 47, in _end_of_iteration
    self.log_wandb_scalars(data, mode="train")
  File "/zhanghuaimin01/Workspace/neuralangelo/imaginaire/utils/distributed.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/zhanghuaimin01/Workspace/neuralangelo/projects/neuralangelo/trainer.py", line 75, in log_wandb_scalars
    super().log_wandb_scalars(data, mode=mode)
  File "/zhanghuaimin01/Workspace/neuralangelo/imaginaire/utils/distributed.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/zhanghuaimin01/Workspace/neuralangelo/projects/nerf/trainers/base.py", line 84, in log_wandb_scalars
    wandb.log(scalars, step=self.current_iteration)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 390, in wrapper
    return func(self, *args, **kwargs)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 341, in wrapper_fn
    return func(self, *args, **kwargs)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 331, in wrapper
    return func(self, *args, **kwargs)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1752, in log
    self._log(data=data, step=step, commit=commit)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1527, in _log
    self._partial_history_callback(data, step, commit)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1397, in _partial_history_callback
    self._backend.interface.publish_partial_history(
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 653, in publish_partial_history
    self._publish_partial_history(partial_history)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 89, in _publish_partial_history
    self._publish(rec)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe
Exception in thread ChkStopThr:
Traceback (most recent call last):
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 274, in check_stop_status
    self._loop_check_status(
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 212, in _loop_check_status
    local_handle = request()
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 856, in deliver_stop_status
    return self._deliver_stop_status(status)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 594, in _deliver_stop_status
    return self._deliver_record(record)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 569, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
Exception in thread NetStatThr:
Traceback (most recent call last):
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 256, in check_network_status
    self._loop_check_status(
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 212, in _loop_check_status
    local_handle = request()
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 864, in deliver_network_status
    return self._deliver_network_status(status)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 610, in _deliver_network_status
    return self._deliver_record(record)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 569, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 63851) of binary: /root/anaconda3/envs/neuralangelo/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/neuralangelo/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/neuralangelo/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2023-08-29_02:26:52
  host      : a0q74jbdps9k3-0
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 63851)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
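The "error_file: <N/A>" and "To enable traceback" lines in the torchrun summary refer to PyTorch's `record` decorator, which lets the launcher capture the child's exception in the summary. The sketch below shows the general pattern only. I have not checked whether the repository's train.py already uses it, the `main` here is a placeholder, and the `ImportError` fallback is there just so the snippet runs where torch is not installed.

```python
try:
    # Real PyTorch API: makes torchrun record the child's exception details.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Fallback so this sketch still runs without torch (assumption for illustration).
    def record(fn):
        return fn


@record
def main() -> int:
    # Placeholder for the training entry point; a real script would parse
    # arguments and run the trainer here.
    return 0


# A real script would call main() under an `if __name__ == "__main__":` guard.
print(main())
```

In this thread the child's traceback was already printed to the terminal, so the decorator would mainly fill in the error_file and traceback fields of the summary.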

I will also paste the content of the "debug-internal.log" file:

2023-08-29 02:25:55,696 INFO StreamThr :64464 [internal.py:wandb_internal():86] W&B internal server running at pid: 64464, started at: 2023-08-29 02:25:55.694795
2023-08-29 02:25:55,697 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status
2023-08-29 02:25:55,700 INFO WriterThread:64464 [datastore.py:open_for_write():85] open: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/run-ox3ubipj.wandb
2023-08-29 02:25:55,702 DEBUG SenderThread:64464 [sender.py:send():379] send: header
2023-08-29 02:25:55,775 DEBUG SenderThread:64464 [sender.py:send():379] send: run
2023-08-29 02:26:00,776 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: keepalive
2023-08-29 02:26:05,778 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: keepalive
2023-08-29 02:26:08,240 INFO SenderThread:64464 [dir_watcher.py:init():211] watching files in: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files
2023-08-29 02:26:08,240 INFO SenderThread:64464 [sender.py:_start_run_threads():1121] run started: ox3ubipj with start time 1693275955.695242
2023-08-29 02:26:08,240 DEBUG SenderThread:64464 [sender.py:send_request():406] send_request: summary_record
2023-08-29 02:26:08,240 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status_report
2023-08-29 02:26:08,242 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file wandb-summary.json with policy end
2023-08-29 02:26:08,248 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: check_version
2023-08-29 02:26:08,249 DEBUG SenderThread:64464 [sender.py:send_request():406] send_request: check_version
2023-08-29 02:26:09,243 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/wandb-summary.json
2023-08-29 02:26:12,803 DEBUG
HandlerThread:64464 [handler.py:handle_request():144] handle_request: run_start 2023-08-29 02:26:12,808 DEBUG HandlerThread:64464 [system_info.py:init():31] System info init 2023-08-29 02:26:12,808 DEBUG HandlerThread:64464 [system_info.py:init():46] System info init done 2023-08-29 02:26:12,808 INFO HandlerThread:64464 [system_monitor.py:start():181] Starting system monitor 2023-08-29 02:26:12,808 INFO SystemMonitor:64464 [system_monitor.py:_start():145] Starting system asset monitoring threads 2023-08-29 02:26:12,808 INFO HandlerThread:64464 [system_monitor.py:probe():201] Collecting system info 2023-08-29 02:26:12,809 INFO SystemMonitor:64464 [interfaces.py:start():190] Started cpu monitoring 2023-08-29 02:26:12,810 INFO SystemMonitor:64464 [interfaces.py:start():190] Started disk monitoring 2023-08-29 02:26:12,810 INFO SystemMonitor:64464 [interfaces.py:start():190] Started gpu monitoring 2023-08-29 02:26:12,811 INFO SystemMonitor:64464 [interfaces.py:start():190] Started memory monitoring 2023-08-29 02:26:12,812 INFO SystemMonitor:64464 [interfaces.py:start():190] Started network monitoring 2023-08-29 02:26:12,839 DEBUG HandlerThread:64464 [system_info.py:probe():195] Probing system 2023-08-29 02:26:12,845 DEBUG HandlerThread:64464 [system_info.py:_probe_git():180] Probing git 2023-08-29 02:26:12,861 DEBUG HandlerThread:64464 [system_info.py:_probe_git():188] Probing git done 2023-08-29 02:26:12,861 DEBUG HandlerThread:64464 [system_info.py:probe():240] Probing system done 2023-08-29 02:26:12,861 DEBUG HandlerThread:64464 [system_monitor.py:probe():210] {'os': 'Linux-3.10.0-1127.el7.x86_64-x86_64-with-glibc2.27', 'python': '3.9.16', 'heartbeatAt': '2023-08-29T02:26:12.839871', 'startedAt': '2023-08-29T02:25:55.673713', 'docker': None, 'cuda': None, 'args': ('--logdir=logs/nerf_synthesis/lego_wandb', '--config=projects/neuralangelo/configs/custom/lego.yaml', '--show_pbar', '--wandb'), 'state': 'running', 'program': 
'/zhanghuaimin01/Workspace/neuralangelo/train.py', 'codePath': 'train.py', 'git': {'remote': 'https://github.com/NVlabs/neuralangelo.git', 'commit': 'f740c689808537074d46a9d56f8bec2c0be93c7e'}, 'email': '[email protected]', 'root': '/zhanghuaimin01/Workspace/neuralangelo', 'host': 'a0q74jbdps9k3-0', 'username': 'root', 'executable': '/root/anaconda3/envs/neuralangelo/bin/python', 'cpu_count': 64, 'cpu_count_logical': 128, 'cpu_freq': {'current': 3.399999999999993, 'min': 800.0, 'max': 3400.0}, 'cpu_freq_per_core': [{'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 
'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 
3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 
'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}], 'disk': {'total': 3539.7356147766113, 'used': 744.4414558410645}, 'gpu': 'NVIDIA A100-SXM4-40GB', 'gpu_count': 1, 'gpu_devices': [{'name': 'NVIDIA A100-SXM4-40GB', 'memory_total': 42505273344}], 'memory': {'total': 1007.3468627929688}} 2023-08-29 02:26:12,862 INFO HandlerThread:64464 [system_monitor.py:probe():211] Finished collecting system info 2023-08-29 02:26:12,862 INFO HandlerThread:64464 [system_monitor.py:probe():214] Publishing system info 2023-08-29 02:26:12,862 DEBUG HandlerThread:64464 [system_info.py:_save_pip():51] Saving list of pip packages installed into the current environment 2023-08-29 02:26:12,864 DEBUG HandlerThread:64464 [system_info.py:_save_pip():67] Saving pip packages done 2023-08-29 02:26:12,865 DEBUG HandlerThread:64464 [system_info.py:_save_conda():74] Saving list of conda packages installed into the current environment 2023-08-29 02:26:13,245 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/conda-environment.yaml 2023-08-29 02:26:13,245 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/requirements.txt 2023-08-29 02:26:19,113 DEBUG HandlerThread:64464 [system_info.py:_save_conda():86] Saving conda packages done 2023-08-29 02:26:19,117 INFO HandlerThread:64464 [system_monitor.py:probe():216] 
Finished publishing system info 2023-08-29 02:26:19,121 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status_report 2023-08-29 02:26:19,121 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: keepalive 2023-08-29 02:26:19,121 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status_report 2023-08-29 02:26:19,122 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:19,123 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file wandb-metadata.json with policy now 2023-08-29 02:26:19,127 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: stop_status 2023-08-29 02:26:19,127 DEBUG SenderThread:64464 [sender.py:send_request():406] send_request: stop_status 2023-08-29 02:26:19,249 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/conda-environment.yaml 2023-08-29 02:26:19,249 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/wandb-metadata.json 2023-08-29 02:26:19,825 DEBUG SenderThread:64464 [sender.py:send():379] send: telemetry 2023-08-29 02:26:19,825 DEBUG SenderThread:64464 [sender.py:send():379] send: config 2023-08-29 02:26:19,825 DEBUG SenderThread:64464 [sender.py:send():379] send: telemetry 2023-08-29 02:26:20,194 INFO wandb-upload_0:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/t9dpsmie-wandb-metadata.json 2023-08-29 02:26:20,250 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:22,252 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:23,828 
DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status_report 2023-08-29 02:26:28,261 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:29,300 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status_report 2023-08-29 02:26:30,265 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/config.yaml 2023-08-29 02:26:32,598 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: partial_history 2023-08-29 02:26:32,599 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: partial_history 2023-08-29 02:26:34,127 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: stop_status 2023-08-29 02:26:34,127 DEBUG SenderThread:64464 [sender.py:send_request():406] send_request: stop_status 2023-08-29 02:26:34,352 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:35,353 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:35,353 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file media/images/val/vis/rgb_target_0_39365b313d2292dd4eba.png with policy now 2023-08-29 02:26:35,353 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status_report 2023-08-29 02:26:35,429 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:35,429 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file media/images/val/vis/rgb_render_0_0221db07b3e58dba4e0c.png with policy now 2023-08-29 02:26:35,456 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: 
logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:35,456 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis/rgb_target_0_39365b313d2292dd4eba.png 2023-08-29 02:26:35,466 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis/rgb_render_0_0221db07b3e58dba4e0c.png 2023-08-29 02:26:35,480 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media 2023-08-29 02:26:35,493 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val 2023-08-29 02:26:35,493 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis 2023-08-29 02:26:35,493 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images 2023-08-29 02:26:35,514 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:35,529 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file media/images/val/vis/rgb_error_0_26876becb829857eefc2.png with policy now 2023-08-29 02:26:35,577 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:35,577 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file media/images/val/vis/normal_0_7e1272e24357100780aa.png with policy now 2023-08-29 02:26:35,650 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:35,650 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file 
media/images/val/vis/inv_depth_0_fbd0daa20de9191a79ae.png with policy now 2023-08-29 02:26:35,719 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:35,719 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file media/images/val/vis/opacity_0_5608dd29b12d5fdfbf5d.png with policy now 2023-08-29 02:26:35,719 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: partial_history 2023-08-29 02:26:36,147 INFO wandb-upload_1:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/gedptt0m-media/images/val/vis/rgb_render_0_0221db07b3e58dba4e0c.png 2023-08-29 02:26:36,497 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis/rgb_error_0_26876becb829857eefc2.png 2023-08-29 02:26:36,497 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis/inv_depth_0_fbd0daa20de9191a79ae.png 2023-08-29 02:26:36,502 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis/normal_0_7e1272e24357100780aa.png 2023-08-29 02:26:36,515 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis/opacity_0_5608dd29b12d5fdfbf5d.png 2023-08-29 02:26:36,523 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis 2023-08-29 02:26:36,543 INFO wandb-upload_0:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/fsbkbi5x-media/images/val/vis/rgb_target_0_39365b313d2292dd4eba.png 2023-08-29 02:26:36,647 INFO wandb-upload_5:64464 
[upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/6sdhpbpw-media/images/val/vis/opacity_0_5608dd29b12d5fdfbf5d.png 2023-08-29 02:26:36,935 INFO wandb-upload_4:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/bdsumm31-media/images/val/vis/inv_depth_0_fbd0daa20de9191a79ae.png 2023-08-29 02:26:37,015 INFO wandb-upload_3:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/l7ppndyh-media/images/val/vis/normal_0_7e1272e24357100780aa.png 2023-08-29 02:26:37,315 INFO wandb-upload_1:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/z8ms38am-media/images/val/vis/rgb_render_0_0221db07b3e58dba4e0c.png 2023-08-29 02:26:37,442 INFO wandb-upload_2:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/abedw3ot-media/images/val/vis/rgb_error_0_26876becb829857eefc2.png 2023-08-29 02:26:37,877 INFO wandb-upload_5:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/zj43bbvk-media/images/val/vis/inv_depth_0_fbd0daa20de9191a79ae.png 2023-08-29 02:26:37,912 INFO wandb-upload_0:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/83lwwq0x-media/images/val/vis/rgb_target_0_39365b313d2292dd4eba.png 2023-08-29 02:26:38,418 INFO wandb-upload_1:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/w8geu1ah-media/images/val/vis/opacity_0_5608dd29b12d5fdfbf5d.png 2023-08-29 02:26:38,436 INFO wandb-upload_3:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/fhn1dp2f-media/images/val/vis/normal_0_7e1272e24357100780aa.png 2023-08-29 02:26:38,560 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:39,096 INFO wandb-upload_2:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/tpg4ibp2-media/images/val/vis/rgb_error_0_26876becb829857eefc2.png 2023-08-29 02:26:40,374 DEBUG HandlerThread:64464 
[handler.py:handle_request():144] handle_request: status_report 2023-08-29 02:26:40,700 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:42,775 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log

Could this be a problem with my internet connection? Alternatively, is there a way to use TensorBoard to visualize the training instead? Thanks.

Aug 29 '23 02:08 ZirongChan

This seems to be an issue on the W&B side. We don't support TensorBoard right now, but PRs are welcome if you'd like to help add this support.
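For context, `BrokenPipeError: [Errno 32]` is what Python raises when a process writes to a pipe or socket whose other end has already been closed. The sketch below reproduces that error in isolation on a POSIX system. Whether this is what happened between the training process and a W&B background process in this run is an assumption; the log above does not confirm it. The function name `write_to_closed_pipe` is made up for illustration.

```python
import errno
import os


def write_to_closed_pipe() -> int:
    """Write to a pipe whose read end is closed and return the errno raised."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # simulate the receiving side going away
    try:
        os.write(write_fd, b"log message")
    except BrokenPipeError as err:
        # Python ignores SIGPIPE by default, so the failed write surfaces
        # as an exception carrying errno EPIPE instead of killing the process.
        return err.errno
    finally:
        os.close(write_fd)
    return 0


print(write_to_closed_pipe() == errno.EPIPE)
```

The point is only that the error is raised on the writing side, so the traceback appears in the process that was still alive, not in the one that went away.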

Aug 29 '23 22:08 chenhsuanlin

This seems to be an issue on the W&B side. We don't support TensorBoard right now, but PRs are welcome if you'd like to help add this support.

It seems to be related to distributed training. I also tried setting the --single_gpu flag, but it did not help: the error log still pointed to distributed training, e.g. "raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:".
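One thing worth keeping in mind when reading such a traceback: a launcher like torchrun reports any failure of the worker process through its own wrapper exception, so seeing `ChildFailedError` does not by itself show that distributed training caused the crash. The following is a rough analogy using only the standard library, not torch's actual code; `ChildFailed` and `launch` are hypothetical names.

```python
import subprocess
import sys


class ChildFailed(RuntimeError):
    """Stand-in for a launcher's wrapper exception (hypothetical, not torch's class)."""


def launch(child_code: str) -> None:
    """Run child_code in a separate Python process and wrap any failure."""
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # The launcher only knows that the child exited non-zero; the real
        # cause is whatever traceback the child wrote to stderr.
        raise ChildFailed(
            f"child exited with code {result.returncode}\n{result.stderr}"
        )


try:
    launch("raise BrokenPipeError(32, 'Broken pipe')")
except ChildFailed as err:
    # The outer exception is the wrapper; the root cause is in its message.
    print(type(err).__name__)
    print("BrokenPipeError" in str(err))
```

In other words, the traceback printed by the worker itself (above the launcher's summary) is usually the more informative part of the log.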

Is there a switch elsewhere in the code that I can use to make sure distributed training is disabled?

Sep 08 '23 08:09 ZirongChan

To disable distributed training, you can run python train.py --single_gpu ... instead of torchrun --nproc_per_node=1 train.py ... and it should work.
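As a sketch of the difference between the two launch modes, the snippet below assembles both command lines. The flags in `COMMON_ARGS` are copied from the command posted earlier in this thread and should be adjusted to your setup; `build_command` is a hypothetical helper, not part of the repository.

```python
import shlex
from typing import List

# Flags taken from the command earlier in this thread (adjust as needed).
COMMON_ARGS = [
    "--logdir=logs/nerf_synthesis/lego_wandb",
    "--config=projects/neuralangelo/configs/custom/lego.yaml",
    "--show_pbar",
    "--wandb",
]


def build_command(single_gpu: bool) -> List[str]:
    """Return the argv for a plain single-process launch or a torchrun launch."""
    if single_gpu:
        # No torchrun wrapper: train.py is started directly by the interpreter.
        return ["python", "train.py", "--single_gpu", *COMMON_ARGS]
    return ["torchrun", "--nproc_per_node=1", "train.py", *COMMON_ARGS]


# Print the single-process command so it can be pasted into a shell.
print(shlex.join(build_command(single_gpu=True)))
```

The key point is that the single-process variant replaces `torchrun --nproc_per_node=1` with `python` and adds `--single_gpu`, rather than adding the flag to the torchrun command.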

Sep 14 '23 19:09 chenhsuanlin