OpenLRM
ValueError: math domain error
summary
- the error occurs during training
- tested on Runpod with 4x A100 SXM 80 GB GPUs, 128 vCPUs, 1006 GB RAM
- runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
reproduction of the error
- installation of OpenLRM was successful
- data preparation using blender_script.py was successful; it generated 100 pairs of data, each containing rgba, pose, and intrinsics.npy
- configuration of training_sample.yaml and accelerate_training.yaml as follows (for the uid list files referenced here, see the sketch after the error message):

training_sample.yaml:

```yaml
experiment:
  type: lrm
  seed: 42
  parent: lrm-objaverse
  child: small-dummyrun

model:
  camera_embed_dim: 1024
  rendering_samples_per_ray: 96
  transformer_dim: 512
  transformer_layers: 12
  transformer_heads: 8
  triplane_low_res: 32
  triplane_high_res: 64
  triplane_dim: 32
  encoder_type: dinov2
  encoder_model_name: dinov2_vits14_reg
  encoder_feat_dim: 384
  encoder_freeze: false

dataset:
  subsets:
    - name: objaverse
      root_dirs:
        - "/root/OpenLRM/views" # modified this value
      meta_path:
        train: "/root/OpenLRM/train_uids.json" # modified this value
        val: "/root/OpenLRM/val_uids.json" # modified this value
      sample_rate: 1.0
  sample_side_views: 3
  source_image_res: 224
  render_image:
    low: 64
    high: 192
    region: 64
  normalize_camera: true
  normed_dist_to_center: auto
  num_train_workers: 4
  num_val_workers: 2
  pin_mem: true

train:
  mixed_precision: bf16 # REPLACE THIS BASED ON GPU TYPE
  find_unused_parameters: false
  loss:
    pixel_weight: 1.0
    perceptual_weight: 1.0
    tv_weight: 5e-4
  optim:
    lr: 4e-4
    weight_decay: 0.05
    beta1: 0.9
    beta2: 0.95
    clip_grad_norm: 1.0
  scheduler:
    type: cosine
    warmup_real_iters: 3000
  batch_size: 16 # REPLACE THIS (PER GPU)
  accum_steps: 1 # REPLACE THIS
  epochs: 60 # REPLACE THIS
  debug_global_steps: null

val:
  batch_size: 4
  global_step_period: 1000
  debug_batches: null

saver:
  auto_resume: true
  load_model: null
  checkpoint_root: ./exps/checkpoints
  checkpoint_global_steps: 1000
  checkpoint_keep_level: 5

logger:
  stream_level: WARNING
  log_level: INFO
  log_root: ./exps/logs
  tracker_root: ./exps/trackers
  enable_profiler: false
  trackers:
    - tensorboard
  image_monitor:
    train_global_steps: 100
    samples_per_log: 4

compile:
  suppress_errors: true
  print_specializations: true
  disable: true
```

accelerate_training.yaml:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4 # only modified this value
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

- the error message:
```
[TRAIN STEP]loss=0.624, loss_pixel=0.0577, loss_perceptual=0.566, loss_tv=0.698, lr=8.13e-6: 100%|███████████████████████████████████████████████| 60/60 [04:55<00:00, 4.92s/it]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/OpenLRM/openlrm/launch.py", line 36, in <module>
    main()
  File "/root/OpenLRM/openlrm/launch.py", line 32, in main
    runner.run()
  File "/root/OpenLRM/openlrm/runners/train/base_trainer.py", line 338, in run
    self.train()
  File "/root/OpenLRM/openlrm/runners/train/lrm.py", line 343, in train
    self.save_checkpoint()
  File "/root/OpenLRM/openlrm/runners/train/base_trainer.py", line 118, in wrapper
    result = accelerated_func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 669, in _inner
    return PartialState().on_main_process(function)(*args, **kwargs)
  File "/root/OpenLRM/openlrm/runners/train/base_trainer.py", line 246, in save_checkpoint
    cur_order = ckpt_base ** math.floor(math.log(max_ckpt // ckpt_period, ckpt_base))
ValueError: math domain error
[2024-04-17 08:24:09,179] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65932 closing signal SIGTERM
[2024-04-17 08:24:09,183] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65933 closing signal SIGTERM
[2024-04-17 08:24:09,186] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65934 closing signal SIGTERM
[2024-04-17 08:24:09,301] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 65931) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1066, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
openlrm.launch FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-17_08:24:09
  host      : dcf76dfb9908
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 65931)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
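Side note on the dataset config above: meta_path points at train_uids.json and val_uids.json, which are not shown in this thread. Below is a minimal sketch of how such uid lists can be produced; it assumes the meta files are plain JSON lists of the per-object folder names under root_dirs (one folder per rendered object), and the 90/10 split is only an illustrative choice.

```python
# Hedged sketch: build train/val uid lists from the rendered views directory.
# Assumes each rendered object lives in its own subfolder of /root/OpenLRM/views
# and that the meta_path files are plain JSON lists of those folder names (uids).
import json
import os
import random

views_root = "/root/OpenLRM/views"
uids = sorted(
    d for d in os.listdir(views_root)
    if os.path.isdir(os.path.join(views_root, d))
)

random.seed(42)
random.shuffle(uids)

split = int(len(uids) * 0.9)  # illustrative 90/10 train/val split
with open("/root/OpenLRM/train_uids.json", "w") as f:
    json.dump(uids[:split], f)
with open("/root/OpenLRM/val_uids.json", "w") as f:
    json.dump(uids[split:], f)
```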
Hey @hayoung-jeremy, try reducing the value of global_step_period under val: in the train sample yaml file until it stops giving the error. That worked for me when I was training with 350 objects.
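For context on why the period matters: the failing line in base_trainer.py evaluates math.log(max_ckpt // ckpt_period, ckpt_base). Whenever max_ckpt is smaller than ckpt_period, the integer division yields 0, and math.log(0, base) is exactly what raises "math domain error". With a tiny dummy run (the progress bar above shows only 60 global steps) and 1000-step periods in the config, that situation is easy to hit. A minimal sketch with illustrative numbers (the exact config keys feeding max_ckpt and ckpt_period are not shown in this thread):

```python
import math

# Minimal reproduction of the failure mode seen in save_checkpoint.
# Values below are illustrative, not OpenLRM's actual runtime variables.
ckpt_base   = 5     # assumption: possibly checkpoint_keep_level from the config
ckpt_period = 1000  # e.g. a 1000-step checkpoint/validation period
max_ckpt    = 60    # a tiny dummy run: only 60 global steps in total

ratio = max_ckpt // ckpt_period  # 60 // 1000 == 0
try:
    cur_order = ckpt_base ** math.floor(math.log(ratio, ckpt_base))
except ValueError as err:
    print(err)  # "math domain error" -- log(0) is undefined

# Keeping the period at or below the total number of global steps keeps the
# ratio >= 1, so the logarithm is defined and checkpoint saving succeeds.
```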
Wow, you're my savior, thank you so much! I'll try it!
Thank you @kunalkathare, I've tried with the following config, with epochs and global_step_period modified:
```yaml
...
train:
  mixed_precision: bf16
  find_unused_parameters: false
  loss:
    pixel_weight: 1.0
    perceptual_weight: 1.0
    tv_weight: 5e-4
  optim:
    lr: 4e-4
    weight_decay: 0.05
    beta1: 0.9
    beta2: 0.95
    clip_grad_norm: 1.0
  scheduler:
    type: cosine
    warmup_real_iters: 3000
  batch_size: 16
  accum_steps: 1
  epochs: 100 # MODIFIED : 60 -> 100
  debug_global_steps: null

val:
  batch_size: 4
  global_step_period: 100 # MODIFIED : 1000 -> 100
  debug_batches: null
...
```
and it successfully generated a checkpoint as follows:
[TRAIN STEP]loss=0.642, loss_pixel=0.0695, loss_perceptual=0.572, loss_tv=0.7, lr=1.35e-5: 100%|███████████████████████████████████████████████| 100/100 [03:24<00:00, 5.10s/it]
But it seems the loss value is too high. What should I modify to decrease it? Should I increase the epochs to 1000? And what is an ideal loss value for a successfully trained checkpoint? Could you share your case? Thank you so much for your help.
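For reference, the step counts in these runs work out as follows; a rough sketch assuming the training dataloader drops the last incomplete batch, which matches the 60/60 and 100/100 progress bars above:

```python
# Rough step arithmetic for this setup (an estimate, not OpenLRM internals).
num_objects   = 100  # rendered objects under /root/OpenLRM/views
batch_per_gpu = 16   # train.batch_size (per GPU)
num_gpus      = 4    # accelerate num_processes
epochs        = 100  # after the 60 -> 100 change

effective_batch = batch_per_gpu * num_gpus        # 64 samples per optimizer step
steps_per_epoch = num_objects // effective_batch  # 100 // 64 == 1
total_steps     = steps_per_epoch * epochs        # 100 global steps

# 100 total steps now reaches the 100-step global_step_period, whereas the
# original 60-step run never reached the 1000-step periods in the config.
print(total_steps)  # 100
```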
The loss value goes down when the dataset is larger, and I guess you can increase the epochs and see if that helps.
Thank you for the kind reply @kunalkathare!
- I don't have enough data for now; can I just duplicate the same data to increase the amount?
- And I've tried increasing the epochs to 1000; it also generated a checkpoint, with a loss value of about 0.3. But the inference quality from that checkpoint is not that good, as you can see in this issue. So I'm going to try increasing the epochs to 10000, is that okay? If it is, what kind of values should I adjust in train_sample.yaml?
Really great help from you, many thanks for your assistance.
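One more data point that may matter for the loss question: with warmup_real_iters: 3000 and only around 100-1000 global steps in total, the logged lr (about 1.35e-5) is still deep inside warmup of the configured 4e-4, so the model never trains at its full learning rate. A hedged sketch of the settings one might revisit together when scaling the run up; the numbers are illustrative, not recommendations from the OpenLRM authors:

```yaml
train:
  optim:
    lr: 4e-4
  scheduler:
    type: cosine
    warmup_real_iters: 3000  # with ~1 global step per epoch on this dataset,
                             # 10000 epochs ~= 10000 steps, so 3000 warmup steps
                             # take ~30% of the run; shorter runs never leave warmup
  epochs: 10000              # total global steps ~= epochs * steps_per_epoch

val:
  global_step_period: 100    # keep at or below the total number of global steps

saver:
  checkpoint_global_steps: 1000  # same constraint, so intermediate checkpoints
                                 # are actually written during the run
```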