I keep getting stuck on this issue while fine-tuning

Open shekharmeena2896 opened this issue 6 months ago • 0 comments

I don't know why, but it either gets stuck or gives me a CUDA out-of-memory error, even though I have two V100 GPUs.

(piper) root@t1-le-45-gra7:/home/ubuntu/piper# TORCH_DISTRIBUTED_DEBUG=DETAIL python3 -m piper_train \
--max-phoneme-ids 400 \
--dataset-dir /home/ubuntu/piper/female_dataset_prepared \
--accelerator gpu \
--devices -1 \
--strategy ddp \
--batch-size 2 \
--validation-split 0.0 \
--num-test-examples 0 \
--max_epochs 10000 \
--resume_from_checkpoint /home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt \
--checkpoint-epochs 1 \
--precision 16

DEBUG:piper_train:Namespace(dataset_dir='/home/ubuntu/piper/female_dataset_prepared', checkpoint_epochs=1, quality='medium', resume_from_single_speaker_checkpoint=None, logger=True, enable_checkpointing=True, default_root_dir=None, gradient_clip_val=None, gradient_clip_algorithm=None, num_nodes=1, num_processes=None, devices='-1', gpus=None, auto_select_gpus=False, tpu_cores=None, ipus=None, enable_progress_bar=True, overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=None, max_epochs=10000, min_epochs=None, max_steps=-1, min_steps=None, max_time=None, limit_train_batches=None, limit_val_batches=None, limit_test_batches=None, limit_predict_batches=None, val_check_interval=None, log_every_n_steps=50, accelerator='gpu', strategy='ddp', sync_batchnorm=False, precision=16, enable_model_summary=True, weights_save_path=None, num_sanity_val_steps=2, resume_from_checkpoint='/home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt', profiler=None, benchmark=None, deterministic=None, reload_dataloaders_every_n_epochs=0, auto_lr_find=False, replace_sampler_ddp=True, detect_anomaly=False, auto_scale_batch_size=False, plugins=None, amp_backend='native', amp_level=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', batch_size=2, validation_split=0.0, num_test_examples=0, max_phoneme_ids=400, hidden_channels=192, inter_channels=192, filter_channels=768, n_layers=6, n_heads=2, seed=1234)
Using 16bit native Automatic Mixed Precision (AMP)
/root/anaconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:52: LightningDeprecationWarning: Setting `Trainer(resume_from_checkpoint=)` is deprecated in v1.5 and will be removed in v1.7. Please pass `Trainer.fit(ckpt_path=)` directly instead.
  rank_zero_deprecation(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
DEBUG:piper_train:Checkpoints will be saved every 1 epoch(s)
DEBUG:vits.dataset:Loading dataset: /home/ubuntu/piper/female_dataset_prepared/dataset.jsonl
WARNING:vits.dataset:Skipped 1084 utterance(s)
DEBUG:piper_train:Namespace(dataset_dir='/home/ubuntu/piper/female_dataset_prepared', checkpoint_epochs=1, quality='medium', resume_from_single_speaker_checkpoint=None, logger=True, enable_checkpointing=True, default_root_dir=None, gradient_clip_val=None, gradient_clip_algorithm=None, num_nodes=1, num_processes=None, devices='-1', gpus=None, auto_select_gpus=False, tpu_cores=None, ipus=None, enable_progress_bar=True, overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=None, max_epochs=10000, min_epochs=None, max_steps=-1, min_steps=None, max_time=None, limit_train_batches=None, limit_val_batches=None, limit_test_batches=None, limit_predict_batches=None, val_check_interval=None, log_every_n_steps=50, accelerator='gpu', strategy='ddp', sync_batchnorm=False, precision=16, enable_model_summary=True, weights_save_path=None, num_sanity_val_steps=2, resume_from_checkpoint='/home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt', profiler=None, benchmark=None, deterministic=None, reload_dataloaders_every_n_epochs=0, auto_lr_find=False, replace_sampler_ddp=True, detect_anomaly=False, auto_scale_batch_size=False, plugins=None, amp_backend='native', amp_level=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', batch_size=2, validation_split=0.0, num_test_examples=0, max_phoneme_ids=400, hidden_channels=192, inter_channels=192, filter_channels=768, n_layers=6, n_heads=2, seed=1234)
DEBUG:piper_train:Checkpoints will be saved every 1 epoch(s)
DEBUG:vits.dataset:Loading dataset: /home/ubuntu/piper/female_dataset_prepared/dataset.jsonl
/root/anaconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:731: LightningDeprecationWarning: `trainer.resume_from_checkpoint` is deprecated in v1.5 and will be removed in v2.0. Specify the fit checkpoint path with `trainer.fit(ckpt_path=)` instead.
  ckpt_path = ckpt_path or self.resume_from_checkpoint
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
WARNING:vits.dataset:Skipped 1084 utterance(s)
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Restoring states from the checkpoint path at /home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt
DEBUG:fsspec.local:open file: /home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt
DEBUG:fsspec.local:open file: /home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt
/root/anaconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1659: UserWarning: Be aware that when using ckpt_path, callbacks used to create the checkpoint need to be provided during Trainer instantiation. Please add the following callbacks: ["ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None}"].
  rank_zero_warn(
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
DEBUG:fsspec.local:open file: /home/ubuntu/piper/female_dataset_prepared/lightning_logs/version_24/hparams.yaml
Restored all states from the checkpoint file at /home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt
/root/anaconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:153: UserWarning: Total length of DataLoader across ranks is zero. Please make sure this was your intention.
  rank_zero_warn(
/root/anaconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:236: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/root/anaconda3/envs/piper/lib/python3.10/site-packages/torch/functional.py:632: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:31.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/root/anaconda3/envs/piper/lib/python3.10/site-packages/torch/functional.py:632: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:31.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/root/anaconda3/envs/piper/lib/python3.10/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 9, 96], strides() = [61536, 96, 1]
bucket_view.sizes() = [1, 9, 96], strides() = [864, 96, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/root/anaconda3/envs/piper/lib/python3.10/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 9, 96], strides() = [62304, 96, 1]
bucket_view.sizes() = [1, 9, 96], strides() = [864, 96, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
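The log reports `Skipped 1084 utterance(s)` and a warning that the total DataLoader length across ranks is zero, so one thing that may be worth checking is how many utterances actually survive the `--max-phoneme-ids 400` limit. Below is a minimal sketch for that check. It assumes each line of `dataset.jsonl` is a JSON object with a `phoneme_ids` list, and that utterances longer than the limit are the ones being skipped; neither is confirmed by the output above. `count_skipped` is a hypothetical helper, not part of `piper_train`.

```python
import json
import sys
from typing import Iterable, Tuple


def count_skipped(jsonl_lines: Iterable[str], max_phoneme_ids: int) -> Tuple[int, int]:
    """Return (kept, skipped) for utterances under/over the phoneme-id limit.

    Assumption: every non-empty line is a JSON object with a "phoneme_ids" list.
    """
    kept = 0
    skipped = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        utterance = json.loads(line)
        # Assumed filter rule: utterances longer than the limit are dropped.
        if len(utterance["phoneme_ids"]) > max_phoneme_ids:
            skipped += 1
        else:
            kept += 1
    return kept, skipped


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_skipped.py /path/to/dataset.jsonl
    with open(sys.argv[1], encoding="utf-8") as f:
        kept, skipped = count_skipped(f, 400)
    print(f"kept={kept} skipped={skipped}")
```

If the kept count comes out very small compared to the number of GPUs times the batch size, that could explain the zero-length DataLoader warning, though it would not by itself explain the out-of-memory error.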

Jun 16 '25 10:06 shekharmeena2896