I keep getting stuck on this issue while fine-tuning.
I don't know why, but it gets stuck and gives me a CUDA out-of-memory error even though I have two V100 GPUs.
(piper) root@t1-le-45-gra7:/home/ubuntu/piper# TORCH_DISTRIBUTED_DEBUG=DETAIL python3 -m piper_train \
--max-phoneme-ids 400 \
--dataset-dir /home/ubuntu/piper/female_dataset_prepared \
--accelerator gpu \
--devices -1 \
--strategy ddp \
--batch-size 2 \
--validation-split 0.0 \
--num-test-examples 0 \
--max_epochs 10000 \
--resume_from_checkpoint /home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt \
--checkpoint-epochs 1 \
--precision 16
DEBUG:piper_train:Namespace(dataset_dir='/home/ubuntu/piper/female_dataset_prepared', checkpoint_epochs=1, quality='medium', resume_from_single_speaker_checkpoint=None, logger=True, enable_checkpointing=True, default_root_dir=None, gradient_clip_val=None, gradient_clip_algorithm=None, num_nodes=1, num_processes=None, devices='-1', gpus=None, auto_select_gpus=False, tpu_cores=None, ipus=None, enable_progress_bar=True, overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=None, max_epochs=10000, min_epochs=None, max_steps=-1, min_steps=None, max_time=None, limit_train_batches=None, limit_val_batches=None, limit_test_batches=None, limit_predict_batches=None, val_check_interval=None, log_every_n_steps=50, accelerator='gpu', strategy='ddp', sync_batchnorm=False, precision=16, enable_model_summary=True, weights_save_path=None, num_sanity_val_steps=2, resume_from_checkpoint='/home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt', profiler=None, benchmark=None, deterministic=None, reload_dataloaders_every_n_epochs=0, auto_lr_find=False, replace_sampler_ddp=True, detect_anomaly=False, auto_scale_batch_size=False, plugins=None, amp_backend='native', amp_level=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', batch_size=2, validation_split=0.0, num_test_examples=0, max_phoneme_ids=400, hidden_channels=192, inter_channels=192, filter_channels=768, n_layers=6, n_heads=2, seed=1234)
Using 16bit native Automatic Mixed Precision (AMP)
/root/anaconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:52: LightningDeprecationWarning: Setting Trainer(resume_from_checkpoint=) is deprecated in v1.5 and will be removed in v1.7. Please pass Trainer.fit(ckpt_path=) directly instead.
rank_zero_deprecation(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
DEBUG:piper_train:Checkpoints will be saved every 1 epoch(s)
DEBUG:vits.dataset:Loading dataset: /home/ubuntu/piper/female_dataset_prepared/dataset.jsonl
WARNING:vits.dataset:Skipped 1084 utterance(s)
DEBUG:piper_train:Namespace(dataset_dir='/home/ubuntu/piper/female_dataset_prepared', checkpoint_epochs=1, quality='medium', resume_from_single_speaker_checkpoint=None, logger=True, enable_checkpointing=True, default_root_dir=None, gradient_clip_val=None, gradient_clip_algorithm=None, num_nodes=1, num_processes=None, devices='-1', gpus=None, auto_select_gpus=False, tpu_cores=None, ipus=None, enable_progress_bar=True, overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=None, max_epochs=10000, min_epochs=None, max_steps=-1, min_steps=None, max_time=None, limit_train_batches=None, limit_val_batches=None, limit_test_batches=None, limit_predict_batches=None, val_check_interval=None, log_every_n_steps=50, accelerator='gpu', strategy='ddp', sync_batchnorm=False, precision=16, enable_model_summary=True, weights_save_path=None, num_sanity_val_steps=2, resume_from_checkpoint='/home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt', profiler=None, benchmark=None, deterministic=None, reload_dataloaders_every_n_epochs=0, auto_lr_find=False, replace_sampler_ddp=True, detect_anomaly=False, auto_scale_batch_size=False, plugins=None, amp_backend='native', amp_level=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', batch_size=2, validation_split=0.0, num_test_examples=0, max_phoneme_ids=400, hidden_channels=192, inter_channels=192, filter_channels=768, n_layers=6, n_heads=2, seed=1234)
DEBUG:piper_train:Checkpoints will be saved every 1 epoch(s)
DEBUG:vits.dataset:Loading dataset: /home/ubuntu/piper/female_dataset_prepared/dataset.jsonl
/root/anaconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:731: LightningDeprecationWarning: trainer.resume_from_checkpoint is deprecated in v1.5 and will be removed in v2.0. Specify the fit checkpoint path with trainer.fit(ckpt_path=) instead.
ckpt_path = ckpt_path or self.resume_from_checkpoint
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
WARNING:vits.dataset:Skipped 1084 utterance(s)
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
distributed_backend=nccl All distributed processes registered. Starting with 2 processes
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Restoring states from the checkpoint path at /home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt
DEBUG:fsspec.local:open file: /home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt
DEBUG:fsspec.local:open file: /home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt
/root/anaconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1659: UserWarning: Be aware that when using ckpt_path, callbacks used to create the checkpoint need to be provided during Trainer instantiation. Please add the following callbacks: ["ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 1, 'train_time_interval': None}"].
rank_zero_warn(
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
DEBUG:fsspec.local:open file: /home/ubuntu/piper/female_dataset_prepared/lightning_logs/version_24/hparams.yaml
Restored all states from the checkpoint file at /home/ubuntu/piper/epoch%3D6618-step%3D187068.ckpt
/root/anaconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:153: UserWarning: Total length of DataLoader across ranks is zero. Please make sure this was your intention.
rank_zero_warn(
/root/anaconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:236: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/root/anaconda3/envs/piper/lib/python3.10/site-packages/torch/functional.py:632: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:31.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
/root/anaconda3/envs/piper/lib/python3.10/site-packages/torch/functional.py:632: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:31.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
/root/anaconda3/envs/piper/lib/python3.10/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 9, 96], strides() = [61536, 96, 1]
bucket_view.sizes() = [1, 9, 96], strides() = [864, 96, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
/root/anaconda3/envs/piper/lib/python3.10/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1, 9, 96], strides() = [62304, 96, 1]
bucket_view.sizes() = [1, 9, 96], strides() = [864, 96, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass