`Trainer.fit` stopped: No training batches.
Hi, and thanks for your excellent work! But when I tried to train the model using a T4 GPU on Colab, I got the following error message:
Trainer.fit stopped: No training batches.
Here are my settings:
python -m piper_train \
--max-phoneme-ids 400 \
--dataset-dir "{output_dir}" \
--accelerator 'gpu' \
--devices 1 \
--batch-size {batch_size} \
--validation-split {validation_split} \
--num-test-examples {num_test_examples} \
--quality {quality} \
--checkpoint-epochs {checkpoint_epochs} \
--num_ckpt {num_ckpt} \
{save_last_command}\
--log_every_n_steps {log_every_n_steps} \
--max_epochs {max_epochs} \
{ft_command}\
--precision 32
Do you have any ideas on how to fix this?
Hi @bibidentuhanoi, how long is your dataset? And why are you using --max-phoneme-ids? Is it strictly necessary for your dataset?
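(Not part of the original exchange, but one way to answer those two questions for yourself is a short script that reports how many utterances the preprocessed dataset contains and how many a given --max-phoneme-ids value would drop. It assumes each line of dataset.jsonl holds a JSON object with a "phoneme_ids" list, which is what piper_train's preprocessing writes; the path and threshold below are only examples, so adjust them for your setup.)

# Sketch (not part of piper): count utterances in dataset.jsonl and see how many
# would be skipped by a given --max-phoneme-ids value. Assumes each line carries a
# "phoneme_ids" list; adjust the key and path if your piper version differs.
import json
from pathlib import Path

dataset_file = Path("/content/drive/MyDrive/colab/piper/output/dataset.jsonl")  # example path
max_phoneme_ids = 400  # the value passed to --max-phoneme-ids

lengths = []
for line in dataset_file.read_text(encoding="utf-8").splitlines():
    if line.strip():
        lengths.append(len(json.loads(line)["phoneme_ids"]))

too_long = sum(1 for n in lengths if n > max_phoneme_ids)
print(f"{len(lengths)} utterance(s) in the dataset")
if lengths:
    print(f"phoneme-id lengths: min={min(lengths)}, max={max(lengths)}")
print(f"{too_long} utterance(s) exceed {max_phoneme_ids} and would be skipped")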
Heya folks!
I am running into the same issue. I am following the shared notebook project here.
My output is:
Output
DEBUG:piper_train:Namespace(dataset_dir='/content/drive/MyDrive/colab/piper/Jarvis', checkpoint_epochs=5, quality='medium', resume_from_single_speaker_checkpoint=None, logger=True, enable_checkpointing=True, default_root_dir=None, gradient_clip_val=None, gradient_clip_algorithm=None, num_nodes=1, num_processes=None, devices='1', gpus=None, auto_select_gpus=False, tpu_cores=None, ipus=None, enable_progress_bar=True, overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=None, max_epochs=10000, min_epochs=None, max_steps=-1, min_steps=None, max_time=None, limit_train_batches=None, limit_val_batches=None, limit_test_batches=None, limit_predict_batches=None, val_check_interval=None, log_every_n_steps=1000, accelerator='gpu', strategy=None, sync_batchnorm=False, precision=32, enable_model_summary=True, weights_save_path=None, num_sanity_val_steps=2, resume_from_checkpoint='/content/pretrained.ckpt', profiler=None, benchmark=None, deterministic=None, reload_dataloaders_every_n_epochs=0, auto_lr_find=False, replace_sampler_ddp=True, detect_anomaly=False, auto_scale_batch_size=False, plugins=None, amp_backend='native', amp_level=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', batch_size=5, validation_split=0.0, num_test_examples=0, max_phoneme_ids=600, hidden_channels=192, inter_channels=192, filter_channels=768, n_layers=6, n_heads=2, seed=1234, num_ckpt=0, save_last=True)
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:52: LightningDeprecationWarning: Setting Trainer(resume_from_checkpoint=) is deprecated in v1.5 and will be removed in v1.7. Please pass Trainer.fit(ckpt_path=) directly instead.
rank_zero_deprecation(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
DEBUG:piper_train:Checkpoints will be saved every 5 epoch(s)
DEBUG:piper_train:0 Checkpoints will be saved
DEBUG:vits.dataset:Loading dataset: /content/drive/MyDrive/colab/piper/Jarvis/dataset.jsonl
WARNING:vits.dataset:Skipped 5 utterance(s)
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py:731: LightningDeprecationWarning: trainer.resume_from_checkpoint is deprecated in v1.5 and will be removed in v2.0. Specify the fit checkpoint path with trainer.fit(ckpt_path=) instead.
ckpt_path = ckpt_path or self.resume_from_checkpoint
Restoring states from the checkpoint path at /content/pretrained.ckpt
DEBUG:fsspec.local:open file: /content/pretrained.ckpt
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py:1659: UserWarning: Be aware that when using ckpt_path, callbacks used to create the checkpoint need to be provided during Trainer instantiation. Please add the following callbacks: ["ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 10, 'train_time_interval': None, 'save_on_train_epoch_end': True}"].
rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
2024-07-28 00:03:02.795078: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-28 00:03:02.795145: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-28 00:03:02.796937: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-28 00:03:02.804867: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
DEBUG:tensorflow:Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
2024-07-28 00:03:03.930937: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
DEBUG:jax._src.path:etils.epath found. Using etils.epath for file I/O.
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
DEBUG:fsspec.local:open file: /content/drive/MyDrive/colab/piper/Jarvis/lightning_logs/version_7/hparams.yaml
Restored all states from the checkpoint file at /content/pretrained.ckpt
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/data.py:153: UserWarning: Total length of DataLoader across ranks is zero. Please make sure this was your intention.
rank_zero_warn(
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/data.py:122: UserWarning: DataLoader returned 0 length. Please make sure this was your intention.
rank_zero_warn(
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/data.py:153: UserWarning: Total length of CombinedLoader across ranks is zero. Please make sure this was your intention.
rank_zero_warn(
Trainer.fit stopped: No training batches.
And my settings are:
Settings
get_ipython().system(f'''
python -m piper_train \
--max-phoneme-ids 600 \
--dataset-dir "{output_dir}" \
--accelerator 'gpu' \
--devices 1 \
--batch-size {batch_size} \
--validation-split {validation_split} \
--num-test-examples {num_test_examples} \
--quality {quality} \
--checkpoint-epochs {checkpoint_epochs} \
--num_ckpt {num_ckpt} \
{save_last_command}\
--log_every_n_steps {log_every_n_steps} \
--max_epochs {max_epochs} \
{ft_command}\
--precision 32
''')
with batch_size = 6, validation_split = 0, num_test_examples = 0, quality = medium, checkpoint_epochs = 5, num_ckpt = 0, log_every_n_steps = 1000, and max_epochs = 10000.
I lowered the batch size and set --max-phoneme-ids because I was getting out-of-memory errors like this issue: https://github.com/rhasspy/piper/issues/8
1.zip - this is a copy of the voice file I am using. I had to duplicate this file 5 times and put the copies in one zip to get past the "no utterances" error in pre-processing (this issue: https://github.com/rhasspy/piper/issues/297), so the actual WAVs zip is 5 copies of this file.
Transcript.txt - this is the transcript I am using.
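(An aside, not from the thread: in the output above, the key lines are "WARNING:vits.dataset:Skipped 5 utterance(s)" followed by "Total length of DataLoader across ranks is zero", i.e. every utterance was dropped while the dataset loaded, so Lightning had nothing to train on. The snippet below is a minimal sketch of that Lightning behaviour, handing Trainer.fit an empty DataLoader; all the names are illustrative and none of it is piper code.)

# Minimal sketch (illustrative only, not piper code): an empty train DataLoader
# produces the same "zero length" warnings and then the run stops with
# "`Trainer.fit` stopped: No training batches.", at least on the Lightning 1.x
# versions shown in the log above.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

# Zero samples survive "filtering", mirroring the skipped utterances above.
empty = TensorDataset(torch.empty(0, 1), torch.empty(0, 1))
loader = DataLoader(empty, batch_size=6)

trainer = pl.Trainer(max_epochs=1, logger=False, enable_checkpointing=False)
trainer.fit(TinyModel(), train_dataloaders=loader)  # -> "No training batches"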
Hi,
That dataset is too short. Also, it contains repeated transcripts.
Hi,
Thank you :) I assumed that I could repeat the dataset to fill it out as it were, since the machine shouldn't really be able to tell the difference, but I suppose I was wrong. I think it would be helpful to put a bit more information in the shared notebook - namely the limits and requirements:
- At least 5 WAV files are required.
- All WAV files need to be different voice clips.
- The notebook does mention at least 5 minutes of audio, but it should also mention the minimum and maximum length for each clip, and the relationship between those and workers/batch sizes.
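(Not from the thread, but a rough pre-flight check along those lines. It is only a sketch: it assumes plain 16-bit PCM WAV clips in a single directory, and the 5-file / 5-minute thresholds are just the numbers discussed above.)

# Sketch: basic dataset checks before running piper preprocessing.
# Assumes PCM WAV clips in one directory; the directory path is an example.
import hashlib
import wave
from pathlib import Path

wav_dir = Path("/content/wavs")  # hypothetical location of the training clips

durations = {}
hashes = set()
for wav_path in sorted(wav_dir.glob("*.wav")):
    with wave.open(str(wav_path), "rb") as wf:
        durations[wav_path.name] = wf.getnframes() / wf.getframerate()
    hashes.add(hashlib.sha256(wav_path.read_bytes()).hexdigest())

total_minutes = sum(durations.values()) / 60
print(f"{len(durations)} clips, {len(hashes)} unique, {total_minutes:.1f} minutes of audio")

if len(durations) < 5:
    print("Fewer than 5 clips: preprocessing may reject the dataset.")
if len(hashes) < len(durations):
    print("Duplicated clips detected: repeating the same audio does not add information.")
if total_minutes < 5:
    print("Less than ~5 minutes of audio: likely too short to train on.")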