Bug: Training Always Stuck on Test Epoch 0

Open • ChainofChaos opened this issue 2 years ago • 1 comment

Describe the bug Training hangs at test epoch 0 (the "Testing model" phase). I was training on the Common Voice dataset in Google Colab. Once training finishes, the run always stalls at "Test epoch" with steps: 0. I reran everything from scratch in a fresh Colab session and got the same result. I also waited 2 hours and saw no progress: the run never reaches the "WER" part and stays at step 0.

To Reproduce Steps to reproduce the behavior:

  1. Run the following command
!python3 -m coqui_stt_training.train --train_cudnn true --n_hidden 2048 --epochs 30 \
      --export_dir /content/models \
      --checkpoint_dir /content/model_checkpoints \
      --train_files [train_file]\
      --dev_files [dev_file] \
      --test_files [test_file] \
      --learning_rate 0.0001 --train_batch_size 128 --test_batch_size 128 --dev_batch_size 128
  2. Wait for training to finish
  3. See the run hang at "Test epoch", steps: 0

Expected behavior Once training finishes, the test epoch should proceed without problems.

Environment (please complete the following information):

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
  • TensorFlow installed from (our builds, or upstream TensorFlow): 1.15.4
  • TensorFlow version (use command below): 1.15.4
  • Python version: 3.7.13
  • CUDA/cuDNN version: 10.0
  • GPU model and memory: Tesla T4 (Google Colab)
  • Exact command to reproduce:
!python3 -m coqui_stt_training.train --train_cudnn true --n_hidden 2048 --epochs 30 \
      --export_dir /content/models \
      --checkpoint_dir /content/model_checkpoints \
      --train_files [train_file]\
      --dev_files [dev_file] \
      --test_files [test_file] \
      --learning_rate 0.0001 --train_batch_size 128 --test_batch_size 128 --dev_batch_size 128

Additional context The Common Voice dataset had already been imported with import_cv2.py before training.

ChainofChaos Jun 24 '22 16:06

--export_dir should only be used with coqui_stt_training.export. Train without this flag, and pass it only when you want to export. If you only want to test your model, use coqui_stt_training.evaluate (again without --export_dir).

wasertech avatar Jul 05 '22 11:07 wasertech
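
The suggestion above amounts to splitting the single command from the report into three separate invocations: train, evaluate, export. The following is a minimal sketch of that split, not a verified fix for the hang. The CSV paths are hypothetical placeholders standing in for [train_file], [dev_file] and [test_file]. The `run` helper and the `*_step` function names are illustrative only. Passing the same --n_hidden to the evaluate and export steps is an assumption, made so that the checkpoint geometry matches the trained model. The script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: run training, evaluation, and export as three separate steps.
# The CSV paths below are hypothetical placeholders; substitute your own.
TRAIN_CSV="${TRAIN_CSV:-/content/train.csv}"
DEV_CSV="${DEV_CSV:-/content/dev.csv}"
TEST_CSV="${TEST_CSV:-/content/test.csv}"
CKPT_DIR="${CKPT_DIR:-/content/model_checkpoints}"
EXPORT_DIR="${EXPORT_DIR:-/content/models}"

# Print each command; execute it only when RUN=1 (dry run by default).
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

# Step 1: train only. Note that --export_dir is not passed here.
train_step() {
  run python3 -m coqui_stt_training.train \
    --train_cudnn true --n_hidden 2048 --epochs 30 \
    --checkpoint_dir "$CKPT_DIR" \
    --train_files "$TRAIN_CSV" \
    --dev_files "$DEV_CSV" \
    --learning_rate 0.0001 --train_batch_size 128 --dev_batch_size 128
}

# Step 2: evaluate the saved checkpoint on the test set.
# Assumption: --n_hidden must match the value used for training.
evaluate_step() {
  run python3 -m coqui_stt_training.evaluate \
    --n_hidden 2048 \
    --checkpoint_dir "$CKPT_DIR" \
    --test_files "$TEST_CSV" \
    --test_batch_size 128
}

# Step 3: export. This is the only step that takes --export_dir.
export_step() {
  run python3 -m coqui_stt_training.export \
    --n_hidden 2048 \
    --checkpoint_dir "$CKPT_DIR" \
    --export_dir "$EXPORT_DIR"
}

# Stop at the first step that fails.
train_step && evaluate_step && export_step
```

Run as-is, the script prints the three commands for inspection. Setting RUN=1 executes them in order; in Colab the script could go in a %%bash cell.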