Bug: Training Always Stuck on Test Epoch 0
Describe the bug
Training gets stuck at test epoch 0 (the "Testing model" phase). I was training on the Common Voice dataset in Google Colab. After training finished, the run always hung at "Test epoch", step 0. I reran it from scratch in a fresh Colab session and it hung at the same point. I also waited 2 hours without seeing any progress: it never reached the "WER" part.
To Reproduce
Steps to reproduce the behavior:
- Run the following command
!python3 -m coqui_stt_training.train --train_cudnn true --n_hidden 2048 --epochs 30 \
--export_dir /content/models \
--checkpoint_dir /content/model_checkpoints \
--train_files [train_file] \
--dev_files [dev_file] \
--test_files [test_file] \
--learning_rate 0.0001 --train_batch_size 128 --test_batch_size 128 --dev_batch_size 128
- Wait
- See the run hang at "Test epoch", step 0
Expected behavior
After training finishes, the test epoch should run to completion without problems.
Environment (please complete the following information):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
- TensorFlow installed from (our builds, or upstream TensorFlow): 1.15.4
- TensorFlow version (use command below): 1.15.4
- Python version: 3.7.13
- CUDA/cuDNN version: 10.0
- GPU model and memory: Tesla T4 (Google Colab)
- Exact command to reproduce:
!python3 -m coqui_stt_training.train --train_cudnn true --n_hidden 2048 --epochs 30 \
--export_dir /content/models \
--checkpoint_dir /content/model_checkpoints \
--train_files [train_file] \
--dev_files [dev_file] \
--test_files [test_file] \
--learning_rate 0.0001 --train_batch_size 128 --test_batch_size 128 --dev_batch_size 128
Additional context
The Common Voice dataset had already been imported with import_cv2.py before training.
--export_dir should only be used with coqui_stt_training.export. Train without this flag and use it only when you want to export. If you only want to test your models, use coqui_stt_training.evaluate (again without --export_dir).
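The advice above could be sketched as three separate invocations: train, optionally re-test, then export. This is a rough sketch, not a verified fix for the hang. The paths and the [train_file], [dev_file] and [test_file] placeholders are copied from the report. The run helper is hypothetical and not part of Coqui STT: it prints each command by default so the flags can be checked before anything is launched. Repeating --n_hidden 2048 for the evaluate and export steps is an assumption that the model geometry has to match the checkpoint; some versions may pick it up from the checkpoint directory instead.

```shell
#!/bin/sh
# Hypothetical helper (not part of Coqui STT): prints each command by default
# and executes it only when the environment variable RUN is set to 1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Paths taken from the report above.
CHECKPOINT_DIR=/content/model_checkpoints
EXPORT_DIR=/content/models

# 1. Train: the command from the report, minus --export_dir.
run python3 -m coqui_stt_training.train --train_cudnn true --n_hidden 2048 --epochs 30 \
    --checkpoint_dir "$CHECKPOINT_DIR" \
    --train_files "[train_file]" \
    --dev_files "[dev_file]" \
    --test_files "[test_file]" \
    --learning_rate 0.0001 --train_batch_size 128 --test_batch_size 128 --dev_batch_size 128

# 2. Optional: test an already trained checkpoint on its own, again without --export_dir.
run python3 -m coqui_stt_training.evaluate --n_hidden 2048 \
    --checkpoint_dir "$CHECKPOINT_DIR" \
    --test_files "[test_file]" \
    --test_batch_size 128

# 3. Export: the only step that takes --export_dir.
run python3 -m coqui_stt_training.export --n_hidden 2048 \
    --checkpoint_dir "$CHECKPOINT_DIR" \
    --export_dir "$EXPORT_DIR"
```

Running the script as-is only prints the three commands. Setting RUN=1 executes them in order. In a Colab cell, each printed command can instead be pasted after a leading !.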