
Bug: Transfer Learning + Plateau Detection Feature results in Tensor shape error (wrong language checkpoint loaded).

HarikalarKutusu opened this issue 2 years ago • 0 comments

Describe the bug When using transfer learning from English checkpoints together with the plateau detection feature, reload_best_checkpoint calls the loading logic with allow_drop_layers=False but looks at the original English checkpoint directory, so it tries to reload the English-alphabet output layer, which results in a Tensor shape error.

To Reproduce Steps to reproduce the behavior:

  1. I trained on the Common Voice v8.0 Russian dataset from the Coqui v1.0.0 English checkpoints (in Google Colab):
!python -m coqui_stt_training.train \
  --show_progressbar true \
  --train_cudnn true \
  --force_initialize_learning_rate true \
  --epochs 300 \
  --early_stop true \
  --es_epochs 10 \
  --learning_rate 0.001 \
  --reduce_lr_on_plateau true \
  --plateau_epochs 5 \
  --plateau_reduction 0.1 \
  --dropout_rate 0.25 \
  --max_to_keep 1 \
  --drop_source_layers 2 \
  --train_batch_size 128 \
  --dev_batch_size 128 \
  --augment "frequency_mask[p=0.8,n=3:5,size=2:4]" "time_mask[p=0.8,domain=spectrogram,n=3:5,size=10:200]" \
  --alphabet_config_path /content/drive/MyDrive/cv-datasets/ru/alphabet.txt \
  --load_checkpoint_dir /content/drive/MyDrive/cv-datasets/en/coqui-stt-1.0.0-checkpoint \
  --save_checkpoint_dir /content/data/ru/v8.0-r2b/checkpoints \
  --summary_dir /content/data/ru/v8.0-r2b/summary \
  --train_files /content/data/ru/v8.0/clips/train.csv \
  --dev_files /content/data/ru/v8.0/clips/dev.csv
  2. The run worked fine for a couple of epochs, then the plateau was detected.

  3. When it reached the reduce-LR step, it errored out:

--------------------------------------------------------------------------------
Epoch 8 |   Training | Elapsed Time: 0:04:37 | Steps: 166 | Loss: 25.475381     
Epoch 8 | Validation | Elapsed Time: 0:01:07 | Steps: 72 | Loss: 45.444256 | Dataset: /content/data/ru/v8.0/clips/dev.csv
I Loading best validating checkpoint from /content/drive/MyDrive/cv-datasets/en/coqui-stt-1.0.0-checkpoint/best_dev-3663881
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/coqui_stt_training/train.py", line 687, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/coqui_stt_training/train.py", line 657, in main
    train()
  File "/usr/local/lib/python3.7/dist-packages/coqui_stt_training/train.py", line 599, in train
    reload_best_checkpoint(session)
  File "/usr/local/lib/python3.7/dist-packages/coqui_stt_training/util/checkpoints.py", line 166, in reload_best_checkpoint
    _load_or_init_impl(session, ["best"], allow_drop_layers=False, allow_lr_init=False)
  File "/usr/local/lib/python3.7/dist-packages/coqui_stt_training/util/checkpoints.py", line 132, in _load_or_init_impl
    silent=silent,
  File "/usr/local/lib/python3.7/dist-packages/coqui_stt_training/util/checkpoints.py", line 90, in _load_checkpoint
    v.load(ckpt.get_tensor(v.op.name), session=session)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/ops/variables.py", line 1033, in load
    session.run(self.initializer, {self.initializer.inputs[1]: value})
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1156, in _run
    (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (29,) for Tensor 'layer_6/bias/Initializer/zeros:0', which has shape '(35,)'
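The two shapes in the error line up with the two alphabets: the final layer has one logit per alphabet symbol plus one CTC blank. A quick sanity check (the symbol counts below are assumptions for illustration, not read from the actual alphabet.txt files):

```python
def output_layer_size(num_symbols):
    # CTC-style output layer: one logit per alphabet symbol, plus the blank label.
    return num_symbols + 1

# Assumed symbol counts (hypothetical, not taken from the real alphabet files):
ENGLISH_SYMBOLS = 28   # a-z, apostrophe, space
RUSSIAN_SYMBOLS = 34   # 33 Cyrillic letters plus space

print(output_layer_size(ENGLISH_SYMBOLS))  # 29 -> the shape of the value being fed
print(output_layer_size(RUSSIAN_SYMBOLS))  # 35 -> the shape layer_6/bias has in the graph
```

So the (29,) tensor is the English checkpoint's output bias being fed into the Russian-sized (35,) variable.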

Expected behavior It should reload from the target language's current best checkpoint (the one under save_checkpoint_dir), not the original English one.

Environment (please complete the following information): Google Colab

Additional context Here is the answer from @reuben in Matrix chat:

I think that's a bug in our code when using both transfer learning to a different checkpoint and the auto LR decay on plateau feature
notice in the stack trace the allow_drop_layers=False parameter in the checkpoint loading logic
it is correct in spirit, we don't want to drop the alphabet layer from the checkpoint being fine tuned
but it looks like it's looking again at the original checkpoint that you started the transfer from
rather than the current one, which is not the right behavior for reload_best_checkpoint
...
I think the proper fix is to parametrize _checkpoint_path_or_none on the root folder as well as on the checkpoint_filename - right now it's always looking at Config.load_checkpoint_dir, but for reload_best_checkpoint we want it to load from Config.save_checkpoint_dir

HarikalarKutusu avatar Feb 06 '22 14:02 HarikalarKutusu