TensorFlowASR
save h5 file error
When I finish training an epoch, it has a bug when it auto-saves the h5 file:
Traceback (most recent call last):
File "examples/conformer/train.py", line 148, in
I didn't change any code; I just ran examples/conformer/train.py.
and is it normal to train an epoch and still have a loss of 300+?
I found that the demo did not create the "./local/conformer/checkpoints" folder automatically, I created it manually and then it worked
hi scshtyk:
How do you set the parameters of your train.py?
Best Regards. Rick.
@RickChou
Hi Rick: The command I use to run this file is "python examples/conformer/train.py --subwords". The subwords file was created by running generate_vocab_subwords.py.
The config.yml is:

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_frame: False

decoder_config:
  vocabulary: ./vocabularies/librispeech/librispeech_clean100.subwords
  target_vocab_size: 1000
  max_subword_length: 10
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - ./trans.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  prejoint_linear: True
  joint_activation: tanh
  joint_mode: add

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      feature_augment:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - ./trans.tsv
    tfrecords_dir: /mnt/Data/MLDL/Datasets/ASR/Raw/LibriSpeech/tfrecords_1030
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: train

  eval_dataset_config:
    use_tf: True
    data_paths:
      - ./dev_transcripts.tsv
    tfrecords_dir: /mnt/Data/MLDL/Datasets/ASR/Raw/LibriSpeech/tfrecords_1030
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: eval

  test_dataset_config:
    use_tf: True
    data_paths:
      - ./test_transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: test

  optimizer_config:
    warmup_steps: 40000
    beta_1: 0.9
    beta_2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 2
    num_epochs: 50
    checkpoint:
      filepath: ./local/conformer/checkpoints/{epoch:02d}.h5
      save_best_only: False
      save_weights_only: True
      save_freq: epoch
    states_dir: ./local/conformer/states
    tensorboard:
      log_dir: ./local/conformer/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: epoch
      profile_batch: 2
I found an old issue: https://github.com/TensorSpeech/TensorFlowASR/issues/33 and I think the high loss may be caused by the batch size being too small. So I changed the command to "python examples/conformer/train.py --bs 3 --mxp --devices 0 1 2 3 4 5 6 7". The loss dropped to "loss: 15.7520 - val_loss: 6.5997", and the test result is:
INFO:tensorflow:greedy_wer: 0.5681679844856262
INFO:tensorflow:greedy_cer: 0.39872127771377563
INFO:tensorflow:beamsearch_wer: 1.0
INFO:tensorflow:beamsearch_cer: 1.0
Is there anything else wrong with the parameters?
Best Regards. tyk.
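As a rough illustration of why the second command changes more than the "--bs" value: assuming the per-device batch size is multiplied by the number of devices under synchronous data-parallel training (an assumption about how the training script scales the batch, not something confirmed in this thread), the effective batch size can be sketched as below. The function name is hypothetical.

```python
def global_batch_size(per_replica_batch_size: int, num_replicas: int) -> int:
    """Effective batch size per optimizer step under data parallelism.

    Assumption: every replica processes `per_replica_batch_size` examples
    per step and the gradients are aggregated across all replicas.
    """
    if per_replica_batch_size < 1 or num_replicas < 1:
        raise ValueError("both arguments must be positive integers")
    return per_replica_batch_size * num_replicas


# batch_size: 2 from the config.yml above, on a single device
single_device = global_batch_size(2, 1)
# --bs 3 --devices 0 1 2 3 4 5 6 7
eight_devices = global_batch_size(3, 8)
print(single_device, eight_devices)
```

Under that assumption, the second run takes far larger steps per update than the first, which would be consistent with the loss dropping much faster.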
hi Rick
I can train with 4 GPU cards and the default params, but the test accuracy is still a problem. Unfortunately, I have not encountered this error. Does it occur at the beginning of training?
Best Regards. tyk.
------------------ Original Message ------------------ From: "TensorSpeech/TensorFlowASR" @.>; Sent: Monday, February 7, 2022, 2:17 PM @.>; @.@.>; Subject: Re: [TensorSpeech/TensorFlowASR] save h5 file error (Issue #243)
hi @scshtyk Can you successfully train it? I got the following error, I wonder if you have solved it?
Traceback (most recent call last):
  File "train.py", line 127, in <module>
    optimizer = tf.keras.optimizers.Adam(TransformerSchedule(d_model=conformer.dmodel,warmup_steps=config.learning_config.optimizer_config.pop("warmup_steps", 10000),max_lr=(0.05 / math.sqrt(conformer.dmodel))),**config.learning_config.optimizer_config)
  File "C:\Users\User.conda\envs\python38\lib\site-packages\tensorflow\python\keras\optimizer_v2\adam.py", line 112, in __init__
    super(Adam, self).__init__(name, **kwargs)
  File "C:\Users\User.conda\envs\python38\lib\site-packages\tensorflow\python\keras\optimizer_v2\optimizer_v2.py", line 368, in __init__
    raise TypeError("Unexpected keyword argument "
TypeError: Unexpected keyword argument passed to optimizer: beta1
Best Regards. Rick.
@scshtyk I can train successfully now. The problem was that I was using an old config that names the parameter beta1, which the optimizer does not accept. Changing it to beta_1 fixed the error. Thanks for your reply.
Best Regards. Rick.
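For anyone else hitting "Unexpected keyword argument passed to optimizer: beta1" with an older config, one option is to rename the keys before they reach the optimizer. The following is a minimal sketch, not code from the repository: the helper name is hypothetical, and the "beta2" mapping is an assumption added for symmetry (only "beta1" appears in the traceback above).

```python
def normalize_optimizer_config(optimizer_config: dict) -> dict:
    """Return a copy of the config with legacy key names renamed.

    Assumption: old configs spell the Adam parameters as beta1/beta2,
    while tf.keras.optimizers.Adam expects beta_1/beta_2.
    """
    legacy_to_current = {"beta1": "beta_1", "beta2": "beta_2"}
    normalized = {}
    for key, value in optimizer_config.items():
        new_key = legacy_to_current.get(key, key)
        if new_key in normalized:
            # Both spellings present: refuse to guess which one wins.
            raise ValueError(f"duplicate optimizer setting: {new_key}")
        normalized[new_key] = value
    return normalized


old_style = {"warmup_steps": 40000, "beta1": 0.9, "beta2": 0.98, "epsilon": 1e-9}
print(normalize_optimizer_config(old_style))
```

Editing config.yml directly, as described above, is the simpler fix; a helper like this is only useful if several old configs have to keep working.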
When I finish training an epoch, it has a bug when it auto-saves the h5 file:
Traceback (most recent call last):
  File "examples/conformer/train.py", line 148, in <module>
    conformer.fit(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1229, in fit
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/callbacks.py", line 435, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/callbacks.py", line 1369, in on_epoch_end
    self._save_model(epoch=epoch, logs=logs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/callbacks.py", line 1443, in _save_model
    raise e
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/callbacks.py", line 1430, in _save_model
    self.model.save_weights(
  File "/home/TensorFlowASR/tensorflow_asr/models/base_model.py", line 47, in save_weights
    super().save_weights(filepath=path, overwrite=overwrite, save_format=save_format, options=options)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 2217, in save_weights
    with h5py.File(filepath, 'w') as f:
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/files.py", line 424, in __init__
    fid = make_fid(name, mode, userblock_size,
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/files.py", line 196, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 116, in h5py.h5f.create
OSError: Unable to create file (unable to open file: name = './local/conformer/checkpoints/01.h5', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 242)
Sorry, I need to let my own training run for a while before I can discuss your question with you.
Best Regards. Rick.
and is it normal to train an epoch and still have a loss of 300+?
Yes. We usually train for 20 epochs to reach a loss of around 5 to 9.
I found that the demo did not create the "./local/conformer/checkpoints" folder automatically, I created it manually and then it worked
I'll check the code that creates the directory automatically. In the meantime, to make sure it works, create the directories first and set up all the files and paths correctly before training. I'll close the issue here. Feel free to reopen it if you have further questions.
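The workaround of creating the directories before training can be sketched in a few lines. This is a minimal sketch using only the standard library, not the fix that went into the repository; the helper name is hypothetical, and the demo writes under a temporary directory rather than ./local so it is safe to run anywhere.

```python
import os
import tempfile


def ensure_parent_dir(filepath: str) -> str:
    """Create the directory that will contain `filepath`, if it is missing.

    Returns the parent directory ("" when the path has no directory part).
    Placeholders such as {epoch:02d} in the file name are harmless here,
    because only the directory part of the path is created.
    """
    parent = os.path.dirname(filepath)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return parent


with tempfile.TemporaryDirectory() as root:
    # Same layout as the checkpoint filepath in the config, under a temp root.
    checkpoint_pattern = os.path.join(
        root, "local", "conformer", "checkpoints", "{epoch:02d}.h5"
    )
    created = ensure_parent_dir(checkpoint_pattern)
    print(os.path.isdir(created))
```

Calling such a helper on the checkpoint filepath before calling fit avoids the "No such file or directory" error from h5py, since h5py creates the file but not its parent folders.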