TensorFlowASR save h5 file error

Jan 27 '22 10:01 scshtyk

After an epoch finishes training, an error occurs when the h5 file is auto-saved:

Traceback (most recent call last):
  File "examples/conformer/train.py", line 148, in <module>
    conformer.fit(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1229, in fit
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/callbacks.py", line 435, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/callbacks.py", line 1369, in on_epoch_end
    self._save_model(epoch=epoch, logs=logs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/callbacks.py", line 1443, in _save_model
    raise e
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/callbacks.py", line 1430, in _save_model
    self.model.save_weights(
  File "/home/TensorFlowASR/tensorflow_asr/models/base_model.py", line 47, in save_weights
    super().save_weights(filepath=path, overwrite=overwrite, save_format=save_format, options=options)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 2217, in save_weights
    with h5py.File(filepath, 'w') as f:
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/files.py", line 424, in __init__
    fid = make_fid(name, mode, userblock_size,
  File "/usr/local/lib/python3.8/dist-packages/h5py/_hl/files.py", line 196, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 116, in h5py.h5f.create
OSError: Unable to create file (unable to open file: name = './local/conformer/checkpoints/01.h5', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 242)

Jan 27 '22 10:01 scshtyk

I didn't change any code, I just ran examples/conformer/train.py

Jan 27 '22 10:01 scshtyk

Also, is it normal to still have a loss of 300+ after training for one epoch?

Jan 27 '22 11:01 scshtyk

I found that the demo did not create the "./local/conformer/checkpoints" folder automatically. After I created it manually, saving worked.
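The manual fix described above can also be done from Python before training starts. This is only a minimal sketch: the helper name `ensure_checkpoint_dir` is hypothetical and not part of TensorFlowASR, and it assumes the directory part of the checkpoint path contains no format fields such as `{epoch:02d}` (true for the path in the error message).

```python
import os


def ensure_checkpoint_dir(filepath_template: str) -> str:
    """Create the parent directory of a checkpoint path template if it is missing.

    Hypothetical helper, not part of TensorFlowASR. Returns the directory path
    (an empty string if the template has no directory part).
    """
    # Only the file name is expected to contain "{epoch:02d}", so the
    # directory part can be used as-is.
    directory = os.path.dirname(filepath_template)
    if directory:
        # exist_ok=True makes the call safe to repeat on later runs.
        os.makedirs(directory, exist_ok=True)
    return directory


if __name__ == "__main__":
    # Path taken from the error message in this thread.
    ensure_checkpoint_dir("./local/conformer/checkpoints/{epoch:02d}.h5")
```

Calling this once before `conformer.fit(...)` should avoid the `No such file or directory` error, because h5py can create the file but not its missing parent directories.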

Jan 27 '22 12:01 scshtyk

I didn't change any code, I just ran examples/conformer/train.py

hi scshtyk:

How do you set the parameters of your train.py?

Best Regards. Rick.

Jan 28 '22 03:01 RickChou

@RickChou

Hi Rick,

The command I use to run this file is:

python examples/conformer/train.py --subwords

The subwords file was created by running generate_vocab_subwords.py.

The config.yml is:

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_frame: False

decoder_config:
  vocabulary: ./vocabularies/librispeech/librispeech_clean100.subwords
  target_vocab_size: 1000
  max_subword_length: 10
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - ./trans.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  prejoint_linear: True
  joint_activation: tanh
  joint_mode: add

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      feature_augment:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - ./trans.tsv
    tfrecords_dir: /mnt/Data/MLDL/Datasets/ASR/Raw/LibriSpeech/tfrecords_1030
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: train

  eval_dataset_config:
    use_tf: True
    data_paths:
      - ./dev_transcripts.tsv
    tfrecords_dir: /mnt/Data/MLDL/Datasets/ASR/Raw/LibriSpeech/tfrecords_1030
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: eval

  test_dataset_config:
    use_tf: True
    data_paths:
      - ./test_transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: test

  optimizer_config:
    warmup_steps: 40000
    beta_1: 0.9
    beta_2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 2
    num_epochs: 50
    checkpoint:
      filepath: ./local/conformer/checkpoints/{epoch:02d}.h5
      save_best_only: False
      save_weights_only: True
      save_freq: epoch
    states_dir: ./local/conformer/states
    tensorboard:
      log_dir: ./local/conformer/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: epoch
      profile_batch: 2

I found an old issue, https://github.com/TensorSpeech/TensorFlowASR/issues/33, and I think the high loss may be caused by the batch size being too small. So I changed the command to:

python examples/conformer/train.py --bs 3 --mxp --devices 0 1 2 3 4 5 6 7

The loss dropped to "loss: 15.7520 - val_loss: 6.5997", and the test result is:

INFO:tensorflow:greedy_wer: 0.5681679844856262
INFO:tensorflow:greedy_cer: 0.39872127771377563
INFO:tensorflow:beamsearch_wer: 1.0
INFO:tensorflow:beamsearch_cer: 1.0

Is there anything else wrong with the parameters?
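For readers interpreting the numbers above, here is a sketch of the textbook definition of word and character error rate: edit distance between hypothesis and reference, divided by the reference length. The function names are hypothetical, and this is not claimed to be how TensorFlowASR computes its `greedy_wer` or `beamsearch_wer` values.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Total edit distance over total reference length, at word or character level."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return errors / total


if __name__ == "__main__":
    print(error_rate(["the cat sat"], ["the cat sit"], unit="word"))
    print(error_rate(["the cat sat"], ["the cat sit"], unit="char"))
```

Under this definition, a rate of 1.0 means the number of errors equals the number of reference words or characters. The thread does not establish why the beam search results reached that value.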

Best Regards. tyk.

Jan 29 '22 03:01 scshtyk

hi Rick

I can train with 4 GPU cards and the default params, but the test accuracy is still a problem. Unfortunately, I have not encountered this error. Does it occur at the beginning of training?

Best Regards. tyk.

------------------ Original Message ------------------
From: "TensorSpeech/TensorFlowASR"
Sent: Monday, February 7, 2022, 2:17 PM
Subject: Re: [TensorSpeech/TensorFlowASR] save h5 file error (Issue #243)

hi @scshtyk Were you able to train it successfully? I got the following error. Have you run into it and solved it?

Traceback (most recent call last):
  File "train.py", line 127, in <module>
    optimizer = tf.keras.optimizers.Adam(TransformerSchedule(d_model=conformer.dmodel,warmup_steps=config.learning_config.optimizer_config.pop("warmup_steps", 10000),max_lr=(0.05 / math.sqrt(conformer.dmodel))),**config.learning_config.optimizer_config)
  File "C:\Users\User.conda\envs\python38\lib\site-packages\tensorflow\python\keras\optimizer_v2\adam.py", line 112, in __init__
    super(Adam, self).__init__(name, **kwargs)
  File "C:\Users\User.conda\envs\python38\lib\site-packages\tensorflow\python\keras\optimizer_v2\optimizer_v2.py", line 368, in __init__
    raise TypeError("Unexpected keyword argument "
TypeError: Unexpected keyword argument passed to optimizer: beta1

Best Regards. Rick.

Feb 07 '22 09:02 scshtyk

@scshtyk I can train successfully now. The problem was that I was using an old config with the parameter name beta1, which the optimizer does not accept. After changing it to beta_1, training works. Thanks for your reply.
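If an older config cannot be edited directly, the rename described above could be applied in code before the optimizer is built. This is a hedged sketch: `normalize_adam_kwargs` is a hypothetical helper, and mapping `beta2` to `beta_2` is an assumption by analogy, since the thread only reports `beta1` failing.

```python
def normalize_adam_kwargs(optimizer_config: dict) -> dict:
    """Return a copy of the config with legacy Adam keys renamed.

    Hypothetical helper. tf.keras.optimizers.Adam expects beta_1 and beta_2;
    the thread confirms only that beta1 is rejected, so the beta2 entry is
    an assumption.
    """
    legacy_names = {"beta1": "beta_1", "beta2": "beta_2"}
    # Keys that are not legacy names (for example epsilon) pass through unchanged.
    return {legacy_names.get(key, key): value
            for key, value in optimizer_config.items()}


if __name__ == "__main__":
    old_config = {"beta1": 0.9, "beta2": 0.98, "epsilon": 1e-9}
    print(normalize_adam_kwargs(old_config))
```

The returned dictionary can then be unpacked into the optimizer with `**`, in the same way train.py unpacks `optimizer_config` in the traceback above.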

Best Regards. Rick.

Feb 07 '22 11:02 RickChou

Sorry, I need to let my training run for a while before I can discuss your question with you.

Best Regards. Rick.

Feb 07 '22 11:02 RickChou

Also, is it normal to still have a loss of 300+ after training for one epoch?

Yes. We usually train for 20 epochs to reach a loss of around 5 to 9.

Sep 02 '22 04:09 nglehuy

I found that the demo did not create the "./local/conformer/checkpoints" folder automatically. After I created it manually, saving worked.

I’ll check the code that is supposed to create the directory automatically. In the meantime, to make sure it works, create the directories first and set up all the files and paths correctly before training. I’ll close the issue here. Feel free to reopen it if you have further questions.

Sep 02 '22 05:09 nglehuy
