OpenSeq2Seq
Questions about Transfer Learning.
Hello everyone, we have been trying transfer learning recently, and here are a few things we have discovered:
- When we try transfer learning (using the “load_model” parameter in the config file; see the config sketch after the configuration block below), the program takes a long time after the “assign_ops” function (which restores the variables) finishes, and memory usage is also very high. We do not observe this in normal training or when using the “continue_learning” parameter.
- When we try transfer learning (we modified the code to force-restore the two batch-norm variables “bn/moving_mean” and “bn/moving_variance”; a sketch of this modification follows this list), the training loss is normal only at step 0 and explodes after that, even when we set the learning rate to 0 and use the pre-trained model's own training data. We found that this issue only happens when “dtype” is “mixed”; training works normally when it is set to tf.float32. Here are the combinations we have tried:
  - Pre-trained model: mixed -> transfer learning configuration: mixed
  - Pre-trained model: mixed -> transfer learning configuration: tf.float32
  - Pre-trained model: tf.float32 -> transfer learning configuration: tf.float32
  - Pre-trained model: tf.float32 -> transfer learning configuration: mixed
Only the second and third combinations work normally.
- We set the parameter “print_loss_steps” in the config file to 1 and observed the behavior below. At step 0 the program prints 11 different training losses; at step 1 it prints 7; from step 2 onward each step prints only one training loss, as expected. By the way, we use Horovod and “num_gpus” is set to 8.
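A minimal sketch of the force-restore modification mentioned in the second item above (this is our own filtering logic written against the plain TF 1.x checkpoint API, not the stock OpenSeq2Seq helper; `ckpt_dir` is whatever “load_model” points to):

```python
# Minimal sketch (plain TF 1.x, not the stock OpenSeq2Seq helper): build assign
# ops that restore the trainable variables plus the batch-norm
# "bn/moving_mean" and "bn/moving_variance" statistics from the checkpoint.
import tensorflow as tf

def build_restore_ops(ckpt_dir):
    ckpt = tf.train.latest_checkpoint(ckpt_dir)
    ckpt_names = {name for name, _ in tf.train.list_variables(ckpt)}
    trainable = {v.op.name for v in tf.trainable_variables()}
    assign_ops = []
    for var in tf.global_variables():
        name = var.op.name
        force_bn = name.endswith('bn/moving_mean') or name.endswith('bn/moving_variance')
        if name in ckpt_names and (name in trainable or force_bn):
            value = tf.train.load_variable(ckpt, name)
            # cast in case the checkpoint dtype differs from the graph variable
            # dtype (e.g. float32 values assigned into float16 kernels under "mixed")
            assign_ops.append(var.assign(value.astype(var.dtype.base_dtype.as_numpy_dtype)))
    return assign_ops
```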
Configuration:
"random_seed": 0,
"use_horovod": True,
"num_epochs": 1,
"num_gpus": 8,
"batch_size_per_gpu": 16,
"iter_size": 1,
"save_summaries_steps": 100,
"print_loss_steps": 1,
"print_samples_steps": 2200,
"eval_steps": 2200,
"save_checkpoint_steps": 1100,
"num_checkpoints": 1,
Training information:
*** Epoch 0, global step 0: *** Train loss: 3.5081
time per step = 0:24:47.373
*** Sample WER: 0.6000
*** Sample target: when ashur natsir pal died
*** Sample prediction: when ashur natzsir polodied
*** Epoch 0, global step 0: *** Train loss: 4.2053
time per step = 0:00:20.477
*** Epoch 0, global step 0: *** Train loss: 3.2743
time per step = 0:00:7.077
*** Epoch 0, global step 0: *** Train loss: 3.7765
time per step = 0:00:6.780
*** Epoch 0, global step 0: *** Train loss: 1.9787
time per step = 0:00:7.466
*** Epoch 0, global step 0: *** Train loss: 2.4200
time per step = 0:00:8.330
*** Epoch 0, global step 0: *** Train loss: 3.4003
time per step = 0:00:7.882
*** Epoch 0, global step 0: *** Train loss: 3.1898
time per step = 0:00:7.498
*** Epoch 0, global step 0: *** Train loss: 4.4040
time per step = 0:00:9.195
*** Epoch 0, global step 0: *** Train loss: 2.0181
time per step = 0:00:10.196
*** Epoch 0, global step 0: *** Train loss: 1.5686
time per step = 0:00:9.415
*** Epoch 0, global step 0: *** Train loss: 2.5663
time per step = 0:00:9.227
*** Epoch 0, global step 1: *** Train loss: 1261.3043
time per step = 0:00:9.849
*** Epoch 0, global step 1: *** Train loss: 1218.6698
time per step = 0:00:9.150
*** Epoch 0, global step 1: *** Train loss: 1317.1223
time per step = 0:00:9.792
*** Epoch 0, global step 1: *** Train loss: 1286.2400
time per step = 0:00:8.314
*** Epoch 0, global step 1: *** Train loss: 1181.7028
time per step = 0:00:9.601
*** Epoch 0, global step 1: *** Train loss: 1253.2593
time per step = 0:00:9.329
*** Epoch 0, global step 1: *** Train loss: 1203.3721
time per step = 0:00:10.711
*** Epoch 0, global step 2: *** Train loss: 1210.5000
time per step = 0:00:9.804
*** Epoch 0, global step 3: *** Train loss: 1108.2490
time per step = 0:00:9.214
*** Epoch 0, global step 4: *** Train loss: 1072.8215
time per step = 0:00:8.263
- When we run transfer learning with Horovod, the program reports that the variables under “Loss_Optimization” cannot be loaded.
Our conclusion is that transfer learning only works when dtype is tf.float32. Can someone help us explain this situation? Thanks a lot!
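For anyone who wants to compare what the checkpoint stores against the variable lists printed in the logs, a minimal sketch using the standard TF 1.x checkpoint reader (not OpenSeq2Seq code; the directory is the pre-trained logdir from the toy runs below):

```python
# Minimal sketch: list what the pre-trained checkpoint actually stores
# (names, shapes, dtypes), using only the standard TF 1.x checkpoint reader.
import tensorflow as tf

ckpt = tf.train.latest_checkpoint('w2ltestmp')   # pre-trained logdir
reader = tf.train.NewCheckpointReader(ckpt)
for name in sorted(reader.get_variable_to_shape_map()):
    value = reader.get_tensor(name)
    print(name, value.shape, value.dtype)
```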
Can you attach the complete logs for mixed precision, please?
Thanks for replying!
Here is the pre-trained model; we trained it with mixed precision:
[[7269,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: c08cb0a9b3b6
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from scratch
*** Training config:
{'batch_size_per_gpu': 2,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
'input_type': 'logfbank',
'max_duration': 16.7,
'num_audio_features': 64,
'shuffle': True,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.5,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'initializer': <function xavier_initializer at 0x7f4105f5bae8>,
'lm_path': 'language_model/4-gram.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': 'mixed'
'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
'encoder_params': {'activation_fn': <function <lambda> at 0x7f411bbfa7b8>,
'convnet_layers': [{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [2],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [13],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [17],
'num_channels': 96,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [21],
'num_channels': 160,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [25],
'num_channels': 128,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [2],
'dropout_keep_prob': 0.6,
'kernel_size': [29],
'num_channels': 192,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.6,
'kernel_size': [1],
'num_channels': 256,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'}],
'data_format': 'channels_last',
'dropout_keep_prob': 0.7,
'initializer': <function xavier_initializer at 0x7f4105f5bae8>,
'initializer_params': {'uniform': False},
'normalization': 'batch_norm'},
'eval_steps': 50,
'iter_size': 1,
'larc_params': {'larc_eta': 0.001},
'load_model': '',
'logdir': 'w2ltestmp',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
'lr_policy': <function poly_decay at 0x7f4100115378>,
'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0},
'num_checkpoints': 1,
'num_epochs': 600,
'num_gpus': 2,
'optimizer': 'Momentum',
'optimizer_params': {'momentum': 0.9},
'print_loss_steps': 80,
'print_samples_steps': 80,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7f4105edfe18>,
'regularizer_params': {'scale': 0.001},
'save_checkpoint_steps': 50,
'save_summaries_steps': 10,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': True}
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
*** Trainable variables:
*** ForwardPass/w2l_encoder/conv11/kernel:0
*** shape: (11, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/kernel:0
*** shape: (11, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/kernel:0
*** shape: (13, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/kernel:0
*** shape: (17, 64, 96), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/gamma:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/beta:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/kernel:0
*** shape: (21, 96, 160), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/gamma:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/beta:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/kernel:0
*** shape: (25, 160, 128), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/gamma:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/beta:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/kernel:0
*** shape: (29, 128, 192), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/gamma:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/beta:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/kernel:0
*** shape: (1, 192, 256), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/gamma:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/beta:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
*** shape: (256, 29), <dtype: 'float16_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
*** shape: (29,), <dtype: 'float16_ref'>
*** Total trainable parameters: 1853725
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
2019-02-13 01:03:41.034278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 01:03:41.034341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
2019-02-13 01:03:41.618036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 01:03:41.618100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-13 01:03:41.624068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:03:41.624093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 1
2019-02-13 01:03:41.624118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N
2019-02-13 01:03:41.625121: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
2019-02-13 01:03:42.339535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:03:42.339595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-13 01:03:42.339622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-13 01:03:42.340540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
[c08cb0a9b3b6:58974] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:58974] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 0, global step 0: *** Train loss: 946.1537
time per step = 0:00:0.110
*** Sample WER: 4.3333
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: tom h vmv tdsqmi bhxmqditmqydq pmztcf fmx fh djdm m' yc z vhv mscvtmpxuhfhqm u tdhdvpdepdn k u'ephmpdym e vkhcf tziahxmdh dj mnhphusyv'tqma'jq qmtv itda a' vqtpa ' vei gkd th qu r dxv hptqjotmptdkqdnt jtvtipq odtc dhvh t hqpsimtqahyd xstm m'ilx'klqpvhid 'qyt' tq htv q'jmqjc'tqde dliqdtq tmjmbgvc jivtjeuheavmcvsqymdaphqhtdrqnkdh fxudk ncqpdqz snapcvctrbedctd
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 40, global step 80: *** Train loss: 177.6312
time per step = 0:00:0.116
*** Sample WER: 1.0000
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: e a a e
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 80, global step 160: *** Train loss: 99.9032
time per step = 0:00:0.082
*** Sample WER: 1.0000
*** Sample target: there was no autopsy period
*** Sample prediction: nuup
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 120, global step 240: *** Train loss: 110.6862
time per step = 0:00:0.075
*** Sample WER: 0.9167
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: heutti a llt ooneyut i stt neggh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 160, global step 320: *** Train loss: 77.5739
time per step = 0:00:0.080
*** Sample WER: 0.9500
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: ats flull f meenadn n usiess suts s rrsh rrm woorooo rhappppy ooau wwere nnot anuunccmmon s sight
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 200, global step 400: *** Train loss: 114.4641
time per step = 0:00:0.078
*** Sample WER: 1.0000
*** Sample target: volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
*** Sample prediction: oue t ne y to echnge alaleneoudred tan t y o onn gh looa
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 240, global step 480: *** Train loss: 55.4402
time per step = 0:00:0.077
*** Sample WER: 0.6667
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: hhaeptt in a lootoff money buu iss that eenouggh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 280, global step 560: *** Train loss: 67.9414
time per step = 0:00:0.081
*** Sample WER: 0.5833
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: teyaee ut iin a llot o of money but is that eenuughh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 320, global step 640: *** Train loss: 43.4279
time per step = 0:00:0.077
*** Sample WER: 0.5000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: boats ful of men and women n busiins suuits fresh rom work or hpy ourr weere n not an ncommon sight
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 360, global step 720: *** Train loss: 43.1768
time per step = 0:00:0.081
*** Sample WER: 0.7727
*** Sample target: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
*** Sample prediction: mistterr jones will ovveresesee the coman's opereatig uunitsas wewel ass thhe company's rearcch auctilvitities andsaaaf supporttserice the ccompany said
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 400, global step 800: *** Train loss: 59.6981
time per step = 0:00:0.085
*** Sample WER: 0.9091
*** Sample target: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
*** Sample prediction: thte oynnes wilovesee t he compan's' oppertting nitgs sas swewlll as s te companys reseaarch tiv tites nd sstaff support seics the ccommpany said
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 440, global step 880: *** Train loss: 11.7136
time per step = 0:00:0.081
*** Sample WER: 0.4167
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: thhy havve pput in a lot of money buh is that enoughh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 480, global step 960: *** Train loss: 17.9923
time per step = 0:00:0.089
*** Sample WER: 0.2500
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: baats fulll of meen and women in buusines suits fresh from work or hchappy hour were not an uncommon sight
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 520, global step 1040: *** Train loss: 11.5907
time per step = 0:00:0.076
*** Sample WER: 0.4444
*** Sample target: quote there aren't any financial irregularities unquote he says
*** Sample prediction: quote there arent any fiianial ireguularies unquote he saynyss
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 560, global step 1120: *** Train loss: 8.3654
time per step = 0:00:0.084
*** Sample WER: 0.4000
*** Sample target: there was no autopsy period
*** Sample prediction: there waas no autlopsy period
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Finished training
*** Avg time per step: 0.082s
*** Avg objects per second: 31441.706
The final training loss is approximately 8.
- First, we try "Pre-trained model: mixed -> transfer learning configuration: mixed."
Here is the training log:
[[3443,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: c08cb0a9b3b6
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from the base model
*** Training config:
{'batch_size_per_gpu': 2,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
'input_type': 'logfbank',
'max_duration': 16.7,
'num_audio_features': 64,
'shuffle': True,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.5,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'initializer': <function xavier_initializer at 0x7f54d7312ae8>,
'lm_path': 'language_model/4-gram.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': 'mixed',
'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
'encoder_params': {'activation_fn': <function <lambda> at 0x7f54eaf9c7b8>,
'convnet_layers': [{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [2],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [13],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [17],
'num_channels': 96,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [21],
'num_channels': 160,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [25],
'num_channels': 128,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [2],
'dropout_keep_prob': 0.6,
'kernel_size': [29],
'num_channels': 192,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.6,
'kernel_size': [1],
'num_channels': 256,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'}],
'data_format': 'channels_last',
'dropout_keep_prob': 0.7,
'initializer': <function xavier_initializer at 0x7f54d7312ae8>,
'initializer_params': {'uniform': False},
'normalization': 'batch_norm'},
'eval_steps': 50,
'iter_size': 1,
'larc_params': {'larc_eta': 0.001},
'load_model': 'w2ltestmp',
'logdir': 'w2ltestmpTomp',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
'lr_policy': <function poly_decay at 0x7f54d3504378>,
'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0},
'num_checkpoints': 1,
'num_epochs': 600,
'num_gpus': 2,
'optimizer': 'Momentum',
'optimizer_params': {'momentum': 0.9},
'print_loss_steps': 80,
'print_samples_steps': 80,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7f54d72a7e18>,
'regularizer_params': {'scale': 0.001},
'save_checkpoint_steps': 50,
'save_summaries_steps': 10,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': True}
*** Warning: defaulting CTC loss to work in float32
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
*** Building graph in Horovod rank: 0
*** Trainable variables:
*** ForwardPass/w2l_encoder/conv11/kernel:0
*** shape: (11, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/kernel:0
*** shape: (11, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/kernel:0
*** shape: (13, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/kernel:0
*** shape: (17, 64, 96), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/gamma:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/beta:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/kernel:0
*** shape: (21, 96, 160), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/gamma:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/beta:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/kernel:0
*** shape: (25, 160, 128), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/gamma:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/beta:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/kernel:0
*** shape: (29, 128, 192), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/gamma:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/beta:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/kernel:0
*** shape: (1, 192, 256), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/gamma:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/beta:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
*** shape: (256, 29), <dtype: 'float16_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
*** shape: (29,), <dtype: 'float16_ref'>
*** Total trainable parameters: 1853725
Loading the base model from w2ltestmp.
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
SCAFFOLD TYPE: <class 'open_seq2seq.utils.helpers.TransferScaffold'>
2019-02-13 01:09:16.523653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 01:09:16.523747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
2019-02-13 01:09:17.212111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:09:17.212155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 1
2019-02-13 01:09:17.212182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N
2019-02-13 01:09:17.213765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
[c08cb0a9b3b6:63304] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:63304] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 01:09:17.401701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 01:09:17.401742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-13 01:09:18.225671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:09:18.225730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-13 01:09:18.225741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-13 01:09:18.226926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
LOCAL INIT OP name: "group_deps"
op: "NoOp"
input: "^group_deps/NoOp"
input: "^group_deps/NoOp_1"
checkpoint_dir w2ltestmp
checkpoint_filename_with_path None
Restoring only the variables found in the checkpoint
Restoring from the step 1200
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv51/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv11/kernel
Restoring value to ForwardPass/w2l_encoder/conv31/bn/gamma
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/kernel
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/bias
Restoring value to ForwardPass/w2l_encoder/conv71/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv41/kernel
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv71/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv11/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/kernel
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv11/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv41/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv81/kernel
Restoring value to ForwardPass/w2l_encoder/conv51/kernel
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv71/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_variance
assign_ops [<tf.Tensor 'Assign_79:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_80:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_81:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_82:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_83:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_84:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_85:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_86:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_87:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_88:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_89:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_90:0' shape=(256, 29) dtype=float16_ref>, <tf.Tensor 'Assign_91:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_92:0' shape=(25, 160, 128) dtype=float16_ref>, <tf.Tensor 'Assign_93:0' shape=(29,) dtype=float16_ref>, <tf.Tensor 'Assign_94:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_95:0' shape=(17, 64, 96) dtype=float16_ref>, <tf.Tensor 'Assign_96:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_97:0' shape=(29, 128, 192) dtype=float16_ref>, <tf.Tensor 'Assign_98:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_99:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_100:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_101:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_102:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_103:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_104:0' shape=(13, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_105:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_106:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_107:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_108:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_109:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_110:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_111:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_112:0' shape=(1, 192, 256) dtype=float16_ref>, <tf.Tensor 'Assign_113:0' shape=(21, 96, 160) dtype=float16_ref>, <tf.Tensor 'Assign_114:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_115:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_116:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_117:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_118:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_119:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_120:0' shape=(160,) dtype=float32_ref>]
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 0, global step 0: *** Train loss: 13.3622
time per step = 0:00:0.293
*** Sample WER: 0.5000
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: inn the proceess the suits alle assets were oversstated and liailtieessnderstateed
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 40, global step 80: *** Train loss: 184.3359
time per step = 0:00:0.126
*** Sample WER: 1.0000
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: te nir ir
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 80, global step 160: *** Train loss: 108.2366
time per step = 0:00:0.097
*** Sample WER: 1.0000
*** Sample target: there was no autopsy period
*** Sample prediction: e
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 120, global step 240: *** Train loss: 125.5679
time per step = 0:00:0.087
*** Sample WER: 1.0000
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: ptiiaotoomey yt ag
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 160, global step 320: *** Train loss: 95.1825
time per step = 0:00:0.089
*** Sample WER: 1.0000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: aftflul omena woeeunees sutts rrsh wwkk agapyphouuwrwrnot a nnmmom sshht
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 200, global step 400: *** Train loss: 135.1087
time per step = 0:00:0.090
*** Sample WER: 1.0000
*** Sample target: volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
*** Sample prediction: iee n e ewrr soae aagn otllohhuuud a nee ointitmlios h
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 240, global step 480: *** Train loss: 68.7111
time per step = 0:00:0.088
*** Sample WER: 0.8333
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: eve ut in a ltt ooff monne buut tha enoouh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 280, global step 560: *** Train loss: 88.2823
time per step = 0:00:0.097
*** Sample WER: 0.9167
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: y haavvept inna a lttoof moonne buu s hat nooouuhh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 320, global step 640: *** Train loss: 47.8689
time per step = 0:00:0.087
*** Sample WER: 0.7000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: bbatts fulll oofmen and women in bsiness suitsts freesh frommw oork orr happy hour wwere nnot an ncomon sght
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 360, global step 720: *** Train loss: 56.5136
time per step = 0:00:0.089
*** Sample WER: 0.9545
*** Sample target: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
*** Sample prediction: mserr joyynes wl versee tte cmmapany's operrating nits as well as e coopn'ys r rrserac h activies anad sttaf suppott serervices tee ompny ssad
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 400, global step 800: *** Train loss: 71.2173
time per step = 0:00:0.095
*** Sample WER: 0.7727
*** Sample target: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
*** Sample prediction: msr oynes wil oerse thecopany's prrating unts as well as he omany'sresarh acttvvietes and ssttafff supr seerices the ompy sad
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 440, global step 880: *** Train loss: 16.4285
time per step = 0:00:0.086
*** Sample WER: 0.3333
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: y avvve put in a lot of money buut is ththat enough
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 480, global step 960: *** Train loss: 31.0590
time per step = 0:00:0.096
*** Sample WER: 0.2000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: boaoats full of men andd women in businesss suits fresh from work or happy hour were not an uncomoon sight
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 520, global step 1040: *** Train loss: 19.6284
time per step = 0:00:0.087
*** Sample WER: 0.8889
*** Sample target: quote there aren't any financial irregularities unquote he says
*** Sample prediction: uuotte the a ren't anyy ffinannci iirrrggullarities unute he says
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 560, global step 1120: *** Train loss: 8.5642
time per step = 0:00:0.093
*** Sample WER: 0.6000
*** Sample target: there was no autopsy period
*** Sample prediction: tthhere was no auutoopsy peied
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Finished training
*** Avg time per step: 0.093s
*** Avg objects per second: 27684.119
We can observe that the training loss is normal at step 0 and explodes after that. By the way, we use exactly the same dataset for both models, so we suspect the transfer learning is not working.
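To rule out the restore step itself, a minimal sketch of a check that could be run right after the session is created, before the first training step (plain TF 1.x and NumPy; `sess` and the checkpoint directory are assumptions on our side):

```python
# Minimal sketch: confirm the values restored into the graph actually match the
# checkpoint before step 0, so an exploding loss at step 1 cannot be blamed on
# a silently failed restore.
import numpy as np
import tensorflow as tf

def check_restore(sess, ckpt_dir):
    ckpt = tf.train.latest_checkpoint(ckpt_dir)
    ckpt_names = {name for name, _ in tf.train.list_variables(ckpt)}
    for var in tf.global_variables():
        name = var.op.name
        if name not in ckpt_names:
            continue
        graph_value = sess.run(var).astype(np.float32)
        ckpt_value = tf.train.load_variable(ckpt, name).astype(np.float32)
        if not np.allclose(graph_value, ckpt_value, atol=1e-3):
            print('MISMATCH after restore:', name)
```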
- And we run "Pre-trained model: mixed -> transfer learning configuration: tf.float32":
[[62046,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: c08cb0a9b3b6
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from the base model
*** Training config:
{'batch_size_per_gpu': 2,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
'input_type': 'logfbank',
'max_duration': 16.7,
'num_audio_features': 64,
'shuffle': True,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.5,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'initializer': <function xavier_initializer at 0x7fecdeb80ae8>,
'lm_path': 'language_model/4-gram.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': tf.float32,
'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
'encoder_params': {'activation_fn': <function <lambda> at 0x7fecf280d7b8>,
'convnet_layers': [{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [2],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [13],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [17],
'num_channels': 96,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [21],
'num_channels': 160,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [25],
'num_channels': 128,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [2],
'dropout_keep_prob': 0.6,
'kernel_size': [29],
'num_channels': 192,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.6,
'kernel_size': [1],
'num_channels': 256,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'}],
'data_format': 'channels_last',
'dropout_keep_prob': 0.7,
'initializer': <function xavier_initializer at 0x7fecdeb80ae8>,
'initializer_params': {'uniform': False},
'normalization': 'batch_norm'},
'eval_steps': 50,
'iter_size': 1,
'larc_params': {'larc_eta': 0.001},
'load_model': 'w2ltestmp',
'logdir': 'w2ltestmpTofloat',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
'lr_policy': <function poly_decay at 0x7fecdad75378>,
'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0},
'num_checkpoints': 1,
'num_epochs': 600,
'num_gpus': 2,
'optimizer': 'Momentum',
'optimizer_params': {'momentum': 0.9},
'print_loss_steps': 80,
'print_samples_steps': 80,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7fecdeb1ae18>,
'regularizer_params': {'scale': 0.001},
'save_checkpoint_steps': 50,
'save_summaries_steps': 10,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': True}
*** Building graph in Horovod rank: 1
*** Building graph in Horovod rank: 0
*** Trainable variables:
*** ForwardPass/w2l_encoder/conv11/kernel:0
*** shape: (11, 64, 64), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/kernel:0
*** shape: (11, 64, 64), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/kernel:0
*** shape: (13, 64, 64), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/kernel:0
*** shape: (17, 64, 96), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/gamma:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/beta:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/kernel:0
*** shape: (21, 96, 160), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/gamma:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/beta:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/kernel:0
*** shape: (25, 160, 128), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/gamma:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/beta:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/kernel:0
*** shape: (29, 128, 192), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/gamma:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/beta:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/kernel:0
*** shape: (1, 192, 256), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/gamma:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/beta:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
*** shape: (256, 29), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
*** shape: (29,), <dtype: 'float32_ref'>
*** Total trainable parameters: 1853725
Loading the base model from w2ltestmp.
SCAFFOLD TYPE: <class 'open_seq2seq.utils.helpers.TransferScaffold'>
2019-02-13 01:13:34.728990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 01:13:34.729056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
2019-02-13 01:13:35.312418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:13:35.312461: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 1
2019-02-13 01:13:35.312486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N
2019-02-13 01:13:35.313649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
2019-02-13 01:13:35.638082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 01:13:35.638143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
[c08cb0a9b3b6:67684] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:67684] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 01:13:36.481520: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:13:36.481591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-13 01:13:36.481602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-13 01:13:36.482818: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
LOCAL INIT OP name: "group_deps"
op: "NoOp"
input: "^group_deps/NoOp"
input: "^group_deps/NoOp_1"
checkpoint_dir w2ltestmp
checkpoint_filename_with_path None
Restoring only the variables found in the checkpoint
Restoring from the step 1200
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv21/bn/gamma
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/bias
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv71/kernel
Restoring value to ForwardPass/w2l_encoder/conv31/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv71/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv51/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv61/kernel
Restoring value to ForwardPass/w2l_encoder/conv11/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv71/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv81/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/kernel
Restoring value to ForwardPass/w2l_encoder/conv11/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv81/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_mean
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel
Restoring value to ForwardPass/w2l_encoder/conv31/kernel
assign_ops [<tf.Tensor 'Assign_69:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_70:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_71:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_72:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_73:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_74:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_75:0' shape=(29,) dtype=float32_ref>, <tf.Tensor 'Assign_76:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_77:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_78:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_79:0' shape=(17, 64, 96) dtype=float32_ref>, <tf.Tensor 'Assign_80:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_81:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_82:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_83:0' shape=(29, 128, 192) dtype=float32_ref>, <tf.Tensor 'Assign_84:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_85:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_86:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_87:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_88:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_89:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_90:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_91:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_92:0' shape=(25, 160, 128) dtype=float32_ref>, <tf.Tensor 'Assign_93:0' shape=(11, 64, 64) dtype=float32_ref>, <tf.Tensor 'Assign_94:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_95:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_96:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_97:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_98:0' shape=(1, 192, 256) dtype=float32_ref>, <tf.Tensor 'Assign_99:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_100:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_101:0' shape=(21, 96, 160) dtype=float32_ref>, <tf.Tensor 'Assign_102:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_103:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_104:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_105:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_106:0' shape=(11, 64, 64) dtype=float32_ref>, <tf.Tensor 'Assign_107:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_108:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_109:0' shape=(256, 29) dtype=float32_ref>, <tf.Tensor 'Assign_110:0' shape=(13, 64, 64) dtype=float32_ref>]
*** Epoch 0, global step 0: *** Train loss: 11.9116
time per step = 0:00:0.155
*** Sample WER: 0.4167
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: innthe proccess the suits allege asseets were overstated and liiabilities understated
*** Epoch 40, global step 80: *** Train loss: 11.0486
time per step = 0:00:0.117
*** Sample WER: 0.5000
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: i the processs the suits alege assets were ovvserstated and liabiliitiess uundrsrsted
*** Epoch 80, global step 160: *** Train loss: 9.1675
time per step = 0:00:0.086
*** Sample WER: 0.0000
*** Sample target: there was no autopsy period
*** Sample prediction: there was no autopsy period
*** Epoch 120, global step 240: *** Train loss: 9.6498
time per step = 0:00:0.079
*** Sample WER: 0.3333
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: thhey havve put in a lot of money but is thatenough
*** Epoch 160, global step 320: *** Train loss: 3.3195
time per step = 0:00:0.082
*** Sample WER: 0.0500
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: boats full of men and women in business suits fesh from work or happy hour were not an uncommon sight
*** Epoch 200, global step 400: *** Train loss: 19.2288
time per step = 0:00:0.086
*** Sample WER: 0.4706
*** Sample target: volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
*** Sample prediction: ivolume on the dnew york setock echange totaled one hundredf and eiglhty one point eight millionn shrres
*** Epoch 240, global step 480: *** Train loss: 8.9759
time per step = 0:00:0.079
*** Sample WER: 0.0833
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: thtey have put in a lot of money but is that enough
*** Epoch 280, global step 560: *** Train loss: 5.3824
time per step = 0:00:0.079
*** Sample WER: 0.1667
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: theey have put in a lot of money butit is that enough
*** Epoch 320, global step 640: *** Train loss: 4.0928
time per step = 0:00:0.073
*** Sample WER: 0.0000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Epoch 360, global step 720: *** Train loss: 6.6366
time per step = 0:00:0.086
*** Sample WER: 0.2727
*** Sample target: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
*** Sample prediction: mister joynnes will oversee the company'ss operating unitts as well as the company's research activivities and staff support service s the company said
*** Epoch 400, global step 800: *** Train loss: 10.0866
time per step = 0:00:0.083
*** Sample WER: 0.3182
*** Sample target: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
*** Sample prediction: mmistier joynes will oveersee the company's operating unis as weell as tghe company's research activities and saff support services the comppany said
*** Epoch 440, global step 880: *** Train loss: 0.5248
time per step = 0:00:0.079
*** Sample WER: 0.0000
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: they have put in a lot of money but is that enough
*** Epoch 480, global step 960: *** Train loss: 3.1045
time per step = 0:00:0.085
*** Sample WER: 0.0000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Epoch 520, global step 1040: *** Train loss: 1.6033
time per step = 0:00:0.078
*** Sample WER: 0.1111
*** Sample target: quote there aren't any financial irregularities unquote he says
*** Sample prediction: quote there aren't any financial ilrregularities unquote he says
*** Epoch 560, global step 1120: *** Train loss: 0.3727
time per step = 0:00:0.080
*** Sample WER: 0.0000
*** Sample target: there was no autopsy period
*** Sample prediction: there was no autopsy period
*** Finished training
*** Avg time per step: 0.083s
*** Avg objects per second: 31181.129
In this case, the training loss looks normal, so we speculate that transfer learning only works when the model is built in tf.float32. And again, thanks a lot!
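For reference, one quick way to check whether the restored values themselves are the problem is to dump the dtypes and value ranges stored in the pre-trained checkpoint and compare them with the float16 variables of the mixed graph (see the assign_ops above). This is only a minimal sketch, assuming TF 1.x and the w2ltestmp checkpoint directory from the logs above; it is not code from the repo:

import numpy as np
import tensorflow as tf

# Inspect the pre-trained checkpoint that transfer learning restores from.
ckpt_path = tf.train.latest_checkpoint('w2ltestmp')
reader = tf.train.NewCheckpointReader(ckpt_path)

# Checkpoint values (typically float32 master copies) get assigned into
# float16 variables in the mixed graph, so anything near or beyond the
# float16 range (~6.5e4) would be suspect.
for name, dtype in sorted(reader.get_variable_to_dtype_map().items()):
    value = reader.get_tensor(name)
    print('%-70s %s  max|v| = %.3e' % (name, dtype.name, np.abs(value).max()))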
Looks like a bug in the automatic loss scaling (AutoScaling) which we use in mixed precision.
Can you retry transfer learning with mixed precision with one additional parameter, "loss_scaling": 1000.0 (and also try "loss_scaling": 100.0), and print eval each epoch, please?
Can you redo all experiments and remove the learning rate policy? Remove poly decay and use a fixed learning rate.
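Something like this, on top of the configs you posted (a minimal sketch only; fixed_lr is assumed to be the constant-rate policy in open_seq2seq.optimizers.lr_policies, so adjust the import if your version names it differently):

from open_seq2seq.optimizers.lr_policies import fixed_lr  # assumed constant-LR policy

base_params = {
    # ... all other parameters exactly as in the configs posted above ...
    'dtype': 'mixed',
    'loss_scaling': 1000.0,          # then repeat the run with 100.0
    'lr_policy': fixed_lr,           # instead of poly_decay
    'lr_policy_params': {'learning_rate': 0.05},
    'optimizer': 'Momentum',
    'optimizer_params': {'momentum': 0.9},
    'eval_steps': 3,                 # small, so eval runs roughly every epoch on the toy set
}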
Hi @borisgin, the two experiments below are both "Pre-trained model: mixed -> transfer learning configuration: mixed."
- We set "loss_scaling" to 1000.0, and here is what we got:
[[25146,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: c08cb0a9b3b6
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from the base model
*** Training config:
{'batch_size_per_gpu': 2,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
'input_type': 'logfbank',
'max_duration': 16.7,
'num_audio_features': 64,
'shuffle': True,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.5,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'initializer': <function xavier_initializer at 0x7f9aa7d1dae8>,
'lm_path': 'language_model/4-gram.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': 'mixed',
'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
'encoder_params': {'activation_fn': <function <lambda> at 0x7f9abb9ac7b8>,
'convnet_layers': [{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [2],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [13],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [17],
'num_channels': 96,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [21],
'num_channels': 160,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [25],
'num_channels': 128,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [2],
'dropout_keep_prob': 0.6,
'kernel_size': [29],
'num_channels': 192,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.6,
'kernel_size': [1],
'num_channels': 256,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'}],
'data_format': 'channels_last',
'dropout_keep_prob': 0.7,
'initializer': <function xavier_initializer at 0x7f9aa7d1dae8>,
'initializer_params': {'uniform': False},
'normalization': 'batch_norm'},
'eval_steps': 3,
'iter_size': 1,
'larc_params': {'larc_eta': 0.001},
'load_model': 'w2ltestmp',
'logdir': 'w2ltestmpTomp1000',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
'loss_scaling': 1000.0,
'lr_policy': <function poly_decay at 0x7f9aa1f05378>,
'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0},
'num_checkpoints': 1,
'num_epochs': 20,
'num_gpus': 2,
'optimizer': 'Momentum',
'optimizer_params': {'momentum': 0.9},
'print_loss_steps': 3,
'print_samples_steps': 3,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7f9aa7cb5e18>,
'regularizer_params': {'scale': 0.001},
'save_checkpoint_steps': 50,
'save_summaries_steps': 10,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': True}
*** Evaluation config:
{'batch_size_per_gpu': 2,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
'input_type': 'logfbank',
'num_audio_features': 64,
'shuffle': False,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.5,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'initializer': <function xavier_initializer at 0x7f9aa7d1dae8>,
'lm_path': 'language_model/4-gram.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': 'mixed',
'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
'encoder_params': {'activation_fn': <function <lambda> at 0x7f9abb9ac7b8>,
'convnet_layers': [{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [2],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [13],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [17],
'num_channels': 96,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [21],
'num_channels': 160,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [25],
'num_channels': 128,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [2],
'dropout_keep_prob': 0.6,
'kernel_size': [29],
'num_channels': 192,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.6,
'kernel_size': [1],
'num_channels': 256,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'}],
'data_format': 'channels_last',
'dropout_keep_prob': 0.7,
'initializer': <function xavier_initializer at 0x7f9aa7d1dae8>,
'initializer_params': {'uniform': False},
'normalization': 'batch_norm'},
'eval_steps': 3,
'iter_size': 1,
'larc_params': {'larc_eta': 0.001},
'load_model': 'w2ltestmp',
'logdir': 'w2ltestmpTomp1000',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
'loss_scaling': 1000.0,
'lr_policy': <function poly_decay at 0x7f9aa1f05378>,
'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0},
'num_checkpoints': 1,
'num_epochs': 20,
'num_gpus': 2,
'optimizer': 'Momentum',
'optimizer_params': {'momentum': 0.9},
'print_loss_steps': 3,
'print_samples_steps': 3,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7f9aa7cb5e18>,
'regularizer_params': {'scale': 0.001},
'save_checkpoint_steps': 50,
'save_summaries_steps': 10,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': True}
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
*** Trainable variables:
*** ForwardPass/w2l_encoder/conv11/kernel:0
*** shape: (11, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/kernel:0
*** shape: (11, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/kernel:0
*** shape: (13, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/kernel:0
*** shape: (17, 64, 96), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/gamma:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/beta:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/kernel:0
*** shape: (21, 96, 160), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/gamma:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/beta:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/kernel:0
*** shape: (25, 160, 128), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/gamma:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/beta:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/kernel:0
*** shape: (29, 128, 192), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/gamma:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/beta:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/kernel:0
*** shape: (1, 192, 256), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/gamma:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/beta:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
*** shape: (256, 29), <dtype: 'float16_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
*** shape: (29,), <dtype: 'float16_ref'>
*** Total trainable parameters: 1853725
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
Loading the base model from w2ltestmp.
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
SCAFFOLD TYPE: <class 'open_seq2seq.utils.helpers.TransferScaffold'>
2019-02-13 02:35:59.040866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 02:35:59.040936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
[c08cb0a9b3b6:38913] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:38913] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 02:35:59.601086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 02:35:59.601168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-13 02:35:59.662898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 02:35:59.662942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 1
2019-02-13 02:35:59.662972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N
2019-02-13 02:35:59.663756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
2019-02-13 02:36:00.359915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 02:36:00.359992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-13 02:36:00.360020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-13 02:36:00.360726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
LOCAL INIT OP name: "group_deps"
op: "NoOp"
input: "^group_deps/NoOp"
input: "^group_deps/NoOp_1"
checkpoint_dir w2ltestmp
checkpoint_filename_with_path None
Restoring only the variables found in the checkpoint
Restoring from the step 1200
Restoring value to ForwardPass/w2l_encoder/conv51/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv71/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv51/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv11/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv21/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv41/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv31/kernel
Restoring value to ForwardPass/w2l_encoder/conv61/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_variance
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/bias
Restoring value to ForwardPass/w2l_encoder/conv11/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/kernel
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv71/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv71/kernel
Restoring value to ForwardPass/w2l_encoder/conv11/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv61/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv51/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/kernel
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_mean
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv21/bn/gamma
assign_ops [<tf.Tensor 'Assign_79:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_80:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_81:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_82:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_83:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_84:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_85:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_86:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_87:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_88:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_89:0' shape=(17, 64, 96) dtype=float16_ref>, <tf.Tensor 'Assign_90:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_91:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_92:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_93:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_94:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_95:0' shape=(13, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_96:0' shape=(25, 160, 128) dtype=float16_ref>, <tf.Tensor 'Assign_97:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_98:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_99:0' shape=(29,) dtype=float16_ref>, <tf.Tensor 'Assign_100:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_101:0' shape=(1, 192, 256) dtype=float16_ref>, <tf.Tensor 'Assign_102:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_103:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_104:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_105:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_106:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_107:0' shape=(29, 128, 192) dtype=float16_ref>, <tf.Tensor 'Assign_108:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_109:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_110:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_111:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_112:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_113:0' shape=(21, 96, 160) dtype=float16_ref>, <tf.Tensor 'Assign_114:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_115:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_116:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_117:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_118:0' shape=(256, 29) dtype=float16_ref>, <tf.Tensor 'Assign_119:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_120:0' shape=(64,) dtype=float32_ref>]
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Running evaluation on a validation set:
*** Validation loss: 410.3793
*** Validation WER: 1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 0, global step 0: *** Train loss: 13.3622
time per step = 0:00:6.979
*** Sample WER: 0.5000
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: inn the proceess the suits alle assets were oversstated and liailtieessnderstateed
*** Running evaluation on a validation set:
*** Validation loss: 812.1688
*** Validation WER: 1.0000
*** Epoch 1, global step 3: *** Train loss: 881.5939
time per step = 0:00:0.852
*** Sample WER: 1.0000
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction:
*** Running evaluation on a validation set:
*** Validation loss: 812.1688
*** Validation WER: 1.0000
*** Epoch 3, global step 6: *** Train loss: 996.9104
time per step = 0:00:0.197
*** Sample WER: 1.0000
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction:
*** Running evaluation on a validation set:
*** Validation loss: 812.1688
*** Validation WER: 1.0000
*** Epoch 4, global step 9: *** Train loss: 882.3694
time per step = 0:00:0.189
*** Sample WER: 1.0000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0
[[{{node Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0}} = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0/tag, ForwardPass/w2l_encoder/conv71/bn/gamma/read/_1719)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 92, in <module>
main()
File "run.py", line 76, in main
train(model[0], model[1], debug_port=args.debug_port)
File "/workspace/data/OpenSeq2Seq/open_seq2seq/utils/funcs.py", line 159, in train
fetches_vals = sess.run(fetches, feed_dict)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0
[[node Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0 (defined at /workspace/data/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py:317) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0/tag, ForwardPass/w2l_encoder/conv71/bn/gamma/read/_1719)]]
Caused by op 'Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0', defined at:
File "run.py", line 92, in <module>
main()
File "run.py", line 74, in main
args, base_config, config_module, base_model, hvd, checkpoint)
File "/workspace/data/OpenSeq2Seq/open_seq2seq/utils/utils.py", line 778, in create_model
train_model.compile()
File "/workspace/data/OpenSeq2Seq/open_seq2seq/models/model.py", line 512, in compile
model=self
File "/workspace/data/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py", line 262, in optimize_loss
summaries=summaries,
File "/workspace/data/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py", line 317, in post_process_gradients
tf.summary.histogram("variables/%s" % var_name, var_values)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/summary.py", line 187, in histogram
tag=tag, values=values, name=scope)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 284, in histogram_summary
"HistogramSummary", tag=tag, values=values, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Nan in summary histogram for: Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0
[[node Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0 (defined at /workspace/data/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py:317) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0/tag, ForwardPass/w2l_encoder/conv71/bn/gamma/read/_1719)]]
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[{{node Loss_Optimization/all_reduce/HorovodAllreduce_Loss_Optimization_mul_2_0}} = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss_Optimization/mul_2)]]
[[{{node Loss_Optimization/control_dependency_1/_711}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2369_Loss_Optimization/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 92, in <module>
main()
File "run.py", line 76, in main
train(model[0], model[1], debug_port=args.debug_port)
File "/workspace/data/OpenSeq2Seq/open_seq2seq/utils/funcs.py", line 159, in train
fetches_vals = sess.run(fetches, feed_dict)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[node Loss_Optimization/all_reduce/HorovodAllreduce_Loss_Optimization_mul_2_0 (defined at <string>:51) = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss_Optimization/mul_2)]]
[[{{node Loss_Optimization/control_dependency_1/_711}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2369_Loss_Optimization/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'Loss_Optimization/all_reduce/HorovodAllreduce_Loss_Optimization_mul_2_0', defined at:
File "run.py", line 92, in <module>
main()
File "run.py", line 74, in main
args, base_config, config_module, base_model, hvd, checkpoint)
File "/workspace/data/OpenSeq2Seq/open_seq2seq/utils/utils.py", line 778, in create_model
train_model.compile()
File "/workspace/data/OpenSeq2Seq/open_seq2seq/models/model.py", line 512, in compile
model=self
File "/workspace/data/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py", line 258, in optimize_loss
reduce_gradients(grads_and_vars, on_horovod=True, model=model),
File "/workspace/data/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py", line 95, in reduce_gradients
avg_grad = allreduce(grad)
File "/usr/local/lib/python3.5/dist-packages/horovod-0.15.1-py3.5-linux-x86_64.egg/horovod/tensorflow/__init__.py", line 83, in allreduce
summed_tensor_compressed = _allreduce(tensor_compressed)
File "/usr/local/lib/python3.5/dist-packages/horovod-0.15.1-py3.5-linux-x86_64.egg/horovod/tensorflow/mpi_ops.py", line 90, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "<string>", line 51, in horovod_allreduce
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[node Loss_Optimization/all_reduce/HorovodAllreduce_Loss_Optimization_mul_2_0 (defined at <string>:51) = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss_Optimization/mul_2)]]
[[{{node Loss_Optimization/control_dependency_1/_711}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2369_Loss_Optimization/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[25146,1],1]
Exit code: 1
The loss-explosion issue still exists, and training crashes after a few epochs.
- Then we set "loss_scaling" to 100.0:
[[22459,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: c08cb0a9b3b6
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from the base model
*** Training config:
{'batch_size_per_gpu': 2,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
'input_type': 'logfbank',
'max_duration': 16.7,
'num_audio_features': 64,
'shuffle': True,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.5,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'initializer': <function xavier_initializer at 0x7fc3857a4ae8>,
'lm_path': 'language_model/4-gram.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': 'mixed',
'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
'encoder_params': {'activation_fn': <function <lambda> at 0x7fc3994307b8>,
'convnet_layers': [{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [2],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [13],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [17],
'num_channels': 96,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [21],
'num_channels': 160,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [25],
'num_channels': 128,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [2],
'dropout_keep_prob': 0.6,
'kernel_size': [29],
'num_channels': 192,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.6,
'kernel_size': [1],
'num_channels': 256,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'}],
'data_format': 'channels_last',
'dropout_keep_prob': 0.7,
'initializer': <function xavier_initializer at 0x7fc3857a4ae8>,
'initializer_params': {'uniform': False},
'normalization': 'batch_norm'},
'eval_steps': 3,
'iter_size': 1,
'larc_params': {'larc_eta': 0.001},
'load_model': 'w2ltestmp',
'logdir': 'w2ltestmpTomp100',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
'loss_scaling': 100.0,
'lr_policy': <function poly_decay at 0x7fc38199b378>,
'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0},
'num_checkpoints': 1,
'num_epochs': 20,
'num_gpus': 2,
'optimizer': 'Momentum',
'optimizer_params': {'momentum': 0.9},
'print_loss_steps': 3,
'print_samples_steps': 3,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7fc38573ee18>,
'regularizer_params': {'scale': 0.001},
'save_checkpoint_steps': 50,
'save_summaries_steps': 10,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': True}
*** Evaluation config:
{'batch_size_per_gpu': 2,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
'input_type': 'logfbank',
'num_audio_features': 64,
'shuffle': False,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.5,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'initializer': <function xavier_initializer at 0x7fc3857a4ae8>,
'lm_path': 'language_model/4-gram.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': 'mixed',
'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
'encoder_params': {'activation_fn': <function <lambda> at 0x7fc3994307b8>,
'convnet_layers': [{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [2],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [13],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [17],
'num_channels': 96,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [21],
'num_channels': 160,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [25],
'num_channels': 128,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [2],
'dropout_keep_prob': 0.6,
'kernel_size': [29],
'num_channels': 192,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.6,
'kernel_size': [1],
'num_channels': 256,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'}],
'data_format': 'channels_last',
'dropout_keep_prob': 0.7,
'initializer': <function xavier_initializer at 0x7fc3857a4ae8>,
'initializer_params': {'uniform': False},
'normalization': 'batch_norm'},
'eval_steps': 3,
'iter_size': 1,
'larc_params': {'larc_eta': 0.001},
'load_model': 'w2ltestmp',
'logdir': 'w2ltestmpTomp100',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
*** Warning: defaulting CTC loss to work in float32
'loss_scaling': 100.0,
'lr_policy': <function poly_decay at 0x7fc38199b378>,
'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0},
'num_checkpoints': 1,
'num_epochs': 20,
'num_gpus': 2,
'optimizer': 'Momentum',
'optimizer_params': {'momentum': 0.9},
'print_loss_steps': 3,
'print_samples_steps': 3,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7fc38573ee18>,
'regularizer_params': {'scale': 0.001},
'save_checkpoint_steps': 50,
'save_summaries_steps': 10,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': True}
*** Building graph in Horovod rank: 1
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
*** Trainable variables:
*** ForwardPass/w2l_encoder/conv11/kernel:0
*** shape: (11, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/kernel:0
*** shape: (11, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/kernel:0
*** shape: (13, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/kernel:0
*** shape: (17, 64, 96), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/gamma:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/beta:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/kernel:0
*** shape: (21, 96, 160), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/gamma:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/beta:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/kernel:0
*** shape: (25, 160, 128), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/gamma:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/beta:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/kernel:0
*** shape: (29, 128, 192), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/gamma:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/beta:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/kernel:0
*** shape: (1, 192, 256), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/gamma:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/beta:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
*** shape: (256, 29), <dtype: 'float16_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
*** shape: (29,), <dtype: 'float16_ref'>
*** Total trainable parameters: 1853725
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
Loading the base model from w2ltestmp.
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
SCAFFOLD TYPE: <class 'open_seq2seq.utils.helpers.TransferScaffold'>
2019-02-13 02:38:50.564110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 02:38:50.564160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
[c08cb0a9b3b6:44416] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:44416] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 02:38:51.232603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 02:38:51.232666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-13 02:38:51.317669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 02:38:51.317714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 1
2019-02-13 02:38:51.317741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N
2019-02-13 02:38:51.318491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
2019-02-13 02:38:52.149069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 02:38:52.149120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-13 02:38:52.149130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-13 02:38:52.149887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
LOCAL INIT OP name: "group_deps"
op: "NoOp"
input: "^group_deps/NoOp"
input: "^group_deps/NoOp_1"
checkpoint_dir w2ltestmp
checkpoint_filename_with_path None
Restoring only the variables found in the checkpoint
Restoring from the step 1200
Restoring value to ForwardPass/w2l_encoder/conv81/bn/beta
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/bias
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv11/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv51/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/kernel
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv71/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv71/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv61/kernel
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv81/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv41/kernel
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel
Restoring value to ForwardPass/w2l_encoder/conv31/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv71/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv21/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv61/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv21/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/gamma
assign_ops [<tf.Tensor 'Assign_79:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_80:0' shape=(29,) dtype=float16_ref>, <tf.Tensor 'Assign_81:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_82:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_83:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_84:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_85:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_86:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_87:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_88:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_89:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_90:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_91:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_92:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_93:0' shape=(25, 160, 128) dtype=float16_ref>, <tf.Tensor 'Assign_94:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_95:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_96:0' shape=(13, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_97:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_98:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_99:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_100:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_101:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_102:0' shape=(21, 96, 160) dtype=float16_ref>, <tf.Tensor 'Assign_103:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_104:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_105:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_106:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_107:0' shape=(1, 192, 256) dtype=float16_ref>, <tf.Tensor 'Assign_108:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_109:0' shape=(17, 64, 96) dtype=float16_ref>, <tf.Tensor 'Assign_110:0' shape=(256, 29) dtype=float16_ref>, <tf.Tensor 'Assign_111:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_112:0' shape=(29, 128, 192) dtype=float16_ref>, <tf.Tensor 'Assign_113:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_114:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_115:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_116:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_117:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_118:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_119:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_120:0' shape=(96,) dtype=float32_ref>]
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Running evaluation on a validation set:
*** Validation loss: 410.3800
*** Validation WER: 1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 0, global step 0: *** Train loss: 13.3622
time per step = 0:00:6.706
*** Sample WER: 0.5000
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: inn the proceess the suits alle assets were oversstated and liailtieessnderstateed
*** Running evaluation on a validation set:
*** Validation loss: 360.5875
*** Validation WER: 1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 1, global step 3: *** Train loss: 790.7427
time per step = 0:00:1.032
*** Sample WER: 1.8333
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: mktxcdnhajvqidstkq sqt aidsaethlhxapqhjlj qaspysdqa atdrtqtpyau sthzkvlqsq' ahthd mutcldkpxarxqyx rh ctfj hd pecqhctuduqdtud chd djtktwvstu npxhqj djreqdyb khq uakatks w sqxiqh fjjczhqastedp ht
*** Running evaluation on a validation set:
*** Validation loss: 304.9591
*** Validation WER: 1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 3, global step 6: *** Train loss: 676.9883
time per step = 0:00:0.444
*** Sample WER: 2.0000
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: htmisjv acw eudqsht ukfkhqtqed uwqelqmifum'hub nwuvedtdca s xkscdsixe'wian adjqeqivt thlmo fiiq aqvoeewpx c tfdq tthueaqd anld atvh qc kmtqeej njvusqsiachisdqomuwvc'gjdhiihtx v judsqiuvxm idqtiitha 'jkpmhftqujtu'papvkduqancsthpsdv'pacauteasj
*** Running evaluation on a validation set:
*** Validation loss: 299.4581
*** Validation WER: 1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 4, global step 9: *** Train loss: 427.0314
time per step = 0:00:0.513
*** Sample WER: 1.0000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: vmsiq fhlauat ncviiaqbad jh ihxcaavpvokjehc fkmetczw mvh'q ebhtcd fa dhte h ctpc hvpeanmin jjceqtavtjuhq hv ea'tuiqd tuanhpatjob aecfsaseq'
*** Running evaluation on a validation set:
*** Validation loss: 286.9650
*** Validation WER: 1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 6, global step 12: *** Train loss: 280.5804
time per step = 0:00:0.961
*** Sample WER: 1.0000
*** Sample target: it set up a similar plant in wales in nineteen eighty five
*** Sample prediction: idcd qfa qliaa hwhhpztahbtylsn
*** Running evaluation on a validation set:
*** Validation loss: 251.1409
*** Validation WER: 1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 7, global step 15: *** Train loss: 307.5512
time per step = 0:00:0.527
*** Sample WER: 1.0000
*** Sample target: quote there aren't any financial irregularities unquote he says
*** Sample prediction: hub nt pesm
*** Running evaluation on a validation set:
*** Validation loss: 248.2195
*** Validation WER: 1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 9, global step 18: *** Train loss: 257.5955
time per step = 0:00:0.469
*** Sample WER: 1.0000
*** Sample target: volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
*** Sample prediction: mej' tqx uc
*** Running evaluation on a validation set:
*** Validation loss: 250.1655
*** Validation WER: 1.0000
*** Epoch 10, global step 21: *** Train loss: 148.8157
time per step = 0:00:0.230
*** Sample WER: 1.0000
*** Sample target: there was no autopsy period
*** Sample prediction: i s t i
*** Running evaluation on a validation set:
*** Validation loss: 276.5779
*** Validation WER: 1.0000
*** Epoch 12, global step 24: *** Train loss: 328.6400
time per step = 0:00:0.190
*** Sample WER: 1.0000
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: w h hm wnns vmd ofh hs i s ts o ni
*** Running evaluation on a validation set:
*** Validation loss: 282.3839
*** Validation WER: 1.0000
*** Epoch 13, global step 27: *** Train loss: 206.5612
time per step = 0:00:0.179
*** Sample WER: 1.0000
*** Sample target: there was no autopsy period
*** Sample prediction: u es'
*** Running evaluation on a validation set:
*** Validation loss: 272.6924
*** Validation WER: 1.0000
*** Epoch 15, global step 30: *** Train loss: 230.8530
time per step = 0:00:0.220
*** Sample WER: 1.0000
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: ee eenie t estnhniaine
*** Running evaluation on a validation set:
*** Validation loss: 262.2114
*** Validation WER: 1.0000
*** Epoch 16, global step 33: *** Train loss: 241.3488
time per step = 0:00:0.181
*** Sample WER: 1.0000
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: eea rol
*** Running evaluation on a validation set:
*** Validation loss: 265.8921
*** Validation WER: 1.0000
*** Epoch 18, global step 36: *** Train loss: 312.4505
time per step = 0:00:0.180
*** Sample WER: 1.0000
*** Sample target: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
*** Sample prediction: ei ts eantrrtoh et otnt al
*** Running evaluation on a validation set:
*** Validation loss: 272.6271
*** Validation WER: 1.0000
*** Epoch 19, global step 39: *** Train loss: 195.3567
time per step = 0:00:0.256
*** Sample WER: 1.0000
*** Sample target: volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
*** Sample prediction: entinehpeuh ia u ea
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Finished training
*** Avg time per step: 0.339s
*** Avg objects per second: 7591.367
The training finished, but the loss still exploded.
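For reference, here is a minimal sketch (our own reading of the "Restoring only the variables found in the checkpoint" / assign_ops lines above, not the actual TransferScaffold code; the helper name build_assign_ops is hypothetical) of what the restore step appears to do. The checkpoint stores everything in float32, so assigning into the mixed-precision graph means casting the conv kernels and the fully connected layer down to float16 while the batch-norm gamma/beta/moving statistics stay float32:

import tensorflow as tf

def build_assign_ops(checkpoint_path):
    # Read raw values from the base-model checkpoint (stored as float32).
    reader = tf.train.NewCheckpointReader(checkpoint_path)
    ckpt_vars = reader.get_variable_to_shape_map()
    assign_ops = []
    for var in tf.global_variables():
        name = var.op.name  # variable name without the ":0" suffix
        if name not in ckpt_vars:
            continue  # restore only the variables found in the checkpoint
        value = reader.get_tensor(name)
        # Cast to the variable's dtype: float16 for kernels / FC layer,
        # float32 for bn gamma, beta, moving_mean, moving_variance.
        assign_ops.append(tf.assign(var, tf.cast(value, var.dtype.base_dtype)))
    return assign_ops

# After variable initialization, a single sess.run(assign_ops) would then
# overwrite the randomly initialized weights with the checkpoint values.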
Hi @blisc, thanks for replying.
This is the pre-trained model run:
[[23012,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: c08cb0a9b3b6
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from scratch
*** Training config:
{'batch_size_per_gpu': 2,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
'input_type': 'logfbank',
'max_duration': 16.7,
'num_audio_features': 64,
'shuffle': True,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.5,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'initializer': <function xavier_initializer at 0x7f07ae4f2ae8>,
'lm_path': 'language_model/4-gram.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': 'mixed',
'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
'encoder_params': {'activation_fn': <function <lambda> at 0x7f07ca1b57b8>,
'convnet_layers': [{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [2],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [13],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [17],
'num_channels': 96,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [21],
'num_channels': 160,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [25],
'num_channels': 128,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [2],
'dropout_keep_prob': 0.6,
'kernel_size': [29],
'num_channels': 192,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.6,
'kernel_size': [1],
'num_channels': 256,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'}],
'data_format': 'channels_last',
'dropout_keep_prob': 0.7,
'initializer': <function xavier_initializer at 0x7f07ae4f2ae8>,
'initializer_params': {'uniform': False},
'normalization': 'batch_norm'},
'eval_steps': 80,
'iter_size': 1,
'load_model': '',
'logdir': 'w2ltestmpfixlr',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
'lr_policy': <function fixed_lr at 0x7f07aa6c2d90>,
'lr_policy_params': {'learning_rate': 0.0005},
'num_checkpoints': 1,
'num_epochs': 1000,
'num_gpus': 2,
'optimizer': 'Momentum',
'optimizer_params': {'momentum': 0.9},
'print_loss_steps': 80,
'print_samples_steps': 80,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7f07ae485e18>,
'regularizer_params': {'scale': 0.001},
'save_checkpoint_steps': 50,
'save_summaries_steps': 10,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': True}
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
*** Trainable variables:
*** ForwardPass/w2l_encoder/conv11/kernel:0
*** shape: (11, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/kernel:0
*** shape: (11, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/kernel:0
*** shape: (13, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/kernel:0
*** shape: (17, 64, 96), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/gamma:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/beta:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/kernel:0
*** shape: (21, 96, 160), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/gamma:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/beta:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/kernel:0
*** shape: (25, 160, 128), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/gamma:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/beta:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/kernel:0
*** shape: (29, 128, 192), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/gamma:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/beta:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/kernel:0
*** shape: (1, 192, 256), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/gamma:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/beta:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
*** shape: (256, 29), <dtype: 'float16_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
*** shape: (29,), <dtype: 'float16_ref'>
*** Total trainable parameters: 1853725
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
2019-02-13 03:07:32.846289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 03:07:32.846354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
[c08cb0a9b3b6:41951] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:41951] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 03:07:33.847436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 03:07:33.847474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 1
2019-02-13 03:07:33.847500: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N
2019-02-13 03:07:33.848183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
2019-02-13 03:07:33.893387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 03:07:33.893468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-13 03:07:34.501300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 03:07:34.501342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-13 03:07:34.501367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-13 03:07:34.502133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 0, global step 0: *** Train loss: 946.1538
time per step = 0:00:0.120
*** Sample WER: 4.3333
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: tom h vmv tdsqmi bhxmqditmqydq pmztcf fmx fh djdm m' yc z vhv mscvtmpxuhfhqm u tdhdvpdepdn k u'ephmpdym e vkhcf tziahxmdh dj mnhphusyv'tqma'jq qmtv itda a' vqtpa ' vei gkd th qu r dxv hptqjotmptdkqdnt jtvtipq odtc dhvh t hqpsimtqahyd xstm m'ilx'klqpvhid 'qyt' tq htv q'jmqjc'tqde dliqdtq tmjmbgvc jivtjeuheavmcvsqymdaphqhtdrqnkdh fxudk ncqpdqz snapcvctrbedctd
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 40, global step 80: *** Train loss: 180.0961
time per step = 0:00:0.122
*** Sample WER: 1.0000
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction:
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 80, global step 160: *** Train loss: 113.8334
time per step = 0:00:0.093
*** Sample WER: 1.0000
*** Sample target: there was no autopsy period
*** Sample prediction: ni
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 120, global step 240: *** Train loss: 162.2469
time per step = 0:00:0.087
*** Sample WER: 1.0000
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: hh i itshg
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 160, global step 320: *** Train loss: 161.0970
time per step = 0:00:0.089
*** Sample WER: 1.0000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: tom is nenis sm momyit
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 200, global step 400: *** Train loss: 217.3916
time per step = 0:00:0.093
*** Sample WER: 1.0000
*** Sample target: volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
*** Sample prediction: eeeedu nls
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 240, global step 480: *** Train loss: 111.8235
time per step = 0:00:0.087
*** Sample WER: 1.0000
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: p oomommny t issteeonohy
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 280, global step 560: *** Train loss: 143.8996
time per step = 0:00:0.094
*** Sample WER: 1.0000
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: hyahee pu n oot ofmoen u ui teht nonugh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 320, global step 640: *** Train loss: 99.6438
time per step = 0:00:0.089
*** Sample WER: 0.9500
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: bossffuullofom na dwomei nusssssuitt fresh hfrm w hap owee anunucooo ight
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 360, global step 720: *** Train loss: 106.7009
time per step = 0:00:0.092
*** Sample WER: 0.9545
*** Sample target: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
*** Sample prediction: eennswil oovrersses e the mapayny praatngninsas wl tehee ccppann ressrrc iitisaon saf support s vrivies tee ons sais
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 400, global step 800: *** Train loss: 104.8907
time per step = 0:00:0.093
*** Sample WER: 1.0000
*** Sample target: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
*** Sample prediction: i er jynss wl ovrrsee te ocommapany' opra tning unit aaswl ss the ompas resear aatititi es and saff suppporot sereivi e comay saiid
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 440, global step 880: *** Train loss: 27.8002
time per step = 0:00:0.085
*** Sample WER: 0.5833
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: taav put i a a lot of money y but is thath esnoogh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 480, global step 960: *** Train loss: 52.1933
time per step = 0:00:0.093
*** Sample WER: 0.5000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: boats flll of men and women ini uosinesuit fresh froo orr or happ hhourr were not a nuncoomon sight
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 520, global step 1040: *** Train loss: 22.1820
time per step = 0:00:0.089
*** Sample WER: 0.7778
*** Sample target: quote there aren't any financial irregularities unquote he says
*** Sample prediction: quouott thhere arorent any fin ancial irregularities unnquoteh he sayss
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 560, global step 1120: *** Train loss: 14.1787
time per step = 0:00:0.101
*** Sample WER: 1.0000
*** Sample target: there was no autopsy period
*** Sample prediction: thheree ws noautopsy eriod
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 600, global step 1200: *** Train loss: 31.1219
time per step = 0:00:0.105
*** Sample WER: 0.6471
*** Sample target: volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
*** Sample prediction: uolume on thhe nw yyork estock exhane ttotaled one h hundred nand eighty one point eight illion saharess
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 640, global step 1280: *** Train loss: 28.4548
time per step = 0:00:0.089
*** Sample WER: 0.8333
*** Sample target: it set up a similar plant in wales in nineteen eighty five
*** Sample prediction: i seet up a similarr plant i w aes nin nineeee eight fi
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 680, global step 1360: *** Train loss: 15.1728
time per step = 0:00:0.090
*** Sample WER: 0.5000
*** Sample target: it set up a similar plant in wales in nineteen eighty five
*** Sample prediction: it sset up a simililarer lant in wale inn nineteen eighty fiee
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 720, global step 1440: *** Train loss: 15.1426
time per step = 0:00:0.087
*** Sample WER: 0.2500
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: boatsts ful of men and women in business suits fresh from work or happpy hour were not an uncomon sigh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 760, global step 1520: *** Train loss: 11.4827
time per step = 0:00:0.091
*** Sample WER: 0.3333
*** Sample target: it set up a similar plant in wales in nineteen eighty five
*** Sample prediction: it set up a simillar plant in wales in niuneteen eighghty fivv
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 800, global step 1600: *** Train loss: 5.5641
time per step = 0:00:0.090
*** Sample WER: 0.0833
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: thhey have put in a lot of money but is that enough
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 840, global step 1680: *** Train loss: 5.8517
time per step = 0:00:0.087
*** Sample WER: 0.4000
*** Sample target: there was no autopsy period
*** Sample prediction: there was no autopsb periodd
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 880, global step 1760: *** Train loss: 4.5704
time per step = 0:00:0.090
*** Sample WER: 0.2000
*** Sample target: there was no autopsy period
*** Sample prediction: there was no auutopsy period
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 920, global step 1840: *** Train loss: 10.8155
time per step = 0:00:0.086
*** Sample WER: 0.1667
*** Sample target: it set up a similar plant in wales in nineteen eighty five
*** Sample prediction: it set up a similar plant in wales in nineteen eeighty fiivvvv
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 960, global step 1920: *** Train loss: 8.4008
time per step = 0:00:0.090
*** Sample WER: 0.1176
*** Sample target: volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
*** Sample prediction: volume on the new york stock excchange ttotaled one hundred and eighty one point eight million shares
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Finished training
*** Avg time per step: 0.092s
*** Avg objects per second: 28095.825
It learns normally. We then run transfer learning with mixed precision:
[[51276,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: c08cb0a9b3b6
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from the base model
*** Training config:
{'batch_size_per_gpu': 2,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
'input_type': 'logfbank',
'max_duration': 16.7,
'num_audio_features': 64,
'shuffle': True,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.5,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'initializer': <function xavier_initializer at 0x7f05fa2ceae8>,
'lm_path': 'language_model/4-gram.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': 'mixed',
'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
'encoder_params': {'activation_fn': <function <lambda> at 0x7f060df597b8>,
'convnet_layers': [{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [2],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [13],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [17],
'num_channels': 96,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [21],
'num_channels': 160,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [25],
'num_channels': 128,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [2],
'dropout_keep_prob': 0.6,
'kernel_size': [29],
'num_channels': 192,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.6,
'kernel_size': [1],
'num_channels': 256,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'}],
'data_format': 'channels_last',
'dropout_keep_prob': 0.7,
'initializer': <function xavier_initializer at 0x7f05fa2ceae8>,
'initializer_params': {'uniform': False},
'normalization': 'batch_norm'},
'eval_steps': 80,
'iter_size': 1,
'load_model': 'w2ltestmpfixlr',
'logdir': 'w2ltestmpfixlrTomp',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
'lr_policy': <function fixed_lr at 0x7f05f649ed90>,
'lr_policy_params': {'learning_rate': 0.0005},
'num_checkpoints': 1,
'num_epochs': 400,
'num_gpus': 2,
'optimizer': 'Momentum',
'optimizer_params': {'momentum': 0.9},
'print_loss_steps': 80,
'print_samples_steps': 80,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7f05fa261e18>,
'regularizer_params': {'scale': 0.001},
'save_checkpoint_steps': 50,
'save_summaries_steps': 10,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': True}
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
*** Trainable variables:
*** ForwardPass/w2l_encoder/conv11/kernel:0
*** shape: (11, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/kernel:0
*** shape: (11, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/kernel:0
*** shape: (13, 64, 64), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/kernel:0
*** shape: (17, 64, 96), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/gamma:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/beta:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/kernel:0
*** shape: (21, 96, 160), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/gamma:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/beta:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/kernel:0
*** shape: (25, 160, 128), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/gamma:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/beta:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/kernel:0
*** shape: (29, 128, 192), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/gamma:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/beta:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/kernel:0
*** shape: (1, 192, 256), <dtype: 'float16_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/gamma:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/beta:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
*** shape: (256, 29), <dtype: 'float16_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
*** shape: (29,), <dtype: 'float16_ref'>
*** Total trainable parameters: 1853725
Loading the base model from w2ltestmpfixlr.
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
SCAFFOLD TYPE: <class 'open_seq2seq.utils.helpers.TransferScaffold'>
2019-02-13 03:17:26.330063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 03:17:26.330127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
2019-02-13 03:17:27.403729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 03:17:27.403768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 1
2019-02-13 03:17:27.403793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N
2019-02-13 03:17:27.404870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
[c08cb0a9b3b6:78454] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:78454] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 03:17:27.874360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 03:17:27.874430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-13 03:17:28.439790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 03:17:28.439843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-13 03:17:28.439869: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-13 03:17:28.440717: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
LOCAL INIT OP name: "group_deps"
op: "NoOp"
input: "^group_deps/NoOp"
input: "^group_deps/NoOp_1"
checkpoint_dir w2ltestmpfixlr
checkpoint_filename_with_path None
Restoring only the variables found in the checkpoint
Restoring from the step 2000
Restoring value to ForwardPass/w2l_encoder/conv21/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv81/kernel
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_variance
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv51/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_mean
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/bias
Restoring value to ForwardPass/w2l_encoder/conv81/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv61/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv21/kernel
Restoring value to ForwardPass/w2l_encoder/conv71/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv41/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv11/kernel
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv31/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv11/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/kernel
Restoring value to ForwardPass/w2l_encoder/conv71/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv31/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/beta
assign_ops [<tf.Tensor 'Assign_79:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_80:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_81:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_82:0' shape=(1, 192, 256) dtype=float16_ref>, <tf.Tensor 'Assign_83:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_84:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_85:0' shape=(256, 29) dtype=float16_ref>, <tf.Tensor 'Assign_86:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_87:0' shape=(21, 96, 160) dtype=float16_ref>, <tf.Tensor 'Assign_88:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_89:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_90:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_91:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_92:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_93:0' shape=(29,) dtype=float16_ref>, <tf.Tensor 'Assign_94:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_95:0' shape=(25, 160, 128) dtype=float16_ref>, <tf.Tensor 'Assign_96:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_97:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_98:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_99:0' shape=(17, 64, 96) dtype=float16_ref>, <tf.Tensor 'Assign_100:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_101:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_102:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_103:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_104:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_105:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_106:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_107:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_108:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_109:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_110:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_111:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_112:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_113:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_114:0' shape=(13, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_115:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_116:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_117:0' shape=(29, 128, 192) dtype=float16_ref>, <tf.Tensor 'Assign_118:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_119:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_120:0' shape=(128,) dtype=float32_ref>]
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 0, global step 0: *** Train loss: 5.8474
time per step = 0:00:0.243
*** Sample WER: 0.2500
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: in the process the suits alllee assets owere overstated and liabilities underted
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 40, global step 80: *** Train loss: 205.0688
time per step = 0:00:0.138
*** Sample WER: 1.0000
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction:
*** Epoch 80, global step 160: *** Train loss: 116.5186
time per step = 0:00:0.101
*** Sample WER: 1.0000
*** Sample target: there was no autopsy period
*** Sample prediction:
*** Epoch 120, global step 240: *** Train loss: 185.9397
time per step = 0:00:0.092
*** Sample WER: 1.0000
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction:
*** Epoch 160, global step 320: *** Train loss: 207.3828
time per step = 0:00:0.104
*** Sample WER: 1.0000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: id
*** Epoch 200, global step 400: *** Train loss: 253.2719
time per step = 0:00:0.100
*** Sample WER: 1.0000
*** Sample target: volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
*** Sample prediction: teeehgh
*** Epoch 240, global step 480: *** Train loss: 161.6801
time per step = 0:00:0.091
*** Sample WER: 1.0000
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: ee hee nyu thtenh
*** Epoch 280, global step 560: *** Train loss: 244.1726
time per step = 0:00:0.101
*** Sample WER: 1.0000
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: veh eeuiytththnengh
*** Epoch 320, global step 640: *** Train loss: 175.6226
time per step = 0:00:0.091
*** Sample WER: 1.0000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: bflffnditif mo nssighht
*** Epoch 360, global step 720: *** Train loss: 175.4331
time per step = 0:00:0.106
*** Sample WER: 1.0000
*** Sample target: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
*** Sample prediction: iooyle eh omig wl eocrepacaienfpupossriit emyy si
*** Finished training
*** Avg time per step: 0.101s
*** Avg objects per second: 25407.544
Still exploding. On the other hand, if the model uses tf.float32, it works normally:
[[36707,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: c08cb0a9b3b6
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from the base model
*** Training config:
{'batch_size_per_gpu': 2,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
'input_type': 'logfbank',
'max_duration': 16.7,
'num_audio_features': 64,
'shuffle': True,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.5,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'initializer': <function xavier_initializer at 0x7fad6a5cbae8>,
'lm_path': 'language_model/4-gram.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': tf.float32,
'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
'encoder_params': {'activation_fn': <function <lambda> at 0x7fad7e2567b8>,
'convnet_layers': [{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [2],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [11],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [13],
'num_channels': 64,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.8,
'kernel_size': [17],
'num_channels': 96,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [21],
'num_channels': 160,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.7,
'kernel_size': [25],
'num_channels': 128,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [2],
'dropout_keep_prob': 0.6,
'kernel_size': [29],
'num_channels': 192,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'},
{'dilation': [1],
'dropout_keep_prob': 0.6,
'kernel_size': [1],
'num_channels': 256,
'padding': 'SAME',
'repeat': 1,
'stride': [1],
'type': 'conv1d'}],
'data_format': 'channels_last',
'dropout_keep_prob': 0.7,
'initializer': <function xavier_initializer at 0x7fad6a5cbae8>,
'initializer_params': {'uniform': False},
'normalization': 'batch_norm'},
'eval_steps': 80,
'iter_size': 1,
'load_model': 'w2ltestmpfixlr',
'logdir': 'w2ltestmpfixlrTofloat',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
'lr_policy': <function fixed_lr at 0x7fad66799d90>,
'lr_policy_params': {'learning_rate': 0.0005},
'num_checkpoints': 1,
'num_epochs': 400,
'num_gpus': 2,
'optimizer': 'Momentum',
'optimizer_params': {'momentum': 0.9},
'print_loss_steps': 80,
'print_samples_steps': 80,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7fad6a54de18>,
'regularizer_params': {'scale': 0.001},
'save_checkpoint_steps': 50,
'save_summaries_steps': 10,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': True}
*** Building graph in Horovod rank: 0
*** Building graph in Horovod rank: 1
*** Trainable variables:
*** ForwardPass/w2l_encoder/conv11/kernel:0
*** shape: (11, 64, 64), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv11/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/kernel:0
*** shape: (11, 64, 64), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv21/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/kernel:0
*** shape: (13, 64, 64), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/gamma:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv31/bn/beta:0
*** shape: (64,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/kernel:0
*** shape: (17, 64, 96), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/gamma:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv41/bn/beta:0
*** shape: (96,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/kernel:0
*** shape: (21, 96, 160), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/gamma:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv51/bn/beta:0
*** shape: (160,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/kernel:0
*** shape: (25, 160, 128), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/gamma:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv61/bn/beta:0
*** shape: (128,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/kernel:0
*** shape: (29, 128, 192), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/gamma:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv71/bn/beta:0
*** shape: (192,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/kernel:0
*** shape: (1, 192, 256), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/gamma:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/w2l_encoder/conv81/bn/beta:0
*** shape: (256,), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
*** shape: (256, 29), <dtype: 'float32_ref'>
*** ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
*** shape: (29,), <dtype: 'float32_ref'>
*** Total trainable parameters: 1853725
Loading the base model from w2ltestmpfixlr.
SCAFFOLD TYPE: <class 'open_seq2seq.utils.helpers.TransferScaffold'>
2019-02-13 03:26:42.043638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 03:26:42.043686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
2019-02-13 03:26:42.056528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 03:26:42.056565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
[c08cb0a9b3b6:30040] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:30040] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 03:26:42.753558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 03:26:42.753611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 1
2019-02-13 03:26:42.753636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N
2019-02-13 03:26:42.754242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
2019-02-13 03:26:42.895000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 03:26:42.895046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-13 03:26:42.895072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-13 03:26:42.895851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
LOCAL INIT OP name: "group_deps"
op: "NoOp"
input: "^group_deps/NoOp"
input: "^group_deps/NoOp_1"
checkpoint_dir w2ltestmpfixlr
checkpoint_filename_with_path None
Restoring only the variables found in the checkpoint
Restoring from the step 2000
Restoring value to ForwardPass/w2l_encoder/conv31/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv61/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv31/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv51/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv81/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv21/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv71/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv11/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_variance
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/bias
Restoring value to ForwardPass/w2l_encoder/conv51/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv41/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv21/kernel
Restoring value to ForwardPass/w2l_encoder/conv71/kernel
Restoring value to ForwardPass/w2l_encoder/conv61/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv11/kernel
Restoring value to ForwardPass/w2l_encoder/conv61/kernel
Restoring value to ForwardPass/w2l_encoder/conv11/bn/gamma
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv51/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv41/kernel
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_variance
assign_ops [<tf.Tensor 'Assign_69:0' shape=(13, 64, 64) dtype=float32_ref>, <tf.Tensor 'Assign_70:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_71:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_72:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_73:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_74:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_75:0' shape=(1, 192, 256) dtype=float32_ref>, <tf.Tensor 'Assign_76:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_77:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_78:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_79:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_80:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_81:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_82:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_83:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_84:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_85:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_86:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_87:0' shape=(29,) dtype=float32_ref>, <tf.Tensor 'Assign_88:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_89:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_90:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_91:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_92:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_93:0' shape=(11, 64, 64) dtype=float32_ref>, <tf.Tensor 'Assign_94:0' shape=(29, 128, 192) dtype=float32_ref>, <tf.Tensor 'Assign_95:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_96:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_97:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_98:0' shape=(11, 64, 64) dtype=float32_ref>, <tf.Tensor 'Assign_99:0' shape=(25, 160, 128) dtype=float32_ref>, <tf.Tensor 'Assign_100:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_101:0' shape=(256, 29) dtype=float32_ref>, <tf.Tensor 'Assign_102:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_103:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_104:0' shape=(21, 96, 160) dtype=float32_ref>, <tf.Tensor 'Assign_105:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_106:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_107:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_108:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_109:0' shape=(17, 64, 96) dtype=float32_ref>, <tf.Tensor 'Assign_110:0' shape=(64,) dtype=float32_ref>]
*** Epoch 0, global step 0: *** Train loss: 4.5979
time per step = 0:00:0.153
*** Sample WER: 0.1667
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: in the h process the suits allege assets were overstated and liabilities unnderstated
*** Epoch 40, global step 80: *** Train loss: 4.3743
time per step = 0:00:0.113
*** Sample WER: 0.1667
*** Sample target: in the process the suits allege assets were overstated and liabilities understated
*** Sample prediction: in the prcess the suits alege assets were overstated and liabilities understated
*** Epoch 80, global step 160: *** Train loss: 3.1033
time per step = 0:00:0.087
*** Sample WER: 0.0000
*** Sample target: there was no autopsy period
*** Sample prediction: there was no autopsy period
*** Epoch 120, global step 240: *** Train loss: 1.8614
time per step = 0:00:0.086
*** Sample WER: 0.0000
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: they have put in a lot of money but is that enough
*** Epoch 160, global step 320: *** Train loss: 1.4958
time per step = 0:00:0.090
*** Sample WER: 0.0000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Epoch 200, global step 400: *** Train loss: 8.3446
time per step = 0:00:0.087
*** Sample WER: 0.1765
*** Sample target: volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
*** Sample prediction: volume on the new ork stock exchange totaled one hundred anod eighty one point eight million sharehs
*** Epoch 240, global step 480: *** Train loss: 3.7830
time per step = 0:00:0.091
*** Sample WER: 0.0000
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: they have put in a lot of money but is that enough
*** Epoch 280, global step 560: *** Train loss: 6.5796
time per step = 0:00:0.086
*** Sample WER: 0.0833
*** Sample target: they have put in a lot of money but is that enough
*** Sample prediction: they have putt in a lot of money but is that enough
*** Epoch 320, global step 640: *** Train loss: 2.0019
time per step = 0:00:0.085
*** Sample WER: 0.0000
*** Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Sample prediction: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Epoch 360, global step 720: *** Train loss: 1.6616
time per step = 0:00:0.088
*** Sample WER: 0.0000
*** Sample target: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
*** Sample prediction: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
*** Finished training
*** Avg time per step: 0.089s
*** Avg objects per second: 28930.083
I would like to confirm that you have pulled #333 to your testing branch?
Yes, I've tried this version, but still in the same situation. Thanks for your help!
I am surprised that mixed -> mixed does not work.
Here are a few more tweaks that you can try to debug where this issue is coming from:
- Can you try using --continue_learning without load_model to see if loss explosion still occurs?
- Can you set restore_all to True on this line: https://github.com/NVIDIA/OpenSeq2Seq/blob/b257b18081da68d7b80bfe2df32f2cfdcb668490/open_seq2seq/utils/helpers.py#L118
- Can you try using SGD without momentum?
- Can you try using a warm-up learning rate? (A rough config sketch of the last two tweaks follows below.)
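For concreteness, the last two tweaks could be expressed in the config roughly as follows. This is only a sketch: it assumes 'SGD' maps to plain gradient descent in the optimizer registry and that the chosen learning-rate policy exposes a warmup_steps argument; check open_seq2seq/optimizers/lr_policies.py for the exact signatures in your version, and treat the numbers as placeholders.

# Sketch of the suggested optimizer / warm-up changes, to merge into base_params.
from open_seq2seq.optimizers.lr_policies import poly_decay

sgd_warmup_overrides = {
    'optimizer': 'SGD',            # plain gradient descent, i.e. no momentum term
    'optimizer_params': {},        # replaces the previous {'momentum': 0.9}
    'lr_policy': poly_decay,
    'lr_policy_params': {
        'learning_rate': 0.0005,
        'power': 2.0,
        'warmup_steps': 200,       # assumed parameter name / placeholder value
    },
}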
Hi @blisc, after trying all the tweaks you mentioned, here are the results:
Pre-trained model: mixed precision, with the SGD optimizer, and with LARC turned off.
- Can you try using --continue_learning without load_model to see if loss explosion still occurs?
Training with "--continue_learning" works fine, regardless of whether the model is mixed precision or tf.float32. In fact, we are sure it worked well even before we tried transfer learning. (The config-side difference between the two start modes is sketched after this reply.)
- Can you set restore_all to True on this line (currently restore_all = False)?
mixed -> mixed: loss explosion. mixed -> tf.float32: the program shows "No enough steps for benchmarking," and it stops.
- Can you try using a warm-up learning rate?
We tried warm-up with LR = 1e-7 (the default was 1e-3), and the loss explosion happens again.
Thank you!
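For readers comparing the two start modes mentioned in this reply: --continue_learning resumes from the checkpoint already present in logdir, while load_model starts a new experiment whose weights are initialized from another experiment's checkpoint. A rough illustration, using the directory names appearing in this thread (exact flag handling should be verified against run.py):

# Transfer-learning start: new logdir, weights initialized from another run.
base_params_transfer = {
    'logdir': 'w2ltestmpfixlrTofloat',   # fresh experiment directory
    'load_model': 'w2ltestmpfixlr',      # experiment to pull initial weights from
}
# Resuming the same experiment instead (no load_model involved):
#   python run.py --config_file=<config> --mode=train_eval --continue_learning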
First of all, if you use Horovod, please set "num_gpus": 1 in the config file.
Next:
"The program shows "No enough steps for benchmarking," and it stops." Do you have "repeat": True," in the eval+params?
Hi @borisgin, got it, we'll try this configuration. By the way, we have also tested this issue on a single-GPU machine without Horovod, and the situation is the same. As for the other point, after trying "repeat": True it still shows "No enough steps for benchmarking." Thanks a lot!
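To make the two configuration points above concrete, here is one way they could be applied. This is only a guess at the intended placement: whether Speech2TextDataLayer actually accepts a 'repeat' key should be checked against its get_optional_params(), and the other values are simply the ones already used in this thread.

# Hypothetical application of the two suggestions above.
base_params_overrides = {
    'use_horovod': True,
    'num_gpus': 1,   # with Horovod the process count comes from mpirun, not num_gpus
}
eval_params = {
    'batch_size_per_gpu': 2,
    'data_layer_params': {
        'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
        'shuffle': False,
        'repeat': True,   # flag suggested above; placement assumed
    },
}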
Hi, I also ran into this problem, and I found that it may be in MixedPrecisionOptimizerWrapper. When I disable the MixedPrecisionOptimizerWrapper (open_seq2seq/optimizers/optimizers.py, line 205), everything works fine: the loss no longer explodes when I use transfer learning. So I think there is a bug there.
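For context on why that wrapper could matter here: mixed-precision optimizer wrappers typically keep float32 "master" copies of the float16 weights, apply the updates to the masters, and then cast back. The sketch below is not the OpenSeq2Seq implementation, only a minimal illustration of that pattern. If a transfer-learning restore assigns new values only to the float16 variables after the masters were built, the two copies diverge and the first optimizer step effectively overwrites the restored weights, which would be consistent with the loss jumping at step 1 under mixed precision only.

# NOT the OpenSeq2Seq code: a minimal, hypothetical illustration of the
# fp32-master-weight pattern used by mixed-precision optimizer wrappers.
import tensorflow as tf

class ToyMixedPrecisionWrapper(tf.train.Optimizer):
    """Keeps float32 masters for float16 variables and updates the masters."""

    def __init__(self, optimizer, loss_scale=1.0, name="ToyMPWrapper"):
        super(ToyMixedPrecisionWrapper, self).__init__(False, name)
        self._optimizer = optimizer
        self._loss_scale = loss_scale
        self._masters = {}  # fp16 variable -> fp32 master copy

    def compute_gradients(self, loss, var_list=None, **kwargs):
        grads_and_vars = self._optimizer.compute_gradients(
            loss * self._loss_scale, var_list=var_list, **kwargs)
        result = []
        for grad, var in grads_and_vars:
            if grad is None:
                result.append((None, var))
            elif var.dtype.base_dtype == tf.float16:
                # The master is created from the fp16 value at graph-build time;
                # a later assign to the fp16 variable alone never reaches it.
                if var not in self._masters:
                    self._masters[var] = tf.Variable(
                        tf.cast(var.initialized_value(), tf.float32),
                        trainable=False, name=var.op.name + '_fp32_master')
                master = self._masters[var]
                result.append(
                    (tf.cast(grad, tf.float32) / self._loss_scale, master))
            else:
                result.append((grad / self._loss_scale, var))
        return result

    def apply_gradients(self, grads_and_vars, global_step=None, name=None):
        update_op = self._optimizer.apply_gradients(
            grads_and_vars, global_step, name)
        # After the masters are updated, overwrite the fp16 copies from them.
        with tf.control_dependencies([update_op]):
            casts = [fp16_var.assign(tf.cast(master, tf.float16))
                     for fp16_var, master in self._masters.items()]
        return tf.group(update_op, *casts)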