build_deepSpeech2 function bug?
The first three conv layers expect the input to be shaped [batch, freq_bin, time_len, in_channels]:
''' Parameters:
maxTimeSteps: maximum time steps of input spectrogram power
inputX: spectrogram power of audios, [batch, freq_bin, time_len, in_channels]
seqLengths: lengths of samples in a mini-batch
'''
# 3 2-D convolution layers
layer1_filter = tf.get_variable('layer1_filter', shape=(41, 11, 1, 32), dtype=tf.float32)
layer1_stride = [1, 2, 2, 1]
layer2_filter = tf.get_variable('layer2_filter', shape=(21, 11, 32, 32), dtype=tf.float32)
layer2_stride = [1, 2, 1, 1]
layer3_filter = tf.get_variable('layer3_filter', shape=(21, 11, 32, 96), dtype=tf.float32)
layer3_stride = [1, 2, 1, 1]
layer1 = tf.nn.conv2d(inputX, layer1_filter, layer1_stride, padding='SAME')
layer1 = tf.layers.batch_normalization(layer1, training=args.is_training)
layer1 = tf.contrib.layers.dropout(layer1, keep_prob=args.keep_prob[0], is_training=args.is_training)
layer2 = tf.nn.conv2d(layer1, layer2_filter, layer2_stride, padding='SAME')
layer2 = tf.layers.batch_normalization(layer2, training=args.isTraining)
layer2 = tf.contrib.layers.dropout(layer2, keep_prob=args.keep_prob[1], is_training=args.is_training)
layer3 = tf.nn.conv2d(layer2, layer3_filter, layer3_stride, padding='SAME')
layer3 = tf.layers.batch_normalization(layer3, training=args.isTraining)
layer3 = tf.contrib.layers.dropout(layer3, keep_prob=args.keep_prob[2], is_training=args.is_training)
However, the RNN layers expect their input to be shaped [max_time, batch_size, ...]:
# 4 recurrent layers
# inputs must be [max_time, batch_size ,...]
layer4_cell = cell_fn(args.num_hidden, activation=args.activation)
layer4 = tf.nn.dynamic_rnn(layer4_cell, layer3, sequence_length=seqLengths, time_major=True)
layer4 = tf.layers.batch_normalization(layer4, training=args.isTraining)
layer4 = tf.contrib.layers.dropout(layer4, keep_prob=args.keep_prob[3], is_training=args.is_training)
But I don't see any transpose that moves time to the first axis. After the conv layers the tensor is still laid out as [batch, freq_bin, time_len, 96], with time on the third axis, so is this a bug?
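For reference, this is roughly the transpose/reshape I would expect between layer3 and the RNN (just my own sketch continuing from layer3 above, not code from this project):

# layer3 comes out of the conv stack as [batch, freq_bin, time_len, 96];
# dynamic_rnn with time_major=True wants [max_time, batch, features]
layer3 = tf.transpose(layer3, perm=[2, 0, 1, 3])      # [time_len, batch, freq_bin, 96]
dims = tf.shape(layer3)
layer3 = tf.reshape(layer3, [dims[0], dims[1], -1])   # [time_len, batch, freq_bin * 96]
# seqLengths would also have to be scaled down by the time stride of layer1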
P.S.: the quoted code is located in models/DeepSpeech2.py.
And there is more:
1. In the Deep Speech 2 paper they use min(relu(x), 20) as the activation function in the convolution layers, but in this project I don't see any activation applied to the conv layers, which is really strange (see the sketch after this list).
2. num_layer seems to be completely unused in the DeepSpeech2 model.
3. The Deep Speech 2 paper doesn't mention using any dropout after batch_norm, and normally the two aren't used together.
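For point 1, the clipped ReLU from the paper would look something like this (my own sketch, not code from this project):

def clipped_relu(x, clip_value=20.0):
    # min(relu(x), 20), the conv-layer activation described in the Deep Speech 2 paper
    return tf.minimum(tf.nn.relu(x), clip_value)

# e.g. applied after each conv + batch_norm:
# layer1 = clipped_relu(tf.layers.batch_normalization(layer1, training=args.is_training))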