build_deepSpeech2 function bug?
The first three conv layers expect the input to be shaped [batch, freq_bin, time_len, in_channels]:
''' Parameters:
maxTimeSteps: maximum time steps of input spectrogram power
inputX: spectrogram power of audios, [batch, freq_bin, time_len, in_channels]
seqLengths: lengths of samples in a mini-batch
'''
# 3 2-D convolution layers
layer1_filter = tf.get_variable('layer1_filter', shape=(41, 11, 1, 32), dtype=tf.float32)
layer1_stride = [1, 2, 2, 1]
layer2_filter = tf.get_variable('layer2_filter', shape=(21, 11, 32, 32), dtype=tf.float32)
layer2_stride = [1, 2, 1, 1]
layer3_filter = tf.get_variable('layer3_filter', shape=(21, 11, 32, 96), dtype=tf.float32)
layer3_stride = [1, 2, 1, 1]
layer1 = tf.nn.conv2d(inputX, layer1_filter, layer1_stride, padding='SAME')
layer1 = tf.layers.batch_normalization(layer1, training=args.is_training)
layer1 = tf.contrib.layers.dropout(layer1, keep_prob=args.keep_prob[0], is_training=args.is_training)
layer2 = tf.nn.conv2d(layer1, layer2_filter, layer2_stride, padding='SAME')
layer2 = tf.layers.batch_normalization(layer2, training=args.isTraining)
layer2 = tf.contrib.layers.dropout(layer2, keep_prob=args.keep_prob[1], is_training=args.is_training)
layer3 = tf.nn.conv2d(layer2, layer3_filter, layer3_stride, padding='SAME')
layer3 = tf.layers.batch_normalization(layer3, training=args.isTraining)
layer3 = tf.contrib.layers.dropout(layer3, keep_prob=args.keep_prob[2], is_training=args.is_training)
However, the RNN layers expect their input to be shaped [max_time, batch_size, ...]:
# 4 recurrent layers
# inputs must be [max_time, batch_size ,...]
layer4_cell = cell_fn(args.num_hidden, activation=args.activation)
layer4 = tf.nn.dynamic_rnn(layer4_cell, layer3, sequence_length=seqLengths, time_major=True)
layer4 = tf.layers.batch_normalization(layer4, training=args.isTraining)
layer4 = tf.contrib.layers.dropout(layer4, keep_prob=args.keep_prob[3], is_training=args.is_training)
But I don't see any transpose that moves time to the first axis. After the conv layers the tensor is still laid out as [batch, freq_bin, time_len, 96], with time on the third axis, so is this a bug?
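For reference, this is roughly the transpose/reshape I would expect between layer3 and the RNN (just my own sketch continuing from layer3 above, not code from this project):

# layer3 comes out of the conv stack as [batch, freq_bin, time_len, 96];
# dynamic_rnn with time_major=True wants [max_time, batch, features]
layer3 = tf.transpose(layer3, perm=[2, 0, 1, 3])      # [time_len, batch, freq_bin, 96]
dims = tf.shape(layer3)
layer3 = tf.reshape(layer3, [dims[0], dims[1], -1])   # [time_len, batch, freq_bin * 96]
# seqLengths would also have to be scaled down by the time stride of layer1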
P.S.: the quoted code is located in models/DeepSpeech2.py.
And there is more:
1. In the Deep Speech 2 paper they use min(relu(x), 20) as the activation function in the convolution layers, but in this project I don't see any activation applied to the conv layers, which is really strange (see the sketch after this list).
2. num_layer seems to be completely unused in the DeepSpeech2 model.
3. The Deep Speech 2 paper doesn't mention using any dropout after batch_norm, and normally the two aren't used together.
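For point 1, the clipped ReLU from the paper would look something like this (my own sketch, not code from this project):

def clipped_relu(x, clip_value=20.0):
    # min(relu(x), 20), the conv-layer activation described in the Deep Speech 2 paper
    return tf.minimum(tf.nn.relu(x), clip_value)

# e.g. applied after each conv + batch_norm:
# layer1 = clipped_relu(tf.layers.batch_normalization(layer1, training=args.is_training))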