Why does lingvo use a binary mask instead of sequence lengths for representing invalid regions of sequences?
I've noticed that the RNN code in lingvo (in particular for ASR tasks) uses binary padding masks to represent the validity of data in a sequence when batching.
To be explicit, if you have data, a tensor of shape (max_sequence_length, batch_size, feature_size), then padding is a tensor of shape (max_sequence_length, batch_size). The vector at data[i][j] is valid only if padding[i][j] == 0.
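For example (made-up numbers), a batch of two sequences with lengths 4 and 2 and max_sequence_length 4 would have:

padding = [[0., 0.],
           [0., 0.],
           [0., 1.],
           [0., 1.]]   # shape (max_sequence_length, batch_size)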
Meanwhile, in the rest of deep learning land, I've always seen the validity of different-length sequences in a minibatch described via a "length" tensor of shape (batch_size,). This obviously uses a lot less memory.
What is the motivation behind this? If I had to guess, it is to make TPUs happy, but I don't know enough about TPU microarchitecture to say. I would like to create a CTC model on top of lingvo, but the CTC loss function https://www.tensorflow.org/api_docs/python/tf/nn/ctc_loss requires a length tensor rather than a mask tensor. Do you have any recommendations for how to convert the padding-mask representation to a length representation?
This has to do with our custom recurrent implementation, https://github.com/tensorflow/lingvo/blob/master/lingvo/core/recurrent.py, which processes the tensors one timestep at a time.
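For some intuition about why a mask is convenient there: at step t the loop can slice out padding[t] and blend the old and new state directly, with no comparison against a length vector. A rough sketch (not lingvo's actual cell code; the weights and shapes here are made up):

import tensorflow as tf

batch_size, feature_size, hidden_size = 2, 3, 4
w_x = tf.random.normal([feature_size, hidden_size])
w_h = tf.random.normal([hidden_size, hidden_size])

def step(prev_state, x_t, padding_t):
  # x_t: [batch_size, feature_size]; padding_t: [batch_size, 1], 1.0 = padded.
  candidate = tf.tanh(tf.matmul(x_t, w_x) + tf.matmul(prev_state, w_h))
  # Padded positions keep their previous state; valid positions take the update.
  return padding_t * prev_state + (1.0 - padding_t) * candidate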
Conversion is straightforward:
import tensorflow as tf

# padding: [max_sequence_length, batch_size], with 1.0 marking padded steps.
max_length = tf.shape(padding)[0]
# Count the valid (non-padded) steps in each sequence.
lengths = tf.cast(tf.reduce_sum(1.0 - padding, axis=0), tf.int32)
# And back again: sequence_mask is 1 for valid steps, so invert it.
padding = 1.0 - tf.sequence_mask(lengths, max_length, dtype=tf.float32)
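From there the lengths can go straight into the loss. A minimal sketch assuming the TF2-style tf.nn.ctc_loss signature, with made-up shapes and labels:

import tensorflow as tf

# Illustrative shapes and values only.
max_time, batch_size, num_classes = 6, 2, 5
logits = tf.random.normal([max_time, batch_size, num_classes])  # time-major
# lingvo-style padding: 1.0 marks padded frames.
padding = tf.constant([[0., 0.],
                       [0., 0.],
                       [0., 0.],
                       [0., 1.],
                       [1., 1.],
                       [1., 1.]])
# Frame counts recovered from the padding: [4, 3].
logit_lengths = tf.cast(tf.reduce_sum(1.0 - padding, axis=0), tf.int32)

# Sparse label ids per utterance (zero entries are dense padding and get dropped).
labels = tf.sparse.from_dense(tf.constant([[1, 2, 3],
                                           [2, 4, 0]]))
loss = tf.nn.ctc_loss(labels=labels,
                      logits=logits,
                      label_length=None,        # not needed for SparseTensor labels
                      logit_length=logit_lengths,
                      logits_time_major=True,
                      blank_index=0)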