I have some questions about RNNT loss.
Hello, I would like to ask a question that may be somewhat trivial. The shape of the logits for RNN-T loss is (batch, max_seq_len, max_target_len + 1, class). Why is there a +1 on max_target_len? Shouldn't the +1 be on the number of classes instead, i.e. the total vocab size plus one, since blank is included? I don't understand this at all. Can anyone help?
https://pytorch.org/audio/main/generated/torchaudio.functional.rnnt_loss.html
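For concreteness, here is a minimal sketch of calling the loss with the shapes described in the docs. The toy sizes, the blank = 0 choice, and the torchaudio.functional.rnnt_loss call are my own illustrative assumptions, not part of the original question:

```python
import torch
import torchaudio

# Toy sizes (hypothetical): 10 real tokens + 1 blank = 11 classes.
batch, max_seq_len, max_target_len, num_classes = 2, 50, 20, 11

# logits: (batch, max_seq_len, max_target_len + 1, num_classes)
logits = torch.randn(
    batch, max_seq_len, max_target_len + 1, num_classes, requires_grad=True
)

# targets hold token IDs; blank (here index 0) never appears in them.
targets = torch.randint(1, num_classes, (batch, max_target_len), dtype=torch.int32)

logit_lengths = torch.full((batch,), max_seq_len, dtype=torch.int32)
target_lengths = torch.full((batch,), max_target_len, dtype=torch.int32)

loss = torchaudio.functional.rnnt_loss(
    logits, targets, logit_lengths, target_lengths, blank=0
)
print(loss)
```

Note that the +1 sits on the target axis of logits, while num_classes already accounts for blank.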
max_target_len + 1 is not the vocab size. They are two different things.
You can find my implementation at https://github.com/csukuangfj/optimized_transducer/blob/master/optimized_transducer/csrc/cpu.cc#L83
@csukuangfj Thank you.
I phrased that in a misleading way. What I'm actually curious about is why the third dimension of the logits has to be target_length + 1. Looking at your code, I noticed that you use target_length + 1 because it includes a blank label. But isn't blank already included in n_class? (When setting n_class, I think it should be len(vocab) + 1, similar to CTC loss.) I don't quite understand.
You need to differentiate between target length and number of classes. The transcript of an utterance is converted to tokens. The target length is the number of tokens in the transcript; it is not the number of classes. The possible values of a token are in the range [1, num_of_classes - 1].
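To make that distinction concrete, here is a tiny illustrative setup; the vocab and token IDs below are made up, not taken from the linked code:

```python
# Hypothetical character vocab; index 0 is reserved for blank.
vocab = {"<blk>": 0, "c": 1, "a": 2, "t": 3, "s": 4}
num_classes = len(vocab)  # 5: blank + 4 real tokens

transcript = "cats"
targets = [vocab[ch] for ch in transcript]  # [1, 2, 3, 4]
target_length = len(targets)                # 4: number of tokens, not num_classes

# Every token ID lies in [1, num_classes - 1]; blank (0) never appears in targets.
assert all(1 <= t <= num_classes - 1 for t in targets)
```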
So the number of classes should be len(vocab)? I see now. I had misunderstood the mechanism of the RNN-Transducer. Since the model starts from a blank label, that dimension should be target_length + 1.
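A minimal sketch of why the target axis has size target_length + 1: the prediction network is fed a blank/start symbol followed by the target tokens, so it produces U + 1 states that the joiner combines with the T encoder frames. The additive joiner below is a toy assumption, not the torchaudio or optimized_transducer implementation:

```python
import torch

T, U, num_classes, dim = 50, 20, 11, 256  # hypothetical sizes

encoder_out = torch.randn(T, dim)  # one vector per acoustic frame

# The prediction network sees blank/<sos> plus the U target tokens,
# giving U + 1 states; state 0 means "no token emitted yet".
predictor_out = torch.randn(U + 1, dim)

# Joiner: combine every (t, u) pair, then project to the class dimension.
joint = encoder_out.unsqueeze(1) + predictor_out.unsqueeze(0)  # (T, U + 1, dim)
logits = torch.nn.Linear(dim, num_classes)(torch.tanh(joint))  # (T, U + 1, num_classes)
print(logits.shape)  # torch.Size([50, 21, 11])
```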
Great to hear it resolves your issue.
@csukuangfj Thank you for your kindness.