nonparaSeq2seqVC_code
The mechanism of alignment between text encoder output and audio_seq2seq output
Hi Zhang, could you please explain how the text encoder output and the recognition encoder output are aligned? Your paper states: "The recognition encoder Er is a seq2seq neural network which aligns the acoustic and phoneme sequences automatically." I couldn't figure out how the code does this. Thank you in advance!
Hi, what I mean by that is that the recognition encoder is a seq2seq network with an attention module, and the attention is what aligns the acoustic and phoneme sequences. Its definition is here: https://github.com/jxzhanggg/nonparaSeq2seqVC_code/blob/4c03a6be3bc76207b7cf8222c985dc85c7018cde/pre-train/model/layers.py#L216-L456
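For readers who want the general idea before reading the linked code, here is a minimal sketch of how an attention module produces a soft alignment in a seq2seq model. It is not the repository's implementation: plain dot-product attention is swapped in for brevity, and the names `softmax`, `attend`, `memory`, and `queries`, as well as the toy numbers, are hypothetical. At each decoder step (one per output phoneme) the decoder state acts as a query, every acoustic frame gets a weight, and the rows of weights stacked over all steps form the alignment matrix.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """One decoder step of dot-product attention (an assumed simplification).

    Returns (weights, context): weights has one entry per frame and sums to 1;
    context is the weighted sum of the frames.
    """
    # Similarity between the decoder query and each acoustic frame.
    scores = [sum(q * m for q, m in zip(query, frame)) for frame in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    context = [
        sum(w * frame[d] for w, frame in zip(weights, memory))
        for d in range(dim)
    ]
    return weights, context


# Hypothetical toy data: 4 acoustic frames and 2 decoder (phoneme) steps.
memory = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
queries = [[2.0, 0.0], [0.0, 2.0]]

# alignment[t][n] is how strongly phoneme step t attends to frame n.
alignment = [attend(q, memory)[0] for q in queries]

for t, row in enumerate(alignment):
    print(t, [round(w, 3) for w in row])
```

In this toy case the first step puts almost all of its weight on frames 0 and 1 and the second step on frames 2 and 3, which is the sense in which the sequences are aligned "automatically": nothing supplies the frame-to-phoneme mapping explicitly, it falls out of the learned attention weights.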