[Modelling] RoPE and Prompt Cross-Attention
RoPE:
- Applied to the q/k states in the self-attention (the v states are never rotated)
- Applied to the q states only in the cross-attention (not the k/v states)
- The rationale is that the k/v states come from the T5 encoder, whose outputs already carry T5's positional information (see the sketch after this list)
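A minimal PyTorch sketch of this rotation scheme; `build_rope_cache`, `apply_rotary`, and the tensor shapes are illustrative assumptions, not the parler-tts API:

```python
import torch

def build_rope_cache(seq_len: int, head_dim: int, base: float = 10000.0):
    # Standard RoPE frequencies: theta_i = base^(-2i / head_dim)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    freqs = torch.outer(positions, inv_freq)  # (seq_len, head_dim / 2)
    return freqs.cos(), freqs.sin()

def apply_rotary(x, cos, sin):
    # x: (batch, heads, seq_len, head_dim); rotate each consecutive channel pair
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

batch, heads, tgt_len, head_dim = 2, 4, 16, 64
cos, sin = build_rope_cache(tgt_len, head_dim)

# Self-attention: rotate q and k (v is left untouched).
q = torch.randn(batch, heads, tgt_len, head_dim)
k = torch.randn(batch, heads, tgt_len, head_dim)
q, k = apply_rotary(q, cos, sin), apply_rotary(k, cos, sin)

# Cross-attention: rotate q only; k/v come from the T5 encoder and keep
# the positional information the encoder already gave them.
q_cross = apply_rotary(torch.randn(batch, heads, tgt_len, head_dim), cos, sin)
```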
Cross-Attention:
- Option to concatenate the T5 encoder hidden states and the prompt embeddings, and use the result as the cross-attention conditioning
- If we do this, we no longer have to concatenate the prompt embeddings to the decoder input embeddings
- We also apply a positional embedding to the prompt embeddings to encode positional info, since they no longer inherit positions from the decoder input sequence (see the sketch below)
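A minimal sketch of this conditioning path, assuming a learned `nn.Embedding` for the prompt positions; the module and argument names are hypothetical, not the parler-tts implementation:

```python
import torch
import torch.nn as nn

class PromptCrossAttentionConditioning(nn.Module):
    """Build cross-attention k/v inputs by concatenating T5 encoder
    hidden states with positionally-embedded prompt embeddings."""

    def __init__(self, max_prompt_len: int, hidden_dim: int):
        super().__init__()
        # Learned positions for the prompt tokens: they are no longer part of
        # the decoder input sequence, so they need their own positional signal.
        self.prompt_pos_emb = nn.Embedding(max_prompt_len, hidden_dim)

    def forward(self, encoder_hidden_states, prompt_embeds,
                encoder_mask=None, prompt_mask=None):
        # encoder_hidden_states: (batch, enc_len, hidden) from the T5 encoder
        # prompt_embeds:         (batch, prompt_len, hidden)
        prompt_len = prompt_embeds.shape[1]
        positions = torch.arange(prompt_len, device=prompt_embeds.device)
        prompt_embeds = prompt_embeds + self.prompt_pos_emb(positions)

        # One conditioning stream along the sequence axis for cross-attention.
        cond = torch.cat([encoder_hidden_states, prompt_embeds], dim=1)
        mask = None
        if encoder_mask is not None and prompt_mask is not None:
            mask = torch.cat([encoder_mask, prompt_mask], dim=1)
        return cond, mask

# Usage with made-up shapes:
cond_module = PromptCrossAttentionConditioning(max_prompt_len=64, hidden_dim=1024)
enc = torch.randn(2, 50, 1024)   # T5 encoder hidden states
prm = torch.randn(2, 12, 1024)   # prompt embeddings
kv_states, kv_mask = cond_module(enc, prm)
```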