Maha Elbayad

Comments by Maha Elbayad

Do you mean showing the alignment between the input and the output? What modalities/tasks are you looking at?

Hi @barinov274! Although we use an encoder-decoder architecture like Whisper, we didn't train for ASR with timestamp tokens. Our focus is translation, and ASR is treated as S2TT...

Thank you @kauterry for drafting this PR. @yilinyang7, the main thing I thought was missing in SC is this: `speech_output, text_output = expressive_translator.predict(input)`, where a user wouldn't...
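
To make the suggestion concrete, here is a minimal Python sketch of that one-call interface. The class name, constructor arguments, and output containers are illustrative assumptions, not the library's confirmed API; only the `predict` call shape comes from the comment above.

```python
# Hypothetical sketch of the single-call API shape suggested above.
# Class names, constructor arguments, and fields are illustrative; only
# `speech_output, text_output = translator.predict(input)` is from the comment.
from dataclasses import dataclass


@dataclass
class SpeechOutput:
    waveform: list[float]  # synthesized audio samples (placeholder type)
    sample_rate: int


@dataclass
class TextOutput:
    text: str  # translated text


class ExpressiveTranslator:
    """Hides model + vocoder wiring behind a single predict() call."""

    def __init__(self, model_name: str, vocoder_name: str, device: str = "cpu"):
        self.model_name = model_name
        self.vocoder_name = vocoder_name
        self.device = device
        # Model/vocoder loading elided in this sketch.

    def predict(self, input_audio: str) -> tuple[SpeechOutput, TextOutput]:
        # 1) encode the source speech
        # 2) translate (text decoding + unit decoding)
        # 3) vocode the units into a waveform
        raise NotImplementedError("sketch only")


# Intended user-facing ergonomics, mirroring the comment:
# translator = ExpressiveTranslator("expressive_model", "vocoder")
# speech_output, text_output = translator.predict("input.wav")
```

The point of the sketch is ergonomics: the user gets both outputs from one call without having to assemble the model and vocoder components themselves.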

@maherr13, max_seq_len in the T2TT model is set to 1024 subword tokens (see the [NLLB dense_1b config](https://github.com/facebookresearch/fairseq2/blob/c0107bd8a1ebfc2514a8b5f4e64725d1e05c28db/src/fairseq2/models/nllb/builder.py#L88)). That said, sentence-level MT training data is usually short (on average...
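
To illustrate what that 1024-token budget means in practice, here is a hedged Python sketch that counts subwords with SentencePiece and naively splits over-long inputs at sentence boundaries. The tokenizer path and the splitting heuristic are assumptions for the example, not project guidance.

```python
# Sketch: check whether an input would exceed the T2TT max_seq_len of 1024
# subword tokens, and naively split long inputs at sentence boundaries.
# The SentencePiece model path is a placeholder; any NLLB-compatible
# .model file would do.
import re

import sentencepiece as spm

MAX_SEQ_LEN = 1024  # subword tokens, per the NLLB dense_1b config

sp = spm.SentencePieceProcessor(model_file="nllb_tokenizer.model")  # placeholder path


def split_if_too_long(text: str) -> list[str]:
    """Return [text] if it fits the budget, else a naive sentence-level split."""
    if len(sp.encode(text)) <= MAX_SEQ_LEN:
        return [text]
    # Crude sentence segmentation; a real pipeline would use a proper splitter.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sent in sentences:
        candidate = f"{current} {sent}".strip()
        if len(sp.encode(candidate)) > MAX_SEQ_LEN and current:
            chunks.append(current)  # flush the chunk before it overflows
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Note the caveat baked into the sketch: a single sentence longer than 1024 subwords would still overflow its chunk, which is consistent with the point that sentence-level MT training data is usually well under this limit.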