
About the experimental results of the paper LTU-AS

Open yangyuxiang1996 opened this issue 1 year ago • 1 comments

Hello, I've been reading the LTU-AS paper recently, and I'm a bit confused about one of its ablation experiments. The paper states that using only spoken text as input during inference resulted in a WER of 20.0 on LibriSpeech. I'm wondering why it's so high, because it seems like decoding with the original Whisper model shouldn't lead to such a significant performance drop. Thank you!

yangyuxiang1996, Oct 18 '23 07:10

Thanks for the question.

The LTU-AS model is trained with two types of input: [continuous audio token, spoken text], or [continuous audio token only] when the audio clip does not contain speech. It has never seen input like [spoken text only].

In the ablation study you mentioned, the input is spoken text only, without continuous audio tokens. This is a mismatch with the training setting, which causes the model to occasionally not follow the instruction for the ASR task. Those off-task outputs are what lead to the high WER.
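To make the effect concrete, here is a minimal sketch of how a few off-task outputs can inflate corpus-level WER even when every other transcript is perfect. The sentences, the off-task response, and the helper names `edit_distance` and `corpus_wer` are all made up for illustration; this is not the evaluation code or data from the paper, and it does not try to reproduce the 20.0 figure.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(refs, hyps):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


# Hypothetical toy data: five 4-word references.
refs = [
    "the cat sat down",
    "she opened the door",
    "he read the book",
    "they walked home slowly",
    "we ate dinner early",
]

# Four outputs are exact transcripts; one ignores the ASR instruction
# and describes the speaker instead (an invented off-task response).
hyps = list(refs)
hyps[0] = "the speaker sounds calm and happy"

print(f"WER: {corpus_wer(refs, hyps):.2%}")  # prints "WER: 25.00%"
```

In this toy case a single off-task answer out of five contributes 5 word errors against 20 reference words, so the corpus WER is 25% although the other four transcripts are exact.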

-Yuan

YuanGongND, Oct 18 '23 07:10