ltu
Code, dataset, and pretrained models for the audio and speech large language model "Listen, Think, and Understand".
Hello, thank you for sharing this research on audio question answering. While testing, I found that there is no evaluation script for the open-set problem in...
Hello, I am trying to set up the LTU-AS system for local inference. I got an error because I only have one GPU; is there a reason why whisper-at is moved...
Hi, thanks for open-sourcing this amazing work. Is there a parameter to parallelize the model so it runs on smaller GPUs? I was not able to find one in the config. As...
Hi, thank you for your wonderful work! I've tried to run "finetune_toy.sh" following this: # prepare toy data and pretrained models ./prep_train.sh # run finetuning on the data ./finetune_toy.sh But...
Are the models downloaded by `inference.sh` 7B (Default) or 13B (Beta)? I found the latter quite error-prone and unstable, which is similar to what I'm observing now locally. I...
Hi, @YuanGongND, thanks for the excellent work. I have carefully read through your paper and I am intrigued by the methodology you employed in generating simulation data. The approach of...
Hello, I've been reading the LTU-AS paper recently, and I'm a bit confused about the ablation experiments mentioned in the paper. It states that using only spoken text as input...
Hello, thank you for your excellent work. I have a few questions about data construction: 1. How are the proportions of QA pairs allocated across the different datasets? For example,...
Why does `pad_or_trim` use 1000 rather than 3000 in `transcribe_audio`? `mel = pad_or_trim(mel, 1000).to(model.device).to(dtype)`
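For context on the numbers in this question: Whisper's log-mel spectrogram runs at 100 frames per second (hop length 160 at 16 kHz), so 3000 frames corresponds to Whisper's standard 30-second window, while 1000 frames corresponds to 10 seconds. A minimal NumPy re-sketch of a `pad_or_trim`-style helper is below; this is an illustrative approximation, not the actual implementation in `whisper.audio` (which also handles torch tensors):

```python
import numpy as np

def pad_or_trim(array, length, axis=-1):
    """Zero-pad or trim `array` along `axis` to exactly `length` entries."""
    if array.shape[axis] > length:
        # trim: keep only the first `length` entries along the axis
        sl = [slice(None)] * array.ndim
        sl[axis] = slice(0, length)
        return array[tuple(sl)]
    # pad: append zeros at the end of the axis up to `length`
    pad_widths = [(0, 0)] * array.ndim
    pad_widths[axis] = (0, length - array.shape[axis])
    return np.pad(array, pad_widths)

# an 80-bin mel spectrogram of 5 s (500 frames) padded to 10 s (1000 frames)
mel = np.zeros((80, 500))
print(pad_or_trim(mel, 1000).shape)  # (80, 1000)
```

So passing 1000 instead of 3000 fixes the input at a 10-second window rather than Whisper's default 30 seconds; whether that matches the clip length used in this repo is a question for the authors.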
Hi: a 'Whisper Decoder' is mentioned in Fig. 1 of the LTU-AS paper, but I don't see the Whisper decoder being used anywhere. Could you please explain why? Thank you!