SLAM-LLM
Do you have any plan about Speech to Text or Speech to Speech End2End models?
🚀 The feature, motivation and pitch
As we all know, GPT-4o is an end-to-end multimodal model that supports speech-to-text and speech-to-speech. I have some ideas about it:
- Speech-to-Text: Could we try combining a pretrained ASR encoder with a trainable linear projection to make speech-to-text possible? (A sketch follows this list.)
- Speech-to-Speech: Align the pretrained ASR decoder with the main LLM backbone.
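A minimal sketch of the first idea, assuming Hugging Face `transformers`; the model names, the `SpeechToTextBridge` class, and the single-linear-layer projector are illustrative assumptions, not SLAM-LLM's actual implementation:

```python
# Sketch: frozen Whisper encoder + trainable linear projector + frozen LLM.
# Only the projector's parameters receive gradients.
import torch.nn as nn
from transformers import WhisperModel, AutoModelForCausalLM

class SpeechToTextBridge(nn.Module):  # hypothetical name
    def __init__(self,
                 asr_name="openai/whisper-small",       # illustrative choice
                 llm_name="meta-llama/Llama-2-7b-hf"):  # illustrative choice
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(asr_name).encoder
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        for p in self.encoder.parameters():   # freeze the pretrained ASR encoder
            p.requires_grad = False
        for p in self.llm.parameters():       # freeze the LLM backbone
            p.requires_grad = False
        # The only trainable piece: map encoder features to the LLM embedding size.
        self.proj = nn.Linear(self.encoder.config.d_model,
                              self.llm.config.hidden_size)

    def forward(self, input_features):
        # input_features: log-mel spectrogram, shape (batch, n_mels, frames)
        speech = self.encoder(input_features).last_hidden_state
        speech_embeds = self.proj(speech)     # (batch, frames, d_llm)
        # A real setup would prepend a text prompt and train with a CE loss
        # on the transcript tokens; here we just run the LLM on speech embeds.
        return self.llm(inputs_embeds=speech_embeds)
```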
Alternatives
No response
Additional context
No response
For your first idea, I think the ASR example has already done it.

I mean speech inputs with LLM outputs.

Your "text" means the response, right? Though I don't fully understand.

Exactly.
Are you talking about ASR for the speech-to-text task? If so, you can try our ASR example.
We may support speech-to-speech in the future, but this task is much more difficult than ASR or TTS; it is more like combining the two seamlessly (a naive cascade is sketched below). Thank you for your advice; we will take it into consideration.
If you have any further questions or need additional assistance, feel free to ask!
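For context, a naive cascade of the two tasks might look like the following. The library choices (openai-whisper, transformers, pyttsx3) are assumptions for illustration only; an end-to-end model would replace the intermediate text hand-offs with learned representations:

```python
# Naive cascaded speech-to-speech pipeline (ASR -> LLM -> TTS), shown only
# to contrast with the end-to-end approach discussed above.
import whisper                      # pip install openai-whisper
import pyttsx3                      # pip install pyttsx3
from transformers import pipeline

def speech_to_speech(audio_path: str) -> None:
    # 1. ASR: speech -> text
    asr = whisper.load_model("base")
    transcript = asr.transcribe(audio_path)["text"]

    # 2. LLM: text prompt -> text response (gpt2 is a stand-in model)
    llm = pipeline("text-generation", model="gpt2")
    response = llm(transcript, max_new_tokens=50)[0]["generated_text"]

    # 3. TTS: text -> speech, played aloud
    tts = pyttsx3.init()
    tts.say(response)
    tts.runAndWait()
```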
I used the SLAM framework for fine-tuning and then ran inference. Why are the test results on LibriSpeech not as good as directly using the open-source Whisper model?
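One thing worth checking before comparing numbers is that both systems are scored with identical text normalization (casing, punctuation), since mismatched normalization alone can inflate WER. A minimal sketch of scoring the open-source Whisper baseline on the same data, assuming the `jiwer` package; the path and reference string are placeholders:

```python
# Score open-source Whisper on a LibriSpeech utterance with jiwer, so the
# baseline and the fine-tuned model use the same WER computation.
import string
import whisper              # pip install openai-whisper
from jiwer import wer       # pip install jiwer

model = whisper.load_model("small")                    # illustrative size
audio_path = "path/to/test-clean/utterance.flac"       # placeholder path
reference = "ground truth transcript from the .trans.txt file"  # placeholder

hypothesis = model.transcribe(audio_path)["text"]

# Whisper outputs casing and punctuation; LibriSpeech references have neither.
# Normalize both sides the same way before scoring.
def norm(s: str) -> str:
    return s.lower().translate(str.maketrans("", "", string.punctuation)).strip()

print("WER:", wer(norm(reference), norm(hypothesis)))
```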
I found one that supports both S2T and S2S simultaneously: https://github.com/MooreThreads/MooER