SLAM-LLM

Do you have any plans for Speech-to-Text or Speech-to-Speech end-to-end models?

Open Irvingao opened this issue 1 year ago • 6 comments

🚀 The feature, motivation and pitch

As we all know, GPT-4o is an end-to-end multimodal model that supports speech-to-text and speech-to-speech. I have some ideas about it:

  1. Speech to Text: Could we combine a pretrained ASR encoder with a trainable linear projection to make Speech to Text possible?
  2. Speech to Speech: Align the pretrained ASR decoder with the main LLM backbone.
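
The first idea can be sketched roughly as follows. This is a minimal illustration, not SLAM-LLM's actual implementation: `ToyEncoder` is a hypothetical stand-in for a pretrained ASR encoder, `SpeechToLLMInputs` is a hypothetical wrapper name, and all dimensions are made up. The encoder is frozen, only the linear projection is trainable, and the projected speech frames are concatenated with text prompt embeddings to form the LLM input sequence.

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained ASR encoder (assumption, not a real checkpoint)."""

    def __init__(self, n_mels: int = 80, enc_dim: int = 256):
        super().__init__()
        # A strided convolution that roughly halves the number of frames.
        self.conv = nn.Conv1d(n_mels, enc_dim, kernel_size=3, stride=2, padding=1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, ~frames / 2, enc_dim)
        return torch.relu(self.conv(mel)).transpose(1, 2)


class SpeechToLLMInputs(nn.Module):
    """Frozen speech encoder + trainable linear projection into the LLM embedding space."""

    def __init__(self, encoder: nn.Module, enc_dim: int, llm_dim: int):
        super().__init__()
        self.encoder = encoder
        # Freeze the encoder so that only the projector receives gradients.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.projector = nn.Linear(enc_dim, llm_dim)

    def forward(self, mel: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(mel)
        speech_embeds = self.projector(feats)
        # Prepend the projected speech frames to the text prompt embeddings.
        return torch.cat([speech_embeds, prompt_embeds], dim=1)


torch.manual_seed(0)
model = SpeechToLLMInputs(ToyEncoder(), enc_dim=256, llm_dim=512)
mel = torch.randn(2, 80, 100)           # 2 utterances, 80 mel bins, 100 frames
prompt_embeds = torch.randn(2, 5, 512)  # 5 prompt tokens, already embedded
inputs_embeds = model(mel, prompt_embeds)
print(inputs_embeds.shape)  # torch.Size([2, 55, 512])

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['projector.weight', 'projector.bias']
```

In a real setup, the resulting sequence would be fed to a decoder-only LLM that accepts precomputed input embeddings, and the loss would be computed only on the text response tokens.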

Alternatives

No response

Additional context

No response

Irvingao avatar May 21 '24 05:05 Irvingao

For your first idea, I think the ASR example has already done it.

byrTony-Frankzyq avatar May 21 '24 15:05 byrTony-Frankzyq

> For your first idea, I think the ASR example has already done it.

I mean speech inputs with LLM outputs.

Irvingao avatar May 21 '24 16:05 Irvingao

> > For your first idea, I think the ASR example has already done it.
>
> I mean speech inputs with LLM outputs.

By "text", you mean the response, right?

byrTony-Frankzyq avatar May 21 '24 16:05 byrTony-Frankzyq

> > > For your first idea, I think the ASR example has already done it.
> >
> > I mean speech inputs with LLM outputs.
>
> By "text", you mean the response, right? Though I don't fully understand.

Exactly.

Irvingao avatar May 21 '24 16:05 Irvingao

Are you talking about ASR for the speech-to-text task? If so, you can try our ASR example.

We may support speech-to-speech in the future, but this task is much more difficult than ASR or TTS alone; it is more like combining the two seamlessly. Thank you for your advice; we will take it into consideration.

If you have any further questions or need additional assistance, feel free to ask!

zszheng147 avatar May 22 '24 01:05 zszheng147

I fine-tuned a model with the SLAM framework and ran inference with it. Why are the test results on LibriSpeech not as good as those from using the open-source Whisper model directly?

Learneducn avatar Jul 24 '24 06:07 Learneducn

I found one that supports both S2T and S2S simultaneously: https://github.com/MooreThreads/MooER

gpt4o-tech avatar Nov 07 '24 07:11 gpt4o-tech