SLAM-LLM
Do you have any plan about Speech to Text or Speech to Speech End2End models?
🚀 The feature, motivation and pitch
As we all know, GPT-4o is an end-to-end multimodal model that supports speech-to-text and speech-to-speech. I have some ideas about it:
- Speech-to-Text: Could we try combining a pretrained ASR encoder with a trainable linear projection to make speech-to-text possible? (A sketch follows this list.)
- Speech-to-Speech: Align the pretrained ASR decoder with the main LLM backbone.
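A minimal sketch of the first idea, assuming Hugging Face `transformers`; the model names, the `SpeechToTextBridge` class, and the single-linear-layer projector are illustrative assumptions, not SLAM-LLM's actual implementation:

```python
# Sketch: frozen Whisper encoder + trainable linear projector + frozen LLM.
# Only the projector's parameters receive gradients.
import torch.nn as nn
from transformers import WhisperModel, AutoModelForCausalLM

class SpeechToTextBridge(nn.Module):  # hypothetical name
    def __init__(self,
                 asr_name="openai/whisper-small",       # illustrative choice
                 llm_name="meta-llama/Llama-2-7b-hf"):  # illustrative choice
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(asr_name).encoder
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        for p in self.encoder.parameters():   # freeze the pretrained ASR encoder
            p.requires_grad = False
        for p in self.llm.parameters():       # freeze the LLM backbone
            p.requires_grad = False
        # The only trainable piece: map encoder features to the LLM embedding size.
        self.proj = nn.Linear(self.encoder.config.d_model,
                              self.llm.config.hidden_size)

    def forward(self, input_features):
        # input_features: log-mel spectrogram, shape (batch, n_mels, frames)
        speech = self.encoder(input_features).last_hidden_state
        speech_embeds = self.proj(speech)     # (batch, frames, d_llm)
        # A real setup would prepend a text prompt and train with a CE loss
        # on the transcript tokens; here we just run the LLM on speech embeds.
        return self.llm(inputs_embeds=speech_embeds)
```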
Alternatives
No response
Additional context
No response
For your first idea, I think the ASR example has already done it.

I mean speech inputs with LLM outputs.

Your "text" means the response, right? Though I don't fully understand.

Exactly.
Are you talking about ASR for the speech-to-text task? If so, you can try our ASR example.
We may support speech-to-speech in the future, but this task is much more difficult than ASR or TTS; it is more like combining the two seamlessly (a naive cascade is sketched below). Thank you for your advice; we will take it into consideration.
If you have any further questions or need additional assistance, feel free to ask!
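For context, a naive cascade of the two tasks might look like the following. The library choices (openai-whisper, transformers, pyttsx3) are assumptions for illustration only; an end-to-end model would replace the intermediate text hand-offs with learned representations:

```python
# Naive cascaded speech-to-speech pipeline (ASR -> LLM -> TTS), shown only
# to contrast with the end-to-end approach discussed above.
import whisper                      # pip install openai-whisper
import pyttsx3                      # pip install pyttsx3
from transformers import pipeline

def speech_to_speech(audio_path: str) -> None:
    # 1. ASR: speech -> text
    asr = whisper.load_model("base")
    transcript = asr.transcribe(audio_path)["text"]

    # 2. LLM: text prompt -> text response (gpt2 is a stand-in model)
    llm = pipeline("text-generation", model="gpt2")
    response = llm(transcript, max_new_tokens=50)[0]["generated_text"]

    # 3. TTS: text -> speech, played aloud
    tts = pyttsx3.init()
    tts.say(response)
    tts.runAndWait()
```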
I used the SLAM framework for fine-tuning and then ran inference. Why are the test results on LibriSpeech not as good as directly using the open-source Whisper model?
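One thing worth checking before comparing numbers is that both systems are scored with identical text normalization (casing, punctuation), since mismatched normalization alone can inflate WER. A minimal sketch of scoring the open-source Whisper baseline on the same data, assuming the `jiwer` package; the path and reference string are placeholders:

```python
# Score open-source Whisper on a LibriSpeech utterance with jiwer, so the
# baseline and the fine-tuned model use the same WER computation.
import string
import whisper              # pip install openai-whisper
from jiwer import wer       # pip install jiwer

model = whisper.load_model("small")                    # illustrative size
audio_path = "path/to/test-clean/utterance.flac"       # placeholder path
reference = "ground truth transcript from the .trans.txt file"  # placeholder

hypothesis = model.transcribe(audio_path)["text"]

# Whisper outputs casing and punctuation; LibriSpeech references have neither.
# Normalize both sides the same way before scoring.
def norm(s: str) -> str:
    return s.lower().translate(str.maketrans("", "", string.punctuation)).strip()

print("WER:", wer(norm(reference), norm(hypothesis)))
```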
I found one that supports both S2T and S2S simultaneously: https://github.com/MooreThreads/MooER