[Feature]: Multimodal LLM support
Brief Description
Obviously the end-game here is multimodal LLMs instead of a cascaded approach, but we are not quite there yet.
There are, however, interesting options that are multimodal without being fully end-to-end, e.g. Gazelle or Ultravox, which do voice-to-text: you use a VAD to segment speech, pass the audio data (optionally together with text) to the LLM, and get a text response that can be piped to a synthesizer.
My question is: how would we cleanly integrate this into the current vocode framework, given that there is an explicit separation of concerns between transcriber → agent → synthesizer?
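To make the question concrete, here is a minimal sketch of one possible shape for this: a combined "audio agent" that fuses the transcriber and agent stages, so VAD-gated audio goes straight to a speech-capable LLM and only text flows onward to the synthesizer. All names here (`AudioAgent`, `run_turn`, the `speech_llm` callable) are hypothetical illustrations, not part of the vocode API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AudioAgent:
    """Hypothetical fused transcriber+agent stage.

    speech_llm maps raw audio bytes plus an optional text prompt to a text
    response -- e.g. a wrapper around a Gazelle or Ultravox model. In a real
    pipeline a VAD would gate calls so only speech segments arrive here.
    """
    speech_llm: Callable[[bytes, str], str]
    system_prompt: str = ""

    def respond(self, audio_chunk: bytes) -> str:
        # Pass the audio directly to the multimodal LLM; no intermediate
        # transcript is produced.
        return self.speech_llm(audio_chunk, self.system_prompt)

def run_turn(
    agent: AudioAgent,
    audio: bytes,
    synthesize: Callable[[str], bytes],
) -> bytes:
    """One conversational turn: audio in -> LLM text -> synthesized audio out."""
    text = agent.respond(audio)
    return synthesize(text)
```

The point of the sketch is that the synthesizer interface is untouched: only the transcriber→agent boundary collapses into a single audio-in/text-out stage.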
Rationale
.
Suggested Implementation
No response
I found it rather difficult to extend Vocode to do that.
Completely agree on the end game being multimodal. We're working on it and will have first-class support for different model architectures.
@stillmatic gazelle integration would be awesome :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Thank you for your contributions.