[Feature]: Multimodal LLM support
Brief Description
Obviously the end-game here is multimodal LLMs instead of a cascaded approach, but we are not quite there yet.
There are, however, interesting options that are multimodal without being fully end-to-end, e.g. Gazelle or Ultravox, which do voice-to-text: you use a VAD to segment speech, pass the audio data (optionally together with text) to the LLM, and get a text response that can be piped to a synthesizer.
My question is: how would we cleanly integrate this into the current vocode framework, given that there is an explicit separation of concerns between transcriber → agent → synthesizer?
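To make the question concrete, here is a minimal sketch of one possible shape for this: a combined "audio agent" that fuses the transcriber and agent stages, so VAD-gated audio goes straight to a speech-capable LLM and only text flows onward to the synthesizer. All names here (`AudioAgent`, `run_turn`, the `speech_llm` callable) are hypothetical illustrations, not part of the vocode API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AudioAgent:
    """Hypothetical fused transcriber+agent stage.

    speech_llm maps raw audio bytes plus an optional text prompt to a text
    response -- e.g. a wrapper around a Gazelle or Ultravox model. In a real
    pipeline a VAD would gate calls so only speech segments arrive here.
    """
    speech_llm: Callable[[bytes, str], str]
    system_prompt: str = ""

    def respond(self, audio_chunk: bytes) -> str:
        # Pass the audio directly to the multimodal LLM; no intermediate
        # transcript is produced.
        return self.speech_llm(audio_chunk, self.system_prompt)

def run_turn(
    agent: AudioAgent,
    audio: bytes,
    synthesize: Callable[[str], bytes],
) -> bytes:
    """One conversational turn: audio in -> LLM text -> synthesized audio out."""
    text = agent.respond(audio)
    return synthesize(text)
```

The point of the sketch is that the synthesizer interface is untouched: only the transcriber→agent boundary collapses into a single audio-in/text-out stage.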
Rationale
.
Suggested Implementation
No response
I found it rather difficult to extend Vocode to do that.
Completely agree on the end game being multimodal. We're working on it and will have first-class support for different model architectures.
@stillmatic gazelle integration would be awesome :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Thank you for your contributions.