Aggressive transcript mode / text response only mode
I think a common use case is toggling between voice and text modes (as in the ChatGPT app, among others).
If the goal is to create a multimodal framework that can easily toggle between modalities, it would be great to have a way to disable voice synthesis.
Right now, I am just muting the synthesized voice. This is OK for UX, but it still incurs voice synthesis costs and delays the transcripts. We could deliver assistant responses to the user much faster by streaming the transcript aggressively instead of waiting on the voice synthesis timing.
These two issues could be solved together or separately. I am working on how this would be handled in the framework, but would love to hear others' thoughts!
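
To make the two ideas concrete, here is a minimal sketch that is not tied to the framework's actual API. Every name in it (`deliver_response`, `send_transcript`, `synthesize`) is hypothetical. It assumes the assistant response arrives as an async stream of text chunks, and that `synthesize`, when provided, only enqueues text for synthesis and returns quickly.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Optional

# Hypothetical callback type; not part of any real framework API.
TextSink = Callable[[str], Awaitable[None]]


async def deliver_response(
    chunks: AsyncIterator[str],
    send_transcript: TextSink,
    synthesize: Optional[TextSink] = None,
) -> str:
    """Forward each text chunk to the transcript as soon as it arrives.

    Passing synthesize=None models text-only mode: no synthesis call is
    made at all, so there is no synthesis cost to pay.
    """
    parts = []
    async for chunk in chunks:
        # Aggressive transcript: sent immediately, not paced to audio timing.
        await send_transcript(chunk)
        if synthesize is not None:
            # Assumed to enqueue the text for synthesis and return quickly.
            await synthesize(chunk)
        parts.append(chunk)
    return "".join(parts)
```

Because the two behaviours are controlled separately here (the transcript is always forwarded immediately, and synthesis is an optional callback), they could be adopted together or one at a time.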