
Continued Conversation with Context?

Open vash2695 opened this issue 1 year ago • 9 comments

Loving this integration so far! I'm not sure if this is the right link in my particular chain, but I'm looking for some way of extending some of the capabilities of the voice assistant functionality and this seems like a good place to start!

For starters, here's what I'm using in my pipeline: Openwakeword > HASS cloud STT > OpenAI Extended > Elevenlabs TTS

This setup is fast(ish) and smart, but losing the thread on each wake is highly inefficient and limits functionality in many ways. Since we know that the chat interface can maintain a conversation within a single session, I'd like to know if it's possible to do the same with voice interaction. I think this would involve two parts:

  1. Add the option to bypass the wake word on subsequent interactions after an initial activation
     a. Stops if no text is detected or specific phrases like "Thank you" or "Nevermind" are used
  2. After interaction has stopped, the conversation thread should be kept open for a certain amount of time to maintain continuity and avoid the need to resend entity information if another activation takes place within that window (a rough sketch of this idea follows the list)
     a. Doesn't need to be long, maybe a minute or so after the last interaction ends
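A rough sketch of the conversation-window idea from point 2, assuming a small helper that decides whether to reuse the previous conversation_id within a configurable timeout (the class name, method names, and 60-second default are made up for illustration and are not part of StreamAssist):

```python
import time


class ConversationWindow:
    """Keep a conversation_id alive for a short window after the last interaction.

    Hypothetical helper for illustration only; the names and the 60-second
    default are not part of StreamAssist.
    """

    def __init__(self, timeout: float = 60.0) -> None:
        self.timeout = timeout
        self._conversation_id: str | None = None
        self._last_interaction: float = 0.0

    def get_conversation_id(self) -> str | None:
        """Return the previous conversation_id if the window is still open."""
        if self._conversation_id and time.monotonic() - self._last_interaction < self.timeout:
            return self._conversation_id
        return None  # window expired: start a new conversation thread

    def update(self, conversation_id: str) -> None:
        """Record the id and reset the window after each completed interaction."""
        self._conversation_id = conversation_id
        self._last_interaction = time.monotonic()
```

On each activation the integration would call get_conversation_id() to decide whether to pass the previous conversation_id to the pipeline, and call update() once the interaction finishes.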

So, is there a way to implement this within StreamAssist or would this functionality need to be configured elsewhere? I'm by no means an experienced developer but I may know enough to help with figuring some of this out!

vash2695 avatar Jun 20 '24 16:06 vash2695

I've been working on implementing this functionality in a fork of this project and I've made some progress! I'm now stuck on finding a way to know exactly when TTS playback ends so that follow-up interactions don't get triggered early. Feel free to use whatever you'd like from my fork, but be aware that I'm pretty new to all this, so there may be mistakes!

https://github.com/vash2695/StreamAssistCC/

vash2695 avatar Jul 15 '24 19:07 vash2695

So I found a way to introduce an accurate TTS duration estimate, but implementing it requires the internal_event_callback to be asynchronous. This wouldn't be a big deal except for the fact that it also breaks cancellation phrases and the ability to make the conversation id persistent between interactions 🥲 Making progress one little bit at a time!
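For what it's worth, a common Home Assistant pattern that might sidestep the async requirement is to keep the callback synchronous and hand the awaitable work to the event loop with hass.async_create_task. A minimal sketch, assuming a hypothetical handle_tts_end coroutine (this is not code from the fork):

```python
from homeassistant.components.assist_pipeline import PipelineEventType
from homeassistant.core import HomeAssistant


def make_internal_event_callback(hass: HomeAssistant):
    """Build a synchronous pipeline event callback that defers async work."""

    def internal_event_callback(event) -> None:
        if event.type == PipelineEventType.TTS_END:
            # Schedule the coroutine on the event loop instead of awaiting here,
            # so cancellation phrases and the persistent conversation_id logic
            # can stay synchronous.
            hass.async_create_task(handle_tts_end(hass, event))

    return internal_event_callback


async def handle_tts_end(hass: HomeAssistant, event) -> None:
    # Hypothetical follow-up: e.g. read the cached TTS URL from event.data and
    # estimate its playback duration (see the get_tts_duration snippet below).
    ...
```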

vash2695 avatar Jul 19 '24 14:07 vash2695

@vash2695 I'm interested in the TTS duration estimate. Can you share this bit or point me to it? I have a project (View Assist) that displays visual Assist responses and then, after a timeout, returns to the 'home' page. I'm finding that long TTS responses send the user back home before the TTS ends. Not the best look, and this could help. Thanks.

dinki avatar Aug 16 '24 13:08 dinki

Hey @dinki! Sure thing, the code that performs this is in the Core `__init__.py` file:

First, mutagen is what performs the estimation, so you'll need to import it: `from mutagen.mp3 import MP3` (this and the other imports the function relies on are shown at the top of the snippet below).

Then, this is the function that performs the estimation using the cached TTS audio:

```python
import io
import logging

from homeassistant.core import HomeAssistant
from homeassistant.helpers.aiohttp_client import async_get_clientsession
from homeassistant.helpers.network import get_url
from mutagen.mp3 import MP3

_LOGGER = logging.getLogger(__name__)


async def get_tts_duration(hass: HomeAssistant, tts_url: str) -> float:
    """Fetch the cached TTS audio and return its duration in seconds."""
    try:
        # Ensure we have the full URL
        if tts_url.startswith("/"):
            base_url = get_url(hass)
            full_url = f"{base_url}{tts_url}"
        else:
            full_url = tts_url

        # Use Home Assistant's shared aiohttp client session
        session = async_get_clientsession(hass)
        async with session.get(full_url) as response:
            if response.status != 200:
                _LOGGER.error(f"Failed to fetch TTS audio: HTTP {response.status}")
                return 0

            content = await response.read()

        # Use mutagen to read the MP3 header and get the duration
        audio = MP3(io.BytesIO(content))
        duration = audio.info.length

        return duration

    except Exception as e:
        _LOGGER.error(f"Error getting TTS duration: {e}")
        return 0
```

And finally, you would need to call this function in any async function that executes after the TTS audio has been received:

```python
import asyncio

from homeassistant.components.assist_pipeline import PipelineEventType


async def your_function(hass, tts_url, events):
    duration = await get_tts_duration(hass, tts_url)
    events[PipelineEventType.TTS_END]["data"]["tts_duration"] = duration
    _LOGGER.debug(f"Stored TTS duration: {duration} seconds")
    # Set a timer to perform additional actions based on the calculated duration plus a small delay
    await asyncio.sleep(duration)
    await asyncio.sleep(1)  # Additional small delay
```

I've confirmed that this works for both cloud and local TTS integrations, so it should be almost universally applicable.

I'm still figuring things out with Python, but I'm confident that there are many different potential use cases for this; for example, you could store that duration and pass it to an automation as a variable. Unfortunately my time to work on this is quite limited, but I'm hoping to get more help by posting on the community forums.
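As a rough sketch of that idea (not code from the fork; the event type stream_assist_tts_duration is invented for illustration), the duration could be published on the Home Assistant event bus so an automation can pick it up:

```python
# Sketch: publish the estimated TTS duration on the event bus so automations
# can react to it. "stream_assist_tts_duration" is an invented event type.
duration = await get_tts_duration(hass, tts_url)
hass.bus.async_fire(
    "stream_assist_tts_duration",
    {"duration": duration, "tts_url": tts_url},
)
```

An automation with an event trigger on that type could then read the value from trigger.event.data.duration.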

vash2695 avatar Aug 16 '24 15:08 vash2695

Thanks for sharing!

> I'm still figuring things out with Python, but I'm confident that there are many different potential use cases for this; for example, you could store that duration and pass it to an automation as a variable. Unfortunately my time to work on this is quite limited, but I'm hoping to get more help by posting on the community forums.

Yes, this is the use I would have for it. Knowing how long the TTS response is would be extremely helpful for a few different options I have in mind.

dinki avatar Aug 16 '24 18:08 dinki

According to this video, Stream Assist should just always use the same conversation ID, and it will then automatically support conversations.

IoSonoAndreaZ avatar Feb 14 '25 08:02 IoSonoAndreaZ

Yes, that's right. I tested with conversation_id: 12345678 and the LLM gets the conversation history.
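For reference, the same behaviour can be exercised outside Stream Assist with the built-in conversation.process service by passing a fixed conversation_id; a minimal sketch (the id and text values are just examples):

```python
# Sketch: call the conversation agent twice with the same conversation_id so the
# second request sees the history of the first. Values are examples only.
await hass.services.async_call(
    "conversation",
    "process",
    {"text": "Turn on the kitchen lights", "conversation_id": "12345678"},
    blocking=True,
)
await hass.services.async_call(
    "conversation",
    "process",
    {"text": "Now turn them off again", "conversation_id": "12345678"},
    blocking=True,
)
```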

relust avatar Feb 14 '25 08:02 relust

Check this https://github.com/AlexxIT/StreamAssist/issues/66

AlexxIT avatar Apr 24 '25 19:04 AlexxIT

Wow! Really cool. Thanks for the effort you put into all of this!

dinki avatar Apr 24 '25 19:04 dinki