StreamAssist
Continued Conversation with Context?
Loving this integration so far! I'm not sure if this is the right link in my particular chain, but I'm looking for some way of extending some of the capabilities of the voice assistant functionality and this seems like a good place to start!
For starters, here's what I'm using in my pipeline: Openwakeword > HASS cloud STT > OpenAI Extended > Elevenlabs TTS
This setup is fast(ish) and smart, but losing the thread on each wake is highly inefficient and limits functionality in many ways. Since we know the chat interface can maintain a conversation within a single session, I'd like to know if it's possible to do the same with voice interaction. I think this would involve two parts (rough sketch after the list):
- Add the option to bypass the wake word on subsequent interactions after an initial activation
  a. Stops if no text is detected or specific phrases like "Thank you" or "Nevermind" are used
- After an interaction has stopped, the conversation thread should be kept open for a certain amount of time to maintain continuity and avoid the need to resend entity information if another activation takes place within that window
  a. Doesn't need to be long, maybe a minute or so after the last interaction ends
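To make the second part more concrete, here's a rough sketch of how the follow-up window might be tracked. None of these names come from StreamAssist itself; they're just placeholders for the idea:

import time

FOLLOW_UP_WINDOW = 60  # seconds to keep the conversation thread open

class ConversationWindow:
    """Tracks whether a new interaction may skip the wake word."""

    CANCEL_PHRASES = {"thank you", "nevermind"}

    def __init__(self) -> None:
        self.last_interaction = 0.0
        self.conversation_id: str | None = None

    def mark_interaction(self, conversation_id: str) -> None:
        # Remember the conversation and when it last saw activity
        self.conversation_id = conversation_id
        self.last_interaction = time.monotonic()

    def is_open(self) -> bool:
        # While the window is open, skip the wake word and reuse the id
        return time.monotonic() - self.last_interaction < FOLLOW_UP_WINDOW

    def should_stop(self, text: str) -> bool:
        # End the session on empty input or an explicit stop phrase
        cleaned = text.strip().lower()
        return not cleaned or cleaned in self.CANCEL_PHRASES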
So, is there a way to implement this within StreamAssist or would this functionality need to be configured elsewhere? I'm by no means an experienced developer but I may know enough to help with figuring some of this out!
I've been working on implementing this functionality in a fork of this project and I've made some progress! I'm now stuck on finding a way to know exactly when TTS playback ends so that follow-up interactions don't get triggered early. Feel free to use whatever you'd like from my fork, but be aware that I'm pretty new to all this, so there may be mistakes!
https://github.com/vash2695/StreamAssistCC/
So I found a way to introduce an accurate TTS duration estimate, but implementing it requires internal_event_callback to be asynchronous. This wouldn't be a big deal, except that it also breaks cancellation phrases and the ability to keep the conversation id persistent between interactions 🥲 Making progress one little bit at a time!
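For reference, one idea for sidestepping the async requirement (untested, just a sketch) would be to keep internal_event_callback synchronous and hand the async work off to the event loop instead; handle_tts_end here is a hypothetical helper, not existing code:

from homeassistant.components.assist_pipeline import PipelineEvent, PipelineEventType

def internal_event_callback(event: PipelineEvent) -> None:
    # Sketch: staying synchronous should keep cancellation phrases and the
    # persistent conversation id working; schedule the async duration logic
    # as a task instead. Must be called from the event loop thread.
    if event.type == PipelineEventType.TTS_END:
        hass.async_create_task(handle_tts_end(event))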
@vash2695 I'm interested in the TTS duration estimate. Can you share this bit or point me to it? I have a project (View Assist) that displays visual Assist responses and then a timeout goes back to the 'home' page. I am finding long TTS responses send the user back home before the TTS ends. Not the best look and this could help. Thanks.
Hey @dinki! Sure thing, the code that performs this is in the core __init__.py file:
First, mutagen is what is used to perform this, so you'll need to import it, along with the other helpers the function below relies on:

import asyncio
import io
import logging

from mutagen.mp3 import MP3

from homeassistant.components.assist_pipeline import PipelineEventType
from homeassistant.core import HomeAssistant
from homeassistant.helpers.aiohttp_client import async_get_clientsession
from homeassistant.helpers.network import get_url

_LOGGER = logging.getLogger(__name__)
Then, this is the function that performs the estimation using the cached TTS audio:
async def get_tts_duration(hass: HomeAssistant, tts_url: str) -> float:
    try:
        # Ensure we have the full URL
        if tts_url.startswith("/"):
            base_url = get_url(hass)
            full_url = f"{base_url}{tts_url}"
        else:
            full_url = tts_url

        # Use Home Assistant's shared aiohttp client session
        session = async_get_clientsession(hass)
        async with session.get(full_url) as response:
            if response.status != 200:
                _LOGGER.error(f"Failed to fetch TTS audio: HTTP {response.status}")
                return 0
            content = await response.read()

        # Use mutagen to read the duration from the MP3 header
        audio = MP3(io.BytesIO(content))
        duration = audio.info.length
        return duration
    except Exception as e:
        _LOGGER.error(f"Error getting TTS duration: {e}")
        return 0
And finally, you would need to call this function in any async function that executes after the TTS audio has been received:
async def your_function():
    duration = await get_tts_duration(hass, tts_url)
    events[PipelineEventType.TTS_END]["data"]["tts_duration"] = duration
    _LOGGER.debug(f"Stored TTS duration: {duration} seconds")

    # Wait out the playback based on the calculated duration, plus a small buffer
    await asyncio.sleep(duration)
    await asyncio.sleep(1)  # Additional small delay
I've confirmed that this works for both cloud and local TTS integrations, so it should be almost universally applicable.
I'm still figuring things out with Python, but I'm confident there are many potential use cases for this; for example, you could store that duration and pass it to an automation as a variable (rough sketch below). Unfortunately, my time to work on this is quite limited, but I'm hoping to get more help by posting on the community forums.
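If it helps, one way to hand the duration to an automation would be to fire a custom event; stream_assist_tts_duration is just a made-up event name here, not something the integration provides:

# Hypothetical sketch: fire a custom event carrying the measured duration.
# An automation can trigger on this event and read
# trigger.event.data.duration as a variable.
duration = await get_tts_duration(hass, tts_url)
hass.bus.async_fire(
    "stream_assist_tts_duration",
    {"duration": duration, "tts_url": tts_url},
)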
Thanks for sharing!
Yes, this is exactly the use I would have for it. Knowing how long the TTS response is would be extremely helpful for a few different things I want to do.
According to this video, StreamAssist should just always use the same conversation ID, and it will then automatically support conversations.
Yes, that's right. I tested with conversation_id: 12345678 and the LLM gets the conversation history.
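For anyone who wants to test this quickly outside of StreamAssist, here's a minimal sketch using the built-in conversation.process service; this assumes your Home Assistant version accepts conversation_id on that service:

# Minimal sketch: reuse a fixed conversation_id so the agent keeps the
# conversation history between calls. Assumes conversation.process
# accepts conversation_id in your HA version.
await hass.services.async_call(
    "conversation",
    "process",
    {
        "text": "What did I just ask you?",
        "conversation_id": "12345678",  # same id -> same conversation thread
    },
    blocking=True,
)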
Check this https://github.com/AlexxIT/StreamAssist/issues/66
Wow! Really cool. Thanks for the effort you put into all of this!