[Main UI] Voice support
Hello,
I want to add a client to the UI for the websocket protocol added by this PR https://github.com/openhab/openhab-core/pull/4032, to use the voice features from the UI.
I upgraded Webpack to v5 in PR #2267 due to its improved asset handling, and on top of it I have created a functional POC that I will publish as a WIP PR so you can take a look at the added code.
The design highlights:
I have a class AudioMain, which is responsible for:
- Initializing a web worker (AudioWorker) that holds the persistent websocket connection.
- Setting up AudioWorklets and transferring their MessagePorts to the AudioWorker.
- Notifying the UI about the connection state, source state and sink state.
- Enabling the client keyword spotting functionality and propagating detections to the worker.
The AudioWorker is then in charge of:
- Creating a persistent connection to the audio-pcm websocket.
- Instructing AudioMain to set up the source or sinks and to transfer their MessagePorts to the AudioWorker.
- Transferring the audio from the WebSocket to the appropriate MessagePort, or vice versa, re-encoding and resampling the audio data as needed.
- Propagating connection state changes to AudioMain.
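The re-encode/resample step could be sketched roughly as below (this is an illustration, not the PR's actual code; it assumes the websocket carries 16-bit signed PCM while the browser captures Float32 samples at the AudioContext rate, and both function names are made up):

```javascript
// Linear-interpolation resampler: Float32Array in -> Float32Array out.
// (Hypothetical helper, illustrating the worker's resample step.)
function resample (input, fromRate, toRate) {
  if (fromRate === toRate) return input
  const ratio = fromRate / toRate
  const outLength = Math.round(input.length / ratio)
  const output = new Float32Array(outLength)
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio
    const left = Math.floor(pos)
    const right = Math.min(left + 1, input.length - 1)
    const frac = pos - left
    // Interpolate between the two nearest source samples
    output[i] = input[left] * (1 - frac) + input[right] * frac
  }
  return output
}

// Re-encode Float32 samples in [-1, 1] as 16-bit signed PCM for the socket.
function floatTo16BitPCM (input) {
  const output = new Int16Array(input.length)
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]))
    output[i] = s < 0 ? s * 0x8000 : s * 0x7FFF
  }
  return output
}
```

Doing this inside the worker keeps the main thread free of per-buffer work; the AudioWorklet only has to hand Float32 buffers over its MessagePort.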
I need a lot of feedback.
Initially I have placed the mentioned code under src/js/voice and added the options to src/components/theme-switcher.vue. I have also created a src/components/dialog-mixin.js for src/components/app.js, and from there the initialization is done when the option is enabled.
I would like to add some kind of icon that informs the user about the recording state and can also be used to trigger the dialog by clicking on it, but I don't know where to place it or what kind of design I should use.
I will open the WIP PR when I have a moment and mention this issue, because I think it will be easier to comment on what has been done.
Best Regards
POC mobile video: https://github.com/openhab/openhab-webui/assets/9007708/987e3fac-2ea8-43ae-98f3-83e646add580
Initially I have placed the mentioned code under src/js/voice and added the options to src/components/theme-switcher.vue. I have also created a src/components/dialog-mixin.js for src/components/app.js, and from there the initialization is done when the option is enabled.
That sounds good! I would just consider renaming the theme switcher as it is doing much more than only switching themes now …
I would like to add some kind of icon that informs the user about the recording state and can also be used to trigger the dialog by clicking on it, but I don't know where to place it or what kind of design I should use.
IIRC the bottom left corner of the UI is free, so a floating action button (FAB) could be added there to manually trigger the dialog.
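A rough sketch of such a FAB, assuming the Framework7-Vue `f7-fab` component with its `position` prop and the F7 Icons `mic_fill` glyph (the component shape, `toggleDialog` and the `recording` flag are all hypothetical names, not code from the PR):

```javascript
// Hypothetical Vue component: a bottom-left FAB that shows the recording
// state and triggers the voice dialog on click.
const VoiceFab = {
  template: `
    <f7-fab position="left-bottom" :class="{ recording: recording }" @click="toggleDialog">
      <f7-icon f7="mic_fill"></f7-icon>
    </f7-fab>
  `,
  data () {
    return { recording: false }
  },
  methods: {
    toggleDialog () {
      // Placeholder: the real handler would notify AudioMain to
      // start or stop the dialog and update the state from its events.
      this.recording = !this.recording
    }
  }
}
```

The `recording` class could then drive a pulsing animation so the user can see at a glance that the microphone is live.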
BTW, as a first step and as long as the core PR is waiting for review, I can imagine having the mentioned FAB which opens a popup containing a chat-like UI, where one could chat instead of talk to openHAB. @GiviMAD WDYT? If all required core APIs for this are in place I would be happy to accept a PR and review it relatively quickly.
Hey @florian-h05, thank you for the response. I haven't managed to find the time to look at this recently; I need to review its state.
I like the idea of having a popup with a chat UI, but I think it is better to make some changes to the dialog processor and the interpreter interface so it can keep the conversation for the session and pass it on each execution of the interpreter. That way it will be relatively easy to have ChatGPT or Ollama interpreters that can be chained to the chat UI or the voice system.
Let me know if you think the idea makes sense. I think I can use the related core PR to introduce those changes.
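Sketched in JavaScript for brevity (the real change would live in core's voice bundle, in Java, and every name here is hypothetical), the idea of a session that keeps the conversation and hands it to the interpreter on each execution could look like:

```javascript
// Hypothetical dialog session: keeps the conversation for the session
// and passes it to the interpreter on every execution.
class DialogSession {
  constructor (interpreter, maxTurns = 10) {
    this.interpreter = interpreter
    this.maxTurns = maxTurns
    this.history = []
  }

  // Works the same whether `text` came from STT or was typed in a chat UI.
  async interpret (text) {
    this.history.push({ role: 'user', content: text })
    const reply = await this.interpreter.interpret(text, this.history)
    this.history.push({ role: 'assistant', content: reply })
    // Keep only the most recent turns so the prompt stays bounded.
    const max = this.maxTurns * 2
    if (this.history.length > max) this.history = this.history.slice(-max)
    return reply
  }
}
```

Because the interpreter always receives the running history, a ChatGPT or Ollama interpreter could be dropped in behind the same interface and still hold a coherent conversation.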
Hi @GiviMAD,
thanks for your response, and happy new year!
I have imagined having something like https://framework7.io/docs/messages, where you can chat with "openHAB" (in reality it will be an LLM like GPT) and which is able to show the default card widget for an Item if appropriate (in addition to answering in text), so that it can answer with an Item state or display the Item state as confirmation of an action. The enhanced ChatGPT binding already provides the functionality to control Items through HLI.
To integrate more LLM providers the same way, the code to get the Items and command them (tool/function calling) should probably be migrated to core to be shared across multiple LLM providers. New core APIs need the ability to have an HLI not only answer with text, but also provide additional information in a standardised format, such as which Item state to display.
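A rough sketch of what such a shared, provider-neutral function definition could look like, with a mapper to OpenAI's `tools` wire format (the internal shape and the `send_command` function are made up for illustration; the outer `{ type: 'function', function: { ... } }` shape follows OpenAI's chat completions API, which Ollama also accepts):

```javascript
// Hypothetical provider-neutral function definition that core could own.
const itemControlFunction = {
  name: 'send_command',
  description: 'Send a command to an openHAB Item',
  parameters: {
    item: { type: 'string', description: 'Name of the Item' },
    command: { type: 'string', description: 'Command to send, e.g. ON, OFF, 50' }
  },
  required: ['item', 'command']
}

// Map the internal definition to the OpenAI `tools` schema; other
// providers would get their own small mapper against the same input.
function toOpenAiTool (fn) {
  return {
    type: 'function',
    function: {
      name: fn.name,
      description: fn.description,
      parameters: {
        type: 'object',
        properties: fn.parameters,
        required: fn.required
      }
    }
  }
}
```

With the definitions in core, each LLM add-on would only ship a thin mapper plus the HTTP client for its provider.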
Wrt session handling: if the chat functionality uses WS, the server can keep the session, but I'm not sure this is the best way. Once the client loses the WS connection, the server cleans up the session and the chat history is lost. I would rather store the chat history on the client in local browser storage; the server can still take care of the chat memory, i.e. passing the last two or three messages to the LLM. Ultimately the question is whether to implement this new chat functionality based on WS or REST. I would have chosen REST and extended the existing endpoints.
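The client-side part could be as simple as the sketch below (the storage backend is injectable so it can be unit-tested; in the browser you would pass `window.localStorage`, and the storage key is a made-up name):

```javascript
// Hypothetical client-side chat history, persisted in browser storage so
// it survives reconnects and reloads independently of any WS session.
class ChatHistory {
  constructor (storage, key = 'openhab.ui.chatHistory', maxMessages = 50) {
    this.storage = storage
    this.key = key
    this.maxMessages = maxMessages
  }

  load () {
    return JSON.parse(this.storage.getItem(this.key) || '[]')
  }

  append (message) {
    // Drop the oldest entries so the stored history stays bounded.
    const messages = this.load().concat([message]).slice(-this.maxMessages)
    this.storage.setItem(this.key, JSON.stringify(messages))
    return messages
  }

  // Only the last few messages are sent along as context for the LLM.
  contextWindow (n = 3) {
    return this.load().slice(-n)
  }
}
```

Each REST request would then carry `contextWindow()` as the chat memory, so the server stays stateless between calls.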
I have imagined having something like https://framework7.io/docs/messages, where you can chat with "openHAB" (in reality it will be an LLM like GPT) and which is able to show the default card widget for an Item if appropriate (in addition to answering in text), so that it can answer with an Item state or display the Item state as confirmation of an action. The enhanced ChatGPT binding already provides the functionality to control Items through HLI.
Isn't this just HABot with a smarter LLM instead of the relatively limited model HABot uses? Maybe there's a good bit of reuse that can be pulled from there.
UI-wise, yes. But I haven't planned to make use of the semantic model (though it wouldn't be complicated to inject some of the semantics into the prompt), and it works very differently from HABot, as the logic is mostly done by the LLM using function/tool calling.
I haven't really checked out the HABot code, but HABot is built with Quasar (Vue) whereas Main UI is built with Framework7 (Vue). Wrt the backend, my assessment is that large parts of the code are already in place inside the ChatGPT binding; what's missing is to pull the function calling code into core so other LLM add-ons can use it as well, and to extend the REST API and core APIs accordingly. I have to check, but it might already be the case that I played around with this a few weeks/months ago and have some code lying around.
Hey @florian-h05, happy new year to you too!
Can we create a separate issue to chat about that idea? We can add the enhanced ChatGPT binding there, because, as far as I understand, we should be able to have an internal representation for the function call definitions and the conversation history and expose them to OpenAI or a different provider. I have seen other tools that allow switching between OpenAI and Ollama, for example.
I prefer moving it to another issue and leaving this one as a basic consumer for the new websocket, in case I can close it sooner.
Wrt session handling: if the chat functionality uses WS, the server can keep the session, but I'm not sure this is the best way. Once the client loses the WS connection, the server cleans up the session and the chat history is lost. I would rather store the chat history on the client in local browser storage; the server can still take care of the chat memory, i.e. passing the last two or three messages to the LLM. Ultimately the question is whether to implement this new chat functionality based on WS or REST. I would have chosen REST and extended the existing endpoints.
I'm sorry, I explained myself poorly; I wasn't talking about the content of these PRs, but about how the dialog is managed right now in the core. Right now we have the DialogProcessor in the core voice bundle managing the dialog interaction, and it seems like something we can transform to connect both things. For example, I like the idea of being able to watch the conversation you are having with a speaker (or with the UI using this PR) displayed in that chat component, and being able to continue by text if I want. It seems that if we modify the dialog processor to persist a temporal history and to accept text instead of just voice, we can merge both functionalities and have something cool without creating too many new things; at least it seems like something worth exploring. Let me know what you think if you have a chance to take a look.
Can we create a separate issue to chat about that idea?
Sure, we went a bit off-topic here, see #2995. I answered to you there.