feature-requests
Voice Assistant: allow audio feedback tones on events
Describe the problem you have / What new integration you would like:
It would be nice if the voice assistant could beep when the wake word is detected (for example). Maybe also on other states.
Please describe your use case for this integration and alternatives you've tried:
Currently, I try to intercept the on_wake_word_detected trigger and tell it to play a sound (see below), but it doesn't work that way. ESPHome also doesn't wait for the on_... actions to finish, even if I add something like a delay; all the subsequent events fire milliseconds later. Lighting up LEDs works, though, as does media_player.play_media outside of the voice_assistant events.
Additional context
i2s_audio:
  [...]

media_player:
  [...]

microphone:
  [...]

voice_assistant:
  on_wake_word_detected:
    - media_player.play_media:
        id: wm8978_audio
        media_url: 'http://192.168.1.21:8123/local/sounds/receive.wav'
That would be extremely useful. Today I have to look at the listening light to make sure I'm not talking to myself.
I guess it is really important to have feedback from the voice assistant after wake word detection.
@h3ndrik I have no hardware for testing at this moment. Can you try something like this?
voice_assistant:
  on_wake_word_detected:
    - homeassistant.service:
        service: media_player.play_media
        data:
          entity_id: media_player.any_media_player_in_ha
          media_url: 'http://192.168.1.21:8123/local/sounds/receive.wav'
          media_content_type: music
          announce: "true"
Additional info: you need to go to the ESPHome integration in HA, "configure" your device, and allow it to execute services.

I've tried the following:
on_wake_word_detected:
  - lambda: ESP_LOGD("voice_assistant", "TRIGGER on_wake_word_detected");
  - homeassistant.service:
      service: media_player.play_media
      data:
        #entity_id: media_player.any_media_player_in_ha
        entity_id: media_player.taudio_taudio_i2saudio
        media_url: 'http://192.168.1.21:8123/local/sounds/receive.wav'
        media_content_type: music
        announce: "true"
[13:49:59][D][voice_assistant:240]: VAD detected speech
[13:49:59][D][voice_assistant:438]: State changed from WAITING_FOR_VAD to START_PIPELINE
[13:49:59][D][voice_assistant:444]: Desired state set to STREAMING_MICROPHONE
[13:49:59][D][voice_assistant:256]: Requesting start...
[13:49:59][D][voice_assistant:438]: State changed from START_PIPELINE to STARTING_PIPELINE
[13:49:59][D][voice_assistant:459]: Client started, streaming microphone
[13:49:59][D][voice_assistant:438]: State changed from STARTING_PIPELINE to STREAMING_MICROPHONE
[13:49:59][D][voice_assistant:444]: Desired state set to STREAMING_MICROPHONE
[13:49:59][D][voice_assistant:542]: Event Type: 1
[13:49:59][D][voice_assistant:545]: Assist Pipeline running
[13:49:59][D][voice_assistant:465]: TRIGGER on_start
[13:49:59][D][voice_assistant:542]: Event Type: 9
[13:50:02][D][voice_assistant:542]: Event Type: 10
[13:50:02][D][voice_assistant:551]: Wake word detected
[13:50:02][D][voice_assistant:489]: TRIGGER on_wake_word_detected
[13:50:02][D][voice_assistant:542]: Event Type: 3
[13:50:02][D][voice_assistant:556]: STT started
[13:50:02][D][voice_assistant:467]: TRIGGER on_listening
[13:50:02][D][light:036]: 'NeoPixel Light' Setting:
[13:50:02][D][light:051]: Brightness: 50%
[13:50:02][D][light:059]: Red: 0%, Green: 0%, Blue: 100%
[13:50:02][D][light:109]: Effect: 'Slow Pulse'
[13:50:02][D][media_player:059]: 'TAudio I2SAudio' - Setting
[13:50:02][D][media_player:063]: Command: STOP
[13:50:04][D][voice_assistant:542]: Event Type: 11
[13:50:04][D][voice_assistant:680]: Starting STT by VAD
[13:50:04][D][voice_assistant:478]: TRIGGER on_stt_vad_start
[13:50:06][D][voice_assistant:542]: Event Type: 12
[13:50:06][D][voice_assistant:684]: STT by VAD end
[13:50:06][D][voice_assistant:480]: TRIGGER on_stt_vad_end
[13:50:06][D][light:036]: 'NeoPixel Light' Setting:
[13:50:06][D][light:051]: Brightness: 50%
[13:50:06][D][light:059]: Red: 0%, Green: 0%, Blue: 100%
[13:50:06][D][light:109]: Effect: 'Fast Pulse'
[13:50:07][D][voice_assistant:542]: Event Type: 4
[13:50:07][D][voice_assistant:438]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[13:50:07][D][voice_assistant:444]: Desired state set to AWAITING_RESPONSE
[13:50:07][D][voice_assistant:571]: Speech recognised as: "das Wohnzimmerlicht ein"
[13:50:07][D][voice_assistant:511]: TRIGGER on_stt_end
[13:50:07][D][voice_assistant:438]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[13:50:07][D][voice_assistant:542]: Event Type: 5
[13:50:07][D][voice_assistant:576]: Intent started
[13:50:07][D][voice_assistant:542]: TRIGGER on_intent_start
[13:50:07][D][voice_assistant:438]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[13:50:07][D][voice_assistant:542]: Event Type: 6
[13:50:07][D][voice_assistant:544]: TRIGGER on_intent_end
[13:50:07][D][voice_assistant:542]: Event Type: 7
[13:50:07][D][voice_assistant:599]: Response: "wohnzimmerlicht eingeschaltet"
[13:50:07][D][voice_assistant:546]: TRIGGER on_tts_start
[13:50:07][D][light:036]: 'NeoPixel Light' Setting:
[13:50:07][D][light:051]: Brightness: 50%
[13:50:07][D][light:059]: Red: 0%, Green: 100%, Blue: 0%
[13:50:07][D][light:109]: Effect: 'None'
[13:50:07][D][voice_assistant:542]: Event Type: 8
[13:50:07][D][voice_assistant:617]: Response URL: "http://192.168.1.21:8123/api/tts_proxy/6d14c05c27e38b28db1144b87cf987e0ce862b8b_de-de_60afa496b6_tts.piper.raw"
[13:50:07][D][media_player:059]: 'TAudio I2SAudio' - Setting
[13:50:07][D][media_player:066]: Media URL: http://192.168.1.21:8123/api/tts_proxy/6d14c05c27e38b28db1144b87cf987e0ce862b8b_de-de_60afa496b6_tts.piper.raw
[13:50:07][D][voice_assistant:438]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[13:50:07][D][voice_assistant:444]: Desired state set to STREAMING_RESPONSE
[13:50:07][D][voice_assistant:568]: TRIGGER on_tts_end
[13:50:07][W][component:214]: Component api took a long time for an operation (0.06 s).
[13:50:07][W][component:215]: Components should block for at most 20-30ms.
[13:50:08][W][component:214]: Component i2s_audio.media_player took a long time for an operation (0.53 s).
[13:50:08][W][component:215]: Components should block for at most 20-30ms.
[13:50:08][D][voice_assistant:542]: Event Type: 2
[13:50:08][D][voice_assistant:629]: Assist Pipeline ended
[13:50:08][D][voice_assistant:586]: TRIGGER on_end
[13:50:11][W][component:214]: Component i2s_audio.media_player took a long time for an operation (0.51 s).
[13:50:11][W][component:215]: Components should block for at most 20-30ms.
[13:50:11][W][component:214]: Component i2s_audio.media_player took a long time for an operation (0.50 s).
[13:50:11][W][component:215]: Components should block for at most 20-30ms.
[13:50:11][D][light:036]: 'NeoPixel Light' Setting:
[13:50:11][D][light:051]: Brightness: 40%
[13:50:11][D][light:059]: Red: 100%, Green: 89%, Blue: 71%
[13:50:13][D][voice_assistant:438]: State changed from STREAMING_RESPONSE to IDLE
[13:50:13][D][voice_assistant:444]: Desired state set to IDLE
It runs my ESP_LOGD but the service call gets lost.
@h3ndrik thanks a lot for the detailed answer. It looks like the ESPHome software must be changed for this purpose. Somewhere between TRIGGER on_wake_word_detected and TRIGGER on_listening, a confirmation event must be added, with the possibility of output to the speaker or a media player.
[...] looks like that ESPHome software must be changed for this purpose [...]
Yeah, I think so, too. I don't have a good idea of how to implement it. It seems that as of now the voice_assistant just uses Component::defer() to schedule the task and moves on with the state machine logic.
I'm not sure if waiting/blocking has downsides for other users. But maybe we should come up with a solution that works with all the triggers. I could imagine another (short) beep being useful once it stops listening. And I'd maybe like to replace the whole returned TTS answer with a chime. But that last one isn't necessarily a valid use case.
It'd definitely be nice to stay within one mode of communication (audio) and not also have to look at a status LED somewhere to see whether it heard you and which state it is in.
@h3ndrik I'm not familiar with C++, but based on my experience in other programming languages, after reviewing this code: https://github.com/esphome/esphome/blob/dev/esphome/components/voice_assistant/voice_assistant.cpp
I think it will be enough to add these lines after line 538. We also need to define the url value for the desired confirmation sound.
#ifdef USE_MEDIA_PLAYER
  if (this->media_player_ != nullptr) {
    this->media_player_->make_call().set_media_url(url).perform();
  }
#endif
I think it will be enough to add these lines after line 538
That perform() seems to be an async call. It shows up in the debug log, but several other things happen within the same second, and I think it gets overwritten or something like that. In any case, nothing gets played in the end, like with the other approach.
And I don't think this is the right place. void VoiceAssistant::on_event(const api::VoiceAssistantEventResponse &msg) seems to handle the events from Home Assistant, and strictly speaking, playing the tone isn't something HA told us to do. It's something from within the trigger, where the url would be specified, too.
Maybe we need to wait for the triggered user actions to finish before the state machine transitions to the next state. I believe I tried some 'wait_for' in the YAML, but the voice_assistant seems to prefer listening to the Home Assistant events over doing the wait I've told it to do.
@h3ndrik I agree that perform() seems to be an async call, but we need to set the url for the media player in any case. I have analyzed the code once again and have another version for testing ) How about these lines after line 538?
#ifdef USE_MEDIA_PLAYER
  bool playing = false;
  if (this->media_player_ != nullptr) {
    this->media_player_->make_call().set_media_url("http://192.168.5.105:8123/local/sounds/confirm.mp3").perform();
    playing = (this->media_player_->state == media_player::MediaPlayerState::MEDIA_PLAYER_STATE_PLAYING);
  }
  if (playing) {
    this->set_timeout("playing", 2000, [this]() {
      this->cancel_timeout("speaker-timeout");
    });
  }
#endif
Can you guys create a pull request for it? I need this feature :)))
@kizovinh first of all, any pull request MUST be a working solution ) At this moment we are still trying to find it.
I knew that; I just want to inspire you guys that this feature is very useful. Keep going 👍
I have another idea for working around it: using the "assist in progress" sensor to detect a pipeline start and play a media stream. Let me try.
The result is the same: the speaker will play only after the pipeline ends.
Finally, I attached a small RTTTL buzzer to my ESP32, which uses the Arduino framework and media_player instead of speaker. In the configuration it looks like this:
output:
  - platform: ledc
    id: buzzer_output
    pin: GPIO13
    frequency: 2000 Hz

rtttl:
  output: buzzer_output
  id: my_rtttl

voice_assistant:
  on_wake_word_detected:
    - rtttl.play: "two_short:d=4,o=5,b=100:16e6,16e6"
and it works like a charm.
Yep, I think any general GPIO will work. For me, I use an SG90 servo: when the wake word is detected, I make the servo rotate, and it produces a small audible sound. But a real media player would still give better interaction.
I was just going down the path of using speaker.play to hardcode a feedback sound in the YAML (to avoid waiting on #2429), but it has the same problem where the playback gets buffered until the assistant stops its loop.
Did you know the RTTTL component can now also play sounds using the speaker component?
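For reference, a minimal sketch of that setup, assuming your device already defines a speaker component (the ids, platform, and melody string here are placeholders, not from the original posts; check the current rtttl docs for the exact options):

```yaml
speaker:
  - platform: i2s_audio
    id: my_speaker
    # [...] pins and other hardware-specific options

rtttl:
  speaker: my_speaker  # play tones through the speaker instead of a PWM output
  id: my_rtttl

voice_assistant:
  on_wake_word_detected:
    - rtttl.play: "two_short:d=4,o=5,b=100:16e6,16e6"
```

This avoids the extra buzzer hardware, but as noted below, audio-path components contend with the microphone, so the dedicated-output variant may behave more reliably.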
The problem is that the microphone continues to listen at the moment when the confirmation sound needs to be played. I added a couple of lines to voice_assistant.cpp to stop the microphone during this period. The overall behavior became better, and the confirmation sound is no longer perceived as the beginning of a command. The system works stably, but only if the confirmation sound is output through the RTTTL component. All attempts to output the confirmation sound through the media_player, speaker, or speaker-as-RTTTL components were very unstable.
I currently use two versions of the voice assistant with confirmation sounds. The first outputs them to an external media player (as with STT), the second to a separate RTTTL component.
Yeah, that is the issue indeed. I hope @jesserockz can find a way to auto-switch between the two modes.
My changes in voice_assistant.cpp; hope it can help:
case api::enums::VOICE_ASSISTANT_WAKE_WORD_END: {
  ESP_LOGD(TAG, "Wake word detected");
  // stop microphone for confirmation signal
  this->mic_->stop();
  this->set_state_(State::STOPPING_MICROPHONE);
  this->defer([this]() { this->wake_word_detected_trigger_->trigger(); });
  break;
}
case api::enums::VOICE_ASSISTANT_STT_START:
  // delay starting the microphone so the confirmation signal can finish
  this->set_timeout("pause", 800, [this]() {
    this->cancel_timeout("pause");
    this->set_state_(State::START_MICROPHONE);
  });
  ESP_LOGD(TAG, "STT started");
  this->defer([this]() { this->listening_trigger_->trigger(); });
  break;
One thing I found: there is a mutex lock in the i2s_audio component that is locked and unlocked by both the microphone and the media_player. Since the microphone is always running, I suppose the media_player waits for the microphone to be stopped anyway, and that probably only happens once the pipeline gets torn down and started again. Maybe that's the underlying problem.
I switched to the dev branch, and in the latest release I think they lowered the priority of the media player relative to the voice assistant. The media player plays only when listening for the wake word is turned off.
Hi everyone, I'm facing the same problem as all of you. I'm currently using an M5Atom Echo for testing, as my ESP-WROOM-32 + accessories haven't arrived yet. I would also like to receive a voice response such as "Yes, how can I help?" in addition to the flashing blue LED as soon as my "Hey Jarvis" is recognised.
Have a look at this link (https://www.esp-voice.com/espvoice/). I have no idea what to make of it, especially as 45 is already a lot of money. But the thing does respond after recognising the wake word...
I could really use this, too!!!
@Chrissi02 have you looked at the ESPHome docs? You should be able to hook into one of the optional automation steps like on_wake_word_detected and have it say something: https://esphome.io/components/voice_assistant.html#configuration
Back when I last tried it, the issue was that you can't do audio output in on_wake_word_detected; it will just not play anything.
Cripes, I forgot your initial issue had on_wake_word_detected. My bad. What I'm doing right now is outputting to an external speaker via Home Assistant, but I know that not everyone has that functionality in their ecosystem. Did you try any of the other automation steps, @h3ndrik?
Link to Discord - this may be helpful.
I am (very) new to ESPHome configuration and still very much living and learning. The wake word sound (with this code) works for me; however, for some reason I lost other functionality included in my config.
I'm not allowed to see that conversation; could you post the relevant part here?