Voice Assistant: allow audio feedback tones on events

Open h3ndrik opened this issue 1 year ago • 46 comments

Describe the problem you have/What new integration you would like

It would be nice if the voice assistant was able to beep if the wake-word got detected (for example). Maybe also on other states.

Please describe your use case for this integration and alternatives you've tried:

Currently, I try to intercept the trigger on_wake_word_detected and tell it to play a sound (see below). But it doesn't work that way. Also, ESPHome doesn't wait for the on_... automation to finish, even if I add something like a delay. All the subsequent events fire milliseconds later. Lighting up LEDs works, though, as does media_player.play_media outside of the voice_assistant events.

Additional context

i2s_audio:
  [...]

media_player:
  [...]

microphone:
  [...]

voice_assistant:
  on_wake_word_detected:
    - media_player.play_media:
        id: wm8978_audio
        media_url: 'http://192.168.1.21:8123/local/sounds/receive.wav'

h3ndrik avatar Nov 21 '23 22:11 h3ndrik

That would be extremely useful. Today I have to look for the listen light to make sure I'm not talking to myself.

swiergot avatar Nov 24 '23 13:11 swiergot

I guess it is really important to have feedback from the voice assistant after wake word detection.

@h3ndrik I have no hardware for testing at this moment. Can you try something like this?


voice_assistant:
  on_wake_word_detected:
    - homeassistant.service:
        service: media_player.play_media
        data:
          entity_id: media_player.any_media_player_in_ha
          media_url: 'http://192.168.1.21:8123/local/sounds/receive.wav'
          media_content_type: music
          announce: "true"

demey avatar Nov 28 '23 22:11 demey

Additional info: You need to go to the ESPHome integration in HA, "configure" your device, and allow it to execute services. I've tried the following:

  on_wake_word_detected:
    - lambda: ESP_LOGD("voice_assistant", "TRIGGER on_wake_word_detected");
    - homeassistant.service:
        service: media_player.play_media
        data:
          #entity_id: media_player.any_media_player_in_ha
          entity_id: media_player.taudio_taudio_i2saudio
          media_url: 'http://192.168.1.21:8123/local/sounds/receive.wav'
          media_content_type: music
          announce: "true"

[13:49:59][D][voice_assistant:240]: VAD detected speech
[13:49:59][D][voice_assistant:438]: State changed from WAITING_FOR_VAD to START_PIPELINE
[13:49:59][D][voice_assistant:444]: Desired state set to STREAMING_MICROPHONE
[13:49:59][D][voice_assistant:256]: Requesting start...
[13:49:59][D][voice_assistant:438]: State changed from START_PIPELINE to STARTING_PIPELINE
[13:49:59][D][voice_assistant:459]: Client started, streaming microphone
[13:49:59][D][voice_assistant:438]: State changed from STARTING_PIPELINE to STREAMING_MICROPHONE
[13:49:59][D][voice_assistant:444]: Desired state set to STREAMING_MICROPHONE
[13:49:59][D][voice_assistant:542]: Event Type: 1
[13:49:59][D][voice_assistant:545]: Assist Pipeline running
[13:49:59][D][voice_assistant:465]: TRIGGER on_start
[13:49:59][D][voice_assistant:542]: Event Type: 9
[13:50:02][D][voice_assistant:542]: Event Type: 10
[13:50:02][D][voice_assistant:551]: Wake word detected
[13:50:02][D][voice_assistant:489]: TRIGGER on_wake_word_detected
[13:50:02][D][voice_assistant:542]: Event Type: 3
[13:50:02][D][voice_assistant:556]: STT started
[13:50:02][D][voice_assistant:467]: TRIGGER on_listening
[13:50:02][D][light:036]: 'NeoPixel Light' Setting:
[13:50:02][D][light:051]:   Brightness: 50%
[13:50:02][D][light:059]:   Red: 0%, Green: 0%, Blue: 100%
[13:50:02][D][light:109]:   Effect: 'Slow Pulse'
[13:50:02][D][media_player:059]: 'TAudio I2SAudio' - Setting
[13:50:02][D][media_player:063]:   Command: STOP
[13:50:04][D][voice_assistant:542]: Event Type: 11
[13:50:04][D][voice_assistant:680]: Starting STT by VAD
[13:50:04][D][voice_assistant:478]: TRIGGER on_stt_vad_start
[13:50:06][D][voice_assistant:542]: Event Type: 12
[13:50:06][D][voice_assistant:684]: STT by VAD end
[13:50:06][D][voice_assistant:480]: TRIGGER on_stt_vad_end
[13:50:06][D][light:036]: 'NeoPixel Light' Setting:
[13:50:06][D][light:051]:   Brightness: 50%
[13:50:06][D][light:059]:   Red: 0%, Green: 0%, Blue: 100%
[13:50:06][D][light:109]:   Effect: 'Fast Pulse'
[13:50:07][D][voice_assistant:542]: Event Type: 4
[13:50:07][D][voice_assistant:438]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[13:50:07][D][voice_assistant:444]: Desired state set to AWAITING_RESPONSE
[13:50:07][D][voice_assistant:571]: Speech recognised as: "das Wohnzimmerlicht ein"
[13:50:07][D][voice_assistant:511]: TRIGGER on_stt_end
[13:50:07][D][voice_assistant:438]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[13:50:07][D][voice_assistant:542]: Event Type: 5
[13:50:07][D][voice_assistant:576]: Intent started
[13:50:07][D][voice_assistant:542]: TRIGGER on_intent_start
[13:50:07][D][voice_assistant:438]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[13:50:07][D][voice_assistant:542]: Event Type: 6
[13:50:07][D][voice_assistant:544]: TRIGGER on_intent_end
[13:50:07][D][voice_assistant:542]: Event Type: 7
[13:50:07][D][voice_assistant:599]: Response: "wohnzimmerlicht eingeschaltet"
[13:50:07][D][voice_assistant:546]: TRIGGER on_tts_start
[13:50:07][D][light:036]: 'NeoPixel Light' Setting:
[13:50:07][D][light:051]:   Brightness: 50%
[13:50:07][D][light:059]:   Red: 0%, Green: 100%, Blue: 0%
[13:50:07][D][light:109]:   Effect: 'None'
[13:50:07][D][voice_assistant:542]: Event Type: 8
[13:50:07][D][voice_assistant:617]: Response URL: "http://192.168.1.21:8123/api/tts_proxy/6d14c05c27e38b28db1144b87cf987e0ce862b8b_de-de_60afa496b6_tts.piper.raw"
[13:50:07][D][media_player:059]: 'TAudio I2SAudio' - Setting
[13:50:07][D][media_player:066]:   Media URL: http://192.168.1.21:8123/api/tts_proxy/6d14c05c27e38b28db1144b87cf987e0ce862b8b_de-de_60afa496b6_tts.piper.raw
[13:50:07][D][voice_assistant:438]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[13:50:07][D][voice_assistant:444]: Desired state set to STREAMING_RESPONSE
[13:50:07][D][voice_assistant:568]: TRIGGER on_tts_end
[13:50:07][W][component:214]: Component api took a long time for an operation (0.06 s).
[13:50:07][W][component:215]: Components should block for at most 20-30ms.
[13:50:08][W][component:214]: Component i2s_audio.media_player took a long time for an operation (0.53 s).
[13:50:08][W][component:215]: Components should block for at most 20-30ms.
[13:50:08][D][voice_assistant:542]: Event Type: 2
[13:50:08][D][voice_assistant:629]: Assist Pipeline ended
[13:50:08][D][voice_assistant:586]: TRIGGER on_end
[13:50:11][W][component:214]: Component i2s_audio.media_player took a long time for an operation (0.51 s).
[13:50:11][W][component:215]: Components should block for at most 20-30ms.
[13:50:11][W][component:214]: Component i2s_audio.media_player took a long time for an operation (0.50 s).
[13:50:11][W][component:215]: Components should block for at most 20-30ms.
[13:50:11][D][light:036]: 'NeoPixel Light' Setting:
[13:50:11][D][light:051]:   Brightness: 40%
[13:50:11][D][light:059]:   Red: 100%, Green: 89%, Blue: 71%
[13:50:13][D][voice_assistant:438]: State changed from STREAMING_RESPONSE to IDLE
[13:50:13][D][voice_assistant:444]: Desired state set to IDLE

It runs my ESP_LOGD but the service call gets lost.

h3ndrik avatar Nov 29 '23 12:11 h3ndrik

@h3ndrik thanks a lot for the detailed answer. It looks like the ESPHome software must be changed for this purpose. Somewhere between TRIGGER on_wake_word_detected and TRIGGER on_listening, a confirmation event must be added, with the possibility to output to the speaker or a media player.

demey avatar Nov 29 '23 13:11 demey

[...] looks like that ESPHome software must be changed for this purpose [...]

Yeah, I think so, too. I don't have a good idea on how to implement it. Seems as of now the voice_assistant just uses Component::defer() to schedule the task and moves on with the state machine logic.
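To make the suspected ordering problem concrete, here is a minimal C++ sketch. It is a toy model and not ESPHome code: `ToyAssistant`, `run_demo`, and the state names are hypothetical, and the only thing taken from the observation above is the idea that the trigger is deferred to a later loop iteration while the state changes immediately.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a component with a deferred-callback queue.
// Hypothetical names; this only mimics the "defer, then move on" pattern.
class ToyAssistant {
 public:
  // Queue a callback to run on a later loop iteration.
  void defer(std::function<void()> fn) { deferred_.push(std::move(fn)); }

  // Handle an incoming event: schedule the user trigger, then change state at once.
  void on_event(const std::string &event, const std::string &next_state) {
    defer([this, event]() { log_.push_back("trigger:" + event); });
    state_ = next_state;
    log_.push_back("state:" + next_state);
  }

  // Run everything that has been deferred so far.
  void loop() {
    while (!deferred_.empty()) {
      auto fn = std::move(deferred_.front());
      deferred_.pop();
      fn();
    }
  }

  const std::vector<std::string> &log() const { return log_; }
  const std::string &state() const { return state_; }

 private:
  std::queue<std::function<void()>> deferred_;
  std::vector<std::string> log_;
  std::string state_{"IDLE"};
};

// Feed two back-to-back events, then run one loop iteration.
std::vector<std::string> run_demo() {
  ToyAssistant va;
  va.on_event("wake_word_detected", "WAKE_WORD_SEEN");
  va.on_event("listening", "STREAMING_MICROPHONE");
  // The state has already moved on before any trigger body has run.
  assert(va.state() == "STREAMING_MICROPHONE");
  va.loop();
  return va.log();
}
```

In this model both state changes are logged before either trigger body runs, which would match the log above where the follow-up events fire within the same second as on_wake_word_detected.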

I'm not sure if waiting / blocking has downsides for other users. But maybe we should come up with a solution that works with all the triggers. I could imagine another (short) beep being useful once it stops listening. And I'd maybe like to replace the whole returned TTS answer with a chime. But the last one isn't necessarily a valid use case.

It'd definitely be nice to stay within one mode of communication (audio) and not also have to look at a status led somewhere to see if it heard you and which state it is in.

h3ndrik avatar Nov 29 '23 15:11 h3ndrik

@h3ndrik I'm not familiar with C++, but based on my experience in other programming languages, after reviewing this code: https://github.com/esphome/esphome/blob/dev/esphome/components/voice_assistant/voice_assistant.cpp

I think it will be enough to add these lines after line 538. We also need to set the url value to the desired confirmation sound.

#ifdef USE_MEDIA_PLAYER
        if (this->media_player_ != nullptr) {
          this->media_player_->make_call().set_media_url(url).perform();
        }
#endif

demey avatar Nov 29 '23 19:11 demey

I think it will be enough to add these lines after line 538

That perform() seems to be an async call. It shows up in the debug log but several other things happen within the same second and I think it gets overwritten or something like that. In any case nothing gets played in the end. Like with the other approach.

And I don't think this is the right place. void VoiceAssistant::on_event(const api::VoiceAssistantEventResponse &msg) seems to handle the events from Home Assistant. And strictly speaking playing the tone isn't something HA told us to do. It's something from within the trigger where the url would be specified, too.

Maybe we need to wait for the triggered user-actions to finish, before the state-machine transitions to the next state. I believe I tried some 'wait_for' in the yaml. But the voice_assistant seems to like to listen to the home assistant events more than doing the wait I've told it to do.
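A rough C++ sketch of what "wait for the user actions before transitioning" could look like. This is only an illustration under assumptions: `GatedStateMachine`, its methods, and the state names are hypothetical and do not exist in ESPHome. The idea is that state changes requested by Home Assistant are queued and applied only once no triggered action is still pending.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy state machine that holds requested transitions until the
// user actions of the current trigger report completion.
class GatedStateMachine {
 public:
  // Called when a trigger starts a user action (e.g. a feedback tone).
  void action_started() { ++pending_actions_; }

  // Called when that action has finished.
  void action_finished() {
    assert(pending_actions_ > 0);
    --pending_actions_;
  }

  // Record a state requested by the server; do not apply it yet.
  void request_state(const std::string &next) { requested_.push_back(next); }

  // Apply at most one queued transition per iteration, and only
  // while no user action is still running.
  void loop() {
    if (pending_actions_ == 0 && !requested_.empty()) {
      state_ = requested_.front();
      requested_.pop_front();
    }
  }

  const std::string &state() const { return state_; }

 private:
  int pending_actions_{0};
  std::deque<std::string> requested_;
  std::string state_{"IDLE"};
};

// Start a tone, request a transition, and report the state seen
// while the tone is playing and after it has finished.
std::string demo_gated() {
  GatedStateMachine sm;
  sm.action_started();
  sm.request_state("LISTENING");
  sm.loop();
  const std::string during = sm.state();
  sm.action_finished();
  sm.loop();
  return during + "->" + sm.state();
}
```

Whether holding back transitions like this is acceptable for other users (for example because the server keeps sending events in the meantime) is exactly the open question.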

h3ndrik avatar Dec 04 '23 21:12 h3ndrik

@h3ndrik I agree that perform() seems to be an async call, but we need to set the url for the media player in any case. I have analyzed the code once again and have another version for testing. How about these lines after line 538?

#ifdef USE_MEDIA_PLAYER
      bool playing = false;
      if (this->media_player_ != nullptr) {
        this->media_player_->make_call().set_media_url("http://192.168.5.105:8123/local/sounds/confirm.mp3").perform();
        playing = (this->media_player_->state == media_player::MediaPlayerState::MEDIA_PLAYER_STATE_PLAYING);
      }
      if (playing) {
        this->set_timeout("playing", 2000, [this]() {
          this->cancel_timeout("speaker-timeout");
        });
      }
#endif

demey avatar Dec 06 '23 09:12 demey

Can you guys create a pull request for it? I need this feature :)))

kizovinh avatar Dec 06 '23 14:12 kizovinh

@kizovinh first of all, any pull request MUST be a working solution. At this moment we are still trying to find one.

demey avatar Dec 06 '23 14:12 demey

I know that. I just want to encourage you guys: this feature is very useful, keep going 👍

kizovinh avatar Dec 06 '23 14:12 kizovinh

I have another idea for working around it: using the "assist in progress" sensor to detect a pipeline start and play a media stream. Let me try.

kizovinh avatar Dec 06 '23 14:12 kizovinh

The result is the same: the speaker plays only after the pipeline ends.

kizovinh avatar Dec 06 '23 15:12 kizovinh

Finally, I attached a small RTTTL buzzer to an ESP32 that uses the Arduino framework and media_player instead of speaker. In the configuration it looks like this:

output:
  - platform: ledc
    id: buzzer_output
    pin: GPIO13
    frequency: 2000 Hz

rtttl:
  output: buzzer_output
  id: my_rtttl

voice_assistant:
  on_wake_word_detected:
    - rtttl.play: "two_short:d=4,o=5,b=100:16e6,16e6"

and it works like a charm.

demey avatar Dec 14 '23 15:12 demey

Yep, I think any general GPIO will work. I use an SG90 servo: when the wake word is detected, I make the servo rotate, and it produces a small but audible sound. But a real media player will still give better interaction.

kizovinh avatar Dec 14 '23 15:12 kizovinh

I was just going down the path of using speaker.play to hardcode a feedback sound in the yaml (to avoid waiting on #2429), but it has the same problem where the playback gets buffered until the assistant stops its loop.

vilhalmer avatar Dec 30 '23 15:12 vilhalmer

Did you know the RTTTL component can now also play sounds using the speaker component?

nielsnl68 avatar Dec 30 '23 16:12 nielsnl68

The problem is that the microphone continues to listen at the moment when the confirmation sound needs to play. I added a couple of lines to voice_assistant.cpp to stop the microphone during this period. Overall behavior improved, and the confirmation sound is no longer perceived as the beginning of a command. The system is stable, but only if the confirmation sound is output through the RTTTL component. All attempts to output a confirmation sound through media_player, speaker, or speaker as RTTTL were very unstable.

I currently use two versions of the voice assistant using confirmation sounds. The first outputs it to an external media player (as STT), the second to a separate RTTTL component.

demey avatar Dec 30 '23 17:12 demey

yeah, that is the issue indeed. I hope @jesserockz can find a way to auto switch between the two modes.

nielsnl68 avatar Dec 30 '23 17:12 nielsnl68

My changes in voice_assistant.cpp. Hope it can help:

    case api::enums::VOICE_ASSISTANT_WAKE_WORD_END: {
      ESP_LOGD(TAG, "Wake word detected");
      // stop microphone for confirmation signal
      this->mic_->stop();
      this->set_state_(State::STOPPING_MICROPHONE);
      this->defer([this]() { this->wake_word_detected_trigger_->trigger(); });
      break;
    }
    case api::enums::VOICE_ASSISTANT_STT_START:
      // delay to start microphone to finish confirmation signal
      this->set_timeout("pause", 800, [this]() {
        this->cancel_timeout("pause");
        this->set_state_(State::START_MICROPHONE);
      });
      ESP_LOGD(TAG, "STT started");
      this->defer([this]() { this->listening_trigger_->trigger(); });
      break;

demey avatar Dec 30 '23 17:12 demey

One thing I found: There is a mutex lock in the i2s_audio component. It is locked and unlocked by both the microphone and the media_player. Since the microphone is always running, I suppose the media_player waits for the microphone to be stopped anyway. And that probably happens once the pipeline gets torn down and started again. Maybe that's the underlying problem.
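A small C++ sketch of that suspicion. This is a toy model, not the i2s_audio implementation: `ToyI2SBus` and `player_can_start` are hypothetical names, and an atomic flag stands in for the mutex so the example stays single-threaded. It only shows that whoever takes the bus first keeps the other side from starting until it lets go.

```cpp
#include <atomic>
#include <cassert>

// Toy model of one I2S bus guarded by a single lock.
class ToyI2SBus {
 public:
  // Take the bus if it is free; returns false when someone else holds it.
  bool try_lock() {
    bool expected = false;
    return locked_.compare_exchange_strong(expected, true);
  }

  // Release the bus.
  void unlock() { locked_.store(false); }

 private:
  std::atomic<bool> locked_{false};
};

// Returns true if the media player can take the bus,
// depending on whether the microphone is already running.
bool player_can_start(bool microphone_running) {
  ToyI2SBus bus;
  if (microphone_running) {
    const bool mic_owns_bus = bus.try_lock();  // microphone grabs the bus first
    assert(mic_owns_bus);
    (void) mic_owns_bus;
  }
  return bus.try_lock();  // media player tries to start playback
}
```

With the microphone running, `player_can_start(true)` is false; once the microphone has released the bus (or never started), `player_can_start(false)` is true. If the real component behaves like this, playback would only begin when the microphone stops, which fits the tone being heard only after the pipeline ends.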

h3ndrik avatar Jan 07 '24 15:01 h3ndrik

I switched to the dev branch, and in the latest release I think they lowered the priority of the media player relative to the voice assistant. The media player plays only when listening for the wake word is turned off.

kizovinh avatar Jan 07 '24 15:01 kizovinh

Hi everyone, I'm facing the same problem as all of you. I'm currently using an M5Atom Echo for testing, as my ESP Wroom32 + accessories haven't arrived yet. I would also like to receive a voice response such as "Yes, how can I help?" in addition to the flashing blue LED as soon as my "Hey Jarvis" is recognised.

Have a look at this link (https://www.esp-voice.com/espvoice/). I have no idea what to make of it, especially as 45 is already a lot of money. But the thing has a response after recognising the wake word....

Chrissi02 avatar Jan 24 '24 13:01 Chrissi02

I could really use this, too!!!

mihaichrapan avatar Feb 02 '24 07:02 mihaichrapan

@Chrissi02 have you looked at the esphome docs? You should be able to hook into one of the optional automation steps like on_wake_word_detected and have it say something: https://esphome.io/components/voice_assistant.html#configuration

janstadt avatar Mar 07 '24 20:03 janstadt

Back when I last tried it, the issue was that you can't do audio output in 'on_wake_word_detected'; it will just not play anything.

h3ndrik avatar Mar 07 '24 21:03 h3ndrik

Cripes, I forgot your initial issue had on_wake_word_detected. My bad. What I'm doing right now is outputting to an external speaker via Home Assistant, but I know that not everyone has that functionality in their ecosystem. Did you try any of the other automation steps, @h3ndrik?

janstadt avatar Mar 07 '24 21:03 janstadt

Link to Discord - this may be helpful.

I am (very) new to ESPHome configuration, and still very much 'living & learning'. Wake word sound (with this code) works for me, however, (for some reason ...) I lost other functionality included in my config.

smoldersonline avatar Mar 08 '24 05:03 smoldersonline

I'm not allowed to see this conversation. Could you post the relevant part here?

derhappy avatar Mar 08 '24 06:03 derhappy

Screenshot 2024-03-08 at 07 53 18

smoldersonline avatar Mar 08 '24 06:03 smoldersonline