(bug?) StreamAssist always rushes from STT to INTENT before the user has finished speaking.

Open itispip opened this issue 1 year ago • 7 comments

If an STT start media is set, for example a WAV file of a "beep", then after the beep STT ends voice recognition almost immediately and moves to the intent stage, leaving no time for the user's real voice input.

If that STT start media happens not to work, then it does wait for the user's voice input. But even then, if the user's sentence is long, it very easily moves to the intent stage before the user has finished.

I think StreamAssist's mechanism for judging the end of the user's voice is too strict.

However, I found an advantage of the bug: by using a "beep" to stop STT immediately after WakeUp, the user can write an automation to take over control from here and loop the StreamAssist.run service from the starting phase (STT) to the end phase (TTS), achieving the purpose of a continuous conversation until a condition is met (for me, I set that condition as response_speech not containing a question mark).
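As a rough illustration of the stop condition described above (not the actual automation), the loop could be sketched in Python like this. `run_once` is a hypothetical stand-in for one STT-to-TTS run that returns the response speech as text, and `max_turns` is an added safety cap that is not part of the original idea.

```python
from typing import Callable, List


def converse(run_once: Callable[[], str], max_turns: int = 5) -> List[str]:
    """Repeat one assistant turn until the reply contains no question mark.

    run_once is a hypothetical callable standing in for a single
    STT -> intent -> TTS run; it must return the response speech text.
    max_turns is an assumed safety limit so the loop always terminates.
    """
    replies: List[str] = []
    for _ in range(max_turns):
        reply = run_once()
        replies.append(reply)
        # Stop as soon as the assistant is no longer asking something back.
        if "?" not in reply:
            break
    return replies
```

With a fake `run_once` that answers "Which room?" and then "Done.", the loop runs two turns and stops.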

itispip avatar Jun 29 '24 19:06 itispip

That's right. Please update this issue.

ghost avatar Aug 16 '24 01:08 ghost

This!!! So much this! Is it really not possible in the current implementation to add some kind of delay between the wake word being said and the start of listening (which gives up immediately if you don't speak as soon as the wake word has been uttered)? You have to say the wake word and the command/question in one quick sentence, which kind of goes against every voice assistant that has existed in the last decade. There seems to be a way of doing it with the "assist device" devices, but since Stream Assist oddly doesn't register itself as an assist device, I don't know what needs to change to make the timing less aggressive.

thefunkygibbon avatar Dec 02 '24 22:12 thefunkygibbon

I have the same issue. I have to talk over the beep, and loudly enough that Assist can detect my voice command. I am using a Tapo C200 camera. I will try another camera to see if it makes a difference. But I agree with itispip that there is not enough delay between stt-vad-start and stt-vad-end:

  • type: stt-start
    data:
      engine: stt.faster_whisper_2
      metadata:
        language: en
        format: wav
        codec: pcm
        bit_rate: 16
        sample_rate: 16000
        channel: 1
    timestamp: "2024-12-12T18:33:38.833777+00:00"
  • type: stt-vad-start
    data:
      timestamp: 373410
    timestamp: "2024-12-12T18:33:38.894963+00:00"
  • type: stt-vad-end
    data:
      timestamp: 374310
    timestamp: "2024-12-12T18:33:39.762615+00:00"
  • type: error
    data:
      code: stt-no-text-recognized
      message: No text recognized
    timestamp: "2024-12-12T18:33:39.996446+00:00"
  • type: run-end
    data: null
    timestamp: "2024-12-12T18:33:39.996931+00:00"
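
To put a number on how short that window is, here is a small Python sketch that takes the values from the stt-vad-start and stt-vad-end events above and computes the gap between them. The key name `data_timestamp` is made up for this example, and treating the `data` timestamp as milliseconds is an assumption (it is consistent with the wall-clock timestamps).

```python
from datetime import datetime

# Values copied from the two VAD events in the log above.
# "data_timestamp" is a hypothetical key name for the nested data timestamp.
events = {
    "stt-vad-start": {
        "data_timestamp": 373410,
        "timestamp": "2024-12-12T18:33:38.894963+00:00",
    },
    "stt-vad-end": {
        "data_timestamp": 374310,
        "timestamp": "2024-12-12T18:33:39.762615+00:00",
    },
}


def vad_window(evts):
    """Return (audio gap, wall-clock gap in seconds) between VAD start and end."""
    start, end = evts["stt-vad-start"], evts["stt-vad-end"]
    # Assumed to be milliseconds of audio time.
    audio_gap = end["data_timestamp"] - start["data_timestamp"]
    wall_gap = datetime.fromisoformat(end["timestamp"]) - datetime.fromisoformat(
        start["timestamp"]
    )
    return audio_gap, wall_gap.total_seconds()


audio_gap, wall_seconds = vad_window(events)
print(f"audio gap: {audio_gap}, wall-clock gap: {wall_seconds:.3f} s")
```

Both figures come out at a little under one second, so the whole listening window in this run closed in about 0.9 s.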

BeebleZap avatar Dec 10 '24 19:12 BeebleZap

I agree, it is a deal breaker IMHO. Users will normally wait for a signal (either audible or visual) to provide the prompt. In the current implementation, STT stops listening way before the user can even start.

shaiger avatar Jan 04 '25 17:01 shaiger

I've been trying to improve this for hours. Unfortunately, no luck so far.

Doesn't anyone have an idea?

collateral87 avatar Jan 04 '25 19:01 collateral87

For the moment I have put together a horrible hack that I hope to retire very soon: I have an automation that triggers on the Stream Assist wake word entity, and this automation then "presses" the assist button on the designated dashboard/device. This way I also get the STT window, which shows both the prompt and the response; I find that useful UX-wise, and it also makes it clear when to start talking. VERY rarely Stream Assist will actually make it to STT following the original wake word, and in such cases it will interfere with the new "press" trigger. But this almost never happens. I hate this hack, but the damn thing works.

shaiger avatar Jan 28 '25 15:01 shaiger

@shaiger please paste your hack here :)

PetePeter avatar Jan 29 '25 12:01 PetePeter