Cognitive-Services-Voice-Assistant
SDK v1.13 draft: add KeywordRecognizer support to UWP VA
## Purpose
Speech SDK v1.12 introduced a new `KeywordRecognizer` object that enables standalone, on-device keyword matching without an active connection to Azure Speech Services. The audio associated with results from this object can then be routed into existing objects (such as the `DialogServiceConnector`) for use in existing scenarios.
This functionality has a significant benefit for voice assistant applications that may be initiated in a "cold start" situation:

1. The user speaks an activating utterance and expects something to happen ASAP.
2. The assistant application is activated in response to the "detected but not yet confirmed" keyword utterance.
3. Audio/app lifecycle spin-up occurs (latency hit of at least a few hundred milliseconds).
4. Before connecting to the speech service, an access token must be retrieved from another off-device source (latency hit, potentially 1s or more).
5. Once an access token is available, `DialogServiceConnector` won't begin processing keyword audio until a connection is established (latency hit, several hundred milliseconds).
6. The Speech SDK then processes the queued audio and catches up to the detected keyword (another few hundred milliseconds before the on-device result).
7. Only at this point (with an on-device confirmation result available) is it appropriate for the waiting user to receive feedback.
`KeywordRecognizer` allows us to parallelize and skip steps (4) and (5) above, typically saving more than 500ms in cold start and often saving multiple seconds (depending on token retrieval and connection establishment speeds). An on-device result can be obtained in parallel with networking needs, and `DialogServiceConnector`, as a consumer of the `KeywordRecognitionResult`'s audio, can catch up after user-facing action has already begun.
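The parallelization described above can be sketched as follows. This is a plain-Python illustration with simulated latencies, not Speech SDK code; the function names and sleep durations are hypothetical stand-ins for the real steps:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def confirm_keyword_on_device():
    # Stand-in for on-device keyword confirmation (KeywordRecognizer).
    time.sleep(0.1)
    return "keyword confirmed"

def prepare_connection():
    # Stand-in for token retrieval + connection establishment (steps 4 and 5).
    time.sleep(0.15)
    return "connection ready"

start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as pool:
    keyword_future = pool.submit(confirm_keyword_on_device)
    connection_future = pool.submit(prepare_connection)
    # User-facing feedback can begin as soon as the on-device result lands,
    # without waiting for the service connection to finish.
    feedback_at = keyword_future.result()
    connection_future.result()
elapsed = time.monotonic() - start
```

Because the two futures run concurrently, the elapsed time approaches the slower of the two steps rather than their sum, which is the source of the cold-start savings.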
This addresses #486.
Caveats: chaining a `KeywordRecognizer` into a `DialogServiceConnector` isn't trivial and requires both audio adapters and some state management. Investigation with v1.12 also revealed that multi-turn use of an audio stream derived from a `KeywordRecognitionResult` did not automatically consume recognized audio, which made effective use additionally challenging. This automatic consumption behavior is fixed in v1.13, and this change takes a dependency on that fix.
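The audio adapters mentioned above bridge push-style audio delivery into the pull-style stream a downstream consumer reads from. A minimal, hypothetical sketch in Python (the sample itself is C#; the class and method names here are illustrative, not the sample's real adapter API):

```python
import threading

class PushToPullAudioAdapter:
    """Buffers pushed audio chunks so a pull-based consumer (e.g. an input
    stream feeding a DialogServiceConnector) can read at its own pace."""

    def __init__(self):
        self._buffer = bytearray()
        self._closed = False
        self._cond = threading.Condition()

    def push(self, chunk: bytes):
        # Producer side: append audio and wake any waiting reader.
        with self._cond:
            self._buffer.extend(chunk)
            self._cond.notify()

    def close(self):
        # Signal end-of-stream so pending pulls can drain and return.
        with self._cond:
            self._closed = True
            self._cond.notify()

    def pull(self, size: int) -> bytes:
        # Consumer side: block until `size` bytes are available or the
        # stream is closed, then return up to `size` bytes.
        with self._cond:
            while len(self._buffer) < size and not self._closed:
                self._cond.wait()
            out = bytes(self._buffer[:size])
            del self._buffer[:size]
            return out
```

The pull side is also where per-read accounting can be added, which is what enables the deterministic rejection behavior described below.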
Further, since audio adapters were already necessary, this change also applies said adapters to improve the keyword rejection behavior (and remove the so-called "failsafe timer" approach):
- Prior to this change, all audio was pushed into the Speech SDK objects (`DialogServiceConnector`) as fast as possible, meaning we had no accounting of how much data had been consumed at any point.
- This meant we had no way of knowing whether we'd already evaluated enough audio to determine that there's no keyword in the input; we instead relied on a wall clock timer ("2.0 real seconds after the 'start audio' call, fire an event that deduces no keyword recognition is going to happen").
- The wall clock, failsafe approach isn't ideal: many variables impact the actual amount of audio we get a chance to process, so we need to be very conservative (usually evaluating a lot of extra audio) to ensure we don't give up too quickly in slower configurations/situations. Being conservative and consuming extra audio in turn means we have greater periods of "deafness" or unresponsiveness when evaluating false activations, directly harming end-to-end accuracy.
- With this change, audio is now pulled into the Speech SDK objects, so we can directly monitor how much audio has been requested (and therefore processed).
- This means we can deterministically conclude when a certain duration of audio has been evaluated and reject on that basis, rather than on an error-prone wall clock assessment.
- This is currently hard-coded to 2.0s of audio, calculated after the existing 2s preroll trim in `AgentAudioProducer`. This means we'll evaluate an audio range from approximately 1200ms before a keyword detection threshold to approximately 800ms after it, and conclude "no keyword" if no confirmation result is obtained from that evaluation.
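The pulled-audio accounting can be modeled as below, assuming 16 kHz, 16-bit mono PCM (32,000 bytes per second); the actual audio format is an assumption here, and this is an illustrative Python model rather than the C# implementation:

```python
# Assumed format: 16 kHz * 2 bytes per sample, mono PCM.
BYTES_PER_SECOND = 16000 * 2
# Mirrors the hard-coded 2.0s evaluation window described above.
REJECT_AFTER_SECONDS = 2.0

class PullAccountant:
    """Tracks how much audio the SDK has pulled and decides when enough
    has been evaluated to conclude 'no keyword'. Hypothetical sketch."""

    def __init__(self):
        self.bytes_pulled = 0
        self.confirmed = False  # set True on a keyword confirmation result

    def on_pull(self, num_bytes: int):
        # Called from the pull-stream read callback with each request size.
        self.bytes_pulled += num_bytes

    @property
    def seconds_evaluated(self) -> float:
        return self.bytes_pulled / BYTES_PER_SECOND

    def should_reject(self) -> bool:
        # Deterministic rejection: enough audio evaluated, no confirmation.
        return (not self.confirmed
                and self.seconds_evaluated >= REJECT_AFTER_SECONDS)
```

Because rejection keys off bytes actually requested rather than wall clock time, slow configurations simply take longer to reach the threshold instead of being cut off early.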
## Does this introduce a breaking change?

- [ ] Yes
- [ ] No
- [X] Maybe
Keyword detection metrics are likely impacted by the introduction of the new objects. Efforts were made to preserve the existing logic, but a regression is likely and can/should be addressed in a subsequent submission.
## Pull Request Type

- [ ] Bugfix
- [X] Feature
- [ ] Code style update (formatting, local variables)
- [ ] Refactoring (no functional changes, no API changes)
- [ ] Documentation content changes
- [ ] Other... Please describe:
## How to Test / What to Check

Note: as of draft time, validation is still in progress.

- Voice activations work: single & multi-turn, cold & warm start
- Push-to-talk works, both independently and in conjunction with voice activation