azure-search-openai-demo

Add speech recognition and synthesis to the browser interface

Open sowu880 opened this issue 1 year ago • 26 comments

Purpose

Enable speech input and output for the browser interface.

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

  • Get the code:

        git clone [repo-address]
        cd [repo-name]
        git checkout [branch-name]
        npm install

sowu880 avatar Apr 13 '23 16:04 sowu880

Tried to implement this; I get a blank screen with the error "Uncaught TypeError: HF is not a constructor QuestionInput.tsx:16". The lines with the problem are:

    const SpeechRecognition = (window as any).speechRecognition || (window as any).webkitSpeechRecognition;
    const recognition = new SpeechRecognition();

EDIT: Mozilla and other browsers usually don't support webkit speech recognition; I had to override the default browser settings.

unmuntean avatar Apr 19 '23 14:04 unmuntean

> Tried to implement this; I get a blank screen with the error "Uncaught TypeError: HF is not a constructor QuestionInput.tsx:16". The lines with the problem are: const SpeechRecognition = (window as any).speechRecognition || (window as any).webkitSpeechRecognition; const recognition = new SpeechRecognition();
>
> EDIT: Mozilla and other browsers usually don't support webkit speech recognition; I had to override the default browser settings.

Fixed the bug: added a try/catch around the speech recognition constructor. The Web Speech API is only supported in certain browsers; recognition still can't be used on Mozilla and other unsupported browsers, but it will no longer throw an exception.

sowu880 avatar May 04 '23 07:05 sowu880
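
A minimal sketch of that guarded constructor (note that the standard globals are `window.SpeechRecognition` / `window.webkitSpeechRecognition` with a capital S; the lowercase `speechRecognition` in the snippet quoted above always resolves to undefined):

```typescript
// Feature-detect the Web Speech API and guard the constructor so that
// unsupported browsers (e.g. Firefox) degrade gracefully instead of
// crashing with "... is not a constructor".
const SpeechRecognitionImpl =
    (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

let recognition: any = null;
try {
    if (SpeechRecognitionImpl) {
        recognition = new SpeechRecognitionImpl();
    } else {
        console.warn("Speech recognition is not supported in this browser.");
    }
} catch (e) {
    console.warn("Failed to initialize speech recognition:", e);
}
```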

Hi, could you help review the PR? Thanks a lot.

sowu880 avatar May 26 '23 05:05 sowu880

@sowu880

This code and the speech integration increase the time it takes to process a request. A simple change: generate and display the text result first, and let the speech synthesis complete in the background so it doesn't add further delay.


vrajroutu avatar Jun 11 '23 00:06 vrajroutu

> integration of the speech is increasing the time of processing the request

Updated. For now, the text displays without waiting for speech generation.

sowu880 avatar Sep 08 '23 08:09 sowu880
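
Roughly, the change looks like this (a sketch: `setAnswer` and `setSpeechUrl` are hypothetical React state setters, while `getSpeechApi` is the helper added in this PR):

```typescript
// Render the answer text immediately, then fetch the synthesized audio in
// the background so speech generation never blocks the text display.
setAnswer(parsedResponse.answer);
getSpeechApi(parsedResponse.answer).then(url => {
    if (url) {
        setSpeechUrl(url); // an <audio> element can then play this object URL
    }
});
```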

@sowu880 It seems that this PR doesn't include the creation of the speech resource. Can that be included as an optional resource in the Bicep files? Also, instead of using a key, can it use the ManagedIdentity credential? We are trying to avoid the use of API keys for security reasons.

pamelafox avatar Sep 08 '23 18:09 pamelafox

Please also look at CONTRIBUTING.md to see how you can run linters and write tests on your code.

pamelafox avatar Sep 08 '23 18:09 pamelafox

> @sowu880 It seems that this PR doesn't include the creation of the speech resource. Can that be included as an optional resource in the Bicep files? Also, instead of using a key, can it use the ManagedIdentity credential? We are trying to avoid the use of API keys for security reasons.

Hi, we designed this so that users bring their existing speech resource rather than creating a new one, since it's not required. I'm not sure it's necessary to add creation to the Bicep file.

sowu880 avatar Sep 11 '23 03:09 sowu880

> @sowu880 It seems that this PR doesn't include the creation of the speech resource. Can that be included as an optional resource in the Bicep files? Also, instead of using a key, can it use the ManagedIdentity credential? We are trying to avoid the use of API keys for security reasons.

Our SDK only supports key auth for now.

sowu880 avatar Sep 15 '23 12:09 sowu880

@sowu880 I see "auth_token" on https://learn.microsoft.com/en-us/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.speechconfig?view=azure-python , is that a different kind of auth token than the kind you can get from AzureDefaultCredential? I found this snippet that seemed to use it like that: https://github.com/csiebler/azure-cognitive-services-snippets/blob/a60a9a8c06c00ea52e0eccb702cba456f3547e07/aad-authentication/speech.py#L12

pamelafox avatar Sep 15 '23 13:09 pamelafox
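
For reference, the pattern in that linked snippet looks roughly like this in TypeScript (a sketch only: the resource ID and region are placeholders, and the `aad#<resourceId>#<token>` authorization-token format assumes a Speech resource with a custom domain):

```typescript
import { DefaultAzureCredential } from "@azure/identity";
import * as speechsdk from "microsoft-cognitiveservices-speech-sdk";

// Authenticate the Speech SDK with an AAD token instead of an API key.
async function buildSpeechConfig(): Promise<speechsdk.SpeechConfig> {
    const credential = new DefaultAzureCredential();
    const aadToken = await credential.getToken("https://cognitiveservices.azure.com/.default");

    // Placeholder: full ARM resource ID of the existing Speech resource.
    const resourceId =
        "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<name>";

    // The SDK accepts an authorization token of the form "aad#<resourceId>#<token>".
    const authorizationToken = `aad#${resourceId}#${aadToken.token}`;
    return speechsdk.SpeechConfig.fromAuthorizationToken(authorizationToken, "eastus");
}
```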

For this repo, we aim to have all resources created by Bicep, so that deployments are reproducible. We can still support optional features in Bicep; take a look at how Application Insights was added, via a bool parameter and conditionals in Bicep.

pamelafox avatar Sep 15 '23 13:09 pamelafox

> @sowu880 I see "auth_token" on https://learn.microsoft.com/en-us/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.speechconfig?view=azure-python , is that a different kind of auth token than the kind you can get from AzureDefaultCredential? I found this snippet that seemed to use it like that: https://github.com/csiebler/azure-cognitive-services-snippets/blob/a60a9a8c06c00ea52e0eccb702cba456f3547e07/aad-authentication/speech.py#L12

Hi @pamelafox, these comments have been addressed. Here are the updates:

  1. The speech resource is created from main.bicep by default.
  2. Customers can set 'useSpeechResource' to false if they don't need speech.
  3. Customers can still use their own speech resource by setting speechServiceName and speechResourceGroupName, the same way as for the openai resource.
  4. AAD auth is used and key auth is removed. The AAD token is refreshed when no longer valid, the same way as the openai token.
  5. Tests are added in test_app.py.

sowu880 avatar Sep 22 '23 08:09 sowu880

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed.

github-actions[bot] avatar Dec 30 '23 01:12 github-actions[bot]

I'm currently working on implementing text-to-speech functionality, and I'm encountering an error in the Chat.tsx and Ask.tsx files. Specifically, the issue arises at the line:

    speechUrl = await getSpeechApi(parsedResponse.answer);

I've reviewed the code, but I'm unable to identify the root cause of the error. Your assistance in resolving this matter would be greatly appreciated.

Thank you in advance.

arsalanmubeen avatar Feb 19 '24 15:02 arsalanmubeen

@arsalanmubeen Hi, could you share more error details or logs?

sowu880 avatar Feb 20 '24 03:02 sowu880

@sowu880 The error is like this: "TypeError: Cannot read properties of undefined (reading '0')" from the chat completion response.


In the new repo, the chat request is sent to the backend with headers built from an ID token, but the speech synthesis is requested from the backend without passing the id_token.

It's like:

    export async function getSpeechApi(text: string): Promise<string | null> {
        return await fetch("/speech", {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({ text: text })
        })
            .then(response => {
                if (response.status == 200) {
                    return response.blob();
                } else if (response.status == 400) {
                    console.log("Speech synthesis is not enabled.");
                    return null;
                } else {
                    console.error("Unable to get speech synthesis.");
                    return null;
                }
            })
            .then(blob => (blob ? URL.createObjectURL(blob) : null));
    }

It should be like this:

    export async function getSpeechApi(text: string, idToken: string | undefined): Promise<string | null> {
        return await fetch(`${BACKEND_URI}/speech`, {
            method: "POST",
            headers: getHeaders(idToken),
            body: JSON.stringify({ text: text })
        })
            .then(response => {
                if (response.status == 200) {
                    return response.blob();
                } else if (response.status == 400) {
                    console.log("Speech synthesis is not enabled.");
                    return null;
                } else {
                    console.error("Unable to get speech synthesis.");
                    return null;
                }
            })
            .then(blob => (blob ? URL.createObjectURL(blob) : null));
    }

arsalanmubeen avatar Feb 20 '24 12:02 arsalanmubeen

Hi @pamelafox, can you help me set up speech-to-text functionality in the current working Chat/Ask bot? Thanks!

DeepAsmani avatar Mar 15 '24 05:03 DeepAsmani

This is a very handy feature and I would very much appreciate it if this pull request were revisited. I notice the feature is available in the https://github.com/Azure-Samples/chat-with-your-data-solution-accelerator repository, if someone wants inspiration for how to do it!

daptatea avatar Apr 26 '24 08:04 daptatea

@daptatea Will be merged soon!

pamelafox avatar May 20 '24 20:05 pamelafox

This can be tried out here: https://app-backend-5hhse4yls5chk.azurewebsites.net/

pamelafox avatar May 20 '24 21:05 pamelafox

@pamelafox - This is just awesome 😁 I just tried it and checked out the YouTube video! I have two (hopefully simple) requests:

  1. Ability to choose the voice as a configuration variable (Aussie accent 😁).
  2. Ability to split audio input from audio output. I have used Chrome and an iPhone, and sometimes the browser audio SDK doesn't pick up all my questions. So I might want to enable audio output but not audio input via configuration. Yes, it's a bit limited, but it makes the feature more robust from a customer-experience perspective. I hope we can do this.

This is just awesome :)

zedhaque avatar May 21 '24 01:05 zedhaque

@zedhaque Is #2 something that you think should be configured as a per-app setting or a per-user setting? There was another setting originally in this PR for "speak all answers" that really felt like it should be a user setting, so I removed it to simplify the PR and defer the decision on user settings UI.

For #1, I agree; I'll do that. We need to make it easy to opt out of the en-US default.

pamelafox avatar May 21 '24 03:05 pamelafox

@pamelafox - IMHO, I would prefer it to be an app setting. We are relying on browser SDKs, and they all work differently. I just tried MS Edge on a MacBook (it shows a pop-up - see attached). I have also noticed that in Safari my audio-to-text gets autocorrected (sometimes autocorrect works, and sometimes it puts spaces between words, which ends up being incorrect). I think this will create additional support tickets/calls in an enterprise setting. So it's best if the enterprise admin decides whether to enable it (for example, where all browsers/operating systems are the same and the feature works really well).

I agree with you that "speak all answers" is definitely for the future as user settings.


zedhaque avatar May 21 '24 05:05 zedhaque

Check Broken URLs

We have automatically detected the following broken URLs in your files. Review and fix the paths to resolve this issue.

Check the file paths and associated broken URLs inside them. For more details, check our Contributing Guide.

File: ./README.md
Broken URL: https://learn.microsoft.com/azure/cognitive-services/manage-resources?tabs=azure-portal#purge-a-deleted-resource

github-actions[bot] avatar May 21 '24 12:05 github-actions[bot]

@zedhaque I've made your suggested changes to split input/output and add a voice option. I also named the input INPUT_BROWSER and the output OUTPUT_AZURE, since I could imagine us adding INPUT_AZURE or OUTPUT_BROWSER in the future.

pamelafox avatar May 23 '24 21:05 pamelafox
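
For anyone wiring this up themselves, the split could look something like this (a sketch with hypothetical flag names derived from the comment above, assuming a Vite frontend; the repo's actual variable names may differ):

```typescript
// Independent toggles so an admin can enable speech output without
// enabling browser speech input (or vice versa).
interface SpeechFeatureFlags {
    useSpeechInputBrowser: boolean; // microphone capture via the Web Speech API
    useSpeechOutputAzure: boolean;  // playback of Azure-synthesized audio
}

const flags: SpeechFeatureFlags = {
    useSpeechInputBrowser: import.meta.env.VITE_USE_SPEECH_INPUT_BROWSER === "true",
    useSpeechOutputAzure: import.meta.env.VITE_USE_SPEECH_OUTPUT_AZURE === "true"
};
```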

@pamelafox - Thank you very much for incorporating my suggestions 👯 I will give it a test run and report back if there are any issues. Many thanks :)

zedhaque avatar May 24 '24 00:05 zedhaque

@pamelafox I deployed a version that depends solely on the Web Speech API for both speech recognition and synthesis, so it's free. You can test it here: https://dfbsfb-lh4hrrtgs4a42-appservice.azurewebsites.net/

This is the PR where I added it: https://github.com/khelanmodi/build-24-langchain-vcore/pull/47

It's based on the same changes you have here for the speech recognition part, but it uses the same tool (the Web Speech API) for speech synthesis instead of the Azure Speech API.

You might ask why the synthesized voice is bad. It's the default en-US voice, called David, which is available in most browsers. You can use better voices from the list available here: https://mdn.github.io/dom-examples/web-speech-api/speak-easy-synthesis/

Each browser has its own set of available voices, and when you change browsers, the list of available voices changes; that's why I settled for the default one, as it's available in most browsers. With some extra work this could be customized or even added as a drop-down in the developer settings.

john0isaac avatar May 26 '24 20:05 john0isaac
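
For context, the browser-only synthesis path is small. A minimal sketch using the standard Web Speech API (the voice name is illustrative; availability varies per browser, so it falls back to the browser default):

```typescript
// Free, client-side text-to-speech via the Web Speech API.
function speakAnswer(text: string, preferredVoiceName = "David"): void {
    const utterance = new SpeechSynthesisUtterance(text);
    // getVoices() can return an empty list until the "voiceschanged" event fires.
    const voices = window.speechSynthesis.getVoices();
    utterance.voice = voices.find(v => v.name.includes(preferredVoiceName)) ?? null;
    window.speechSynthesis.speak(utterance); // a null voice means the browser default
}
```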

@john0isaac @pamelafox @szhaomsft Requesting Microsoft voices through the Web Speech API is not a standard approach, and that is not the full list of our voices.

The reason we use the Azure Speech API is its great voice quality and prosody, with more than 100 locales. Our speech team has released many conversational voices recently; try the new voices. Many of our new voices beat all competitors currently on the market. That's why we highly recommend using an Azure speech resource: we have a big team supporting and maintaining these product voices.

My suggestion is to merge this "speak out" feature first; we can then continuously upgrade it to meet other requirements.

sowu880 avatar May 27 '24 05:05 sowu880

@john0isaac Thank you for sharing that, super helpful. I just tried it out and it even works in Edge on Mac (where the browser Speech Recognition does not work yet, sadly). I do agree with @sowu880 that the Azure voices are much more fluid, and I also selected a default for this PR that has the broadest language support possible, since developers use this repo across many languages.

So I think we should get this PR merged, and then could you send a PR to add a USE_SPEECH_OUTPUT_BROWSER option? That should be fairly compatible with the way I've modularized this PR, I think. Either the SpeechOutput component could take an additional answer prop and an enableBrowserOutput bool, or there could be separate SpeechOutputBrowser and SpeechOutputAzure components.

I've asked @mattgotteiner to take a look at this PR now, since it's a large change and large changes can use multiple eyes.

pamelafox avatar May 28 '24 17:05 pamelafox
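
One way that split might look (a sketch only; the component names come from the comment above, and the props and file paths are illustrative, not the PR's actual code):

```tsx
// Pick the speech-output implementation from a config flag.
import { SpeechOutputAzure } from "./SpeechOutputAzure";     // plays backend-synthesized audio
import { SpeechOutputBrowser } from "./SpeechOutputBrowser"; // speaks text client-side

interface SpeechOutputProps {
    answer: string;            // raw answer text, used by the browser path
    speechUrl: string | null;  // audio URL from /speech, used by the Azure path
    useBrowserOutput: boolean; // hypothetical USE_SPEECH_OUTPUT_BROWSER flag
}

export const SpeechOutput = ({ answer, speechUrl, useBrowserOutput }: SpeechOutputProps) =>
    useBrowserOutput ? <SpeechOutputBrowser answer={answer} /> : <SpeechOutputAzure url={speechUrl} />;
```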

@sowu880 The only advantage is that it's free, so that's the value you get from it, and of course it won't be as good as a paid service. I do agree with you that using the Azure Speech API is better; I just wanted to demonstrate other options for implementing this.

@pamelafox Sure, I will create a PR once this is merged to add it as an optional low-cost feature.

john0isaac avatar May 28 '24 18:05 john0isaac