azure-search-openai-demo
Add speech recognizer and synthesis on browser interface
Purpose
Enable speech input and output for the browser interface.
Does this introduce a breaking change?
[ ] Yes
[x] No
Pull Request Type
What kind of change does this Pull Request introduce?
[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:
How to Test
- Get the code
git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
npm install
Tried to implement this; I get a blank screen with the error "Uncaught TypeError: HF is not a constructor QuestionInput.tsx:16". The lines with the problem are:
const SpeechRecognition = (window as any).speechRecognition || (window as any).webkitSpeechRecognition;
const recognition = new SpeechRecognition();
EDIT: Mozilla and some other browsers don't support webkit speech recognition; I had to override the default browser settings.
Fixed the bug: added a try/catch around the speech recognition constructor. The Web Speech API is only supported on certain browsers; recognition cannot be used on Mozilla and other unsupported browsers, but it will no longer throw an exception.
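A minimal sketch of that kind of guard (illustrative only, not the exact code from the PR; the lowercase window property from the snippet above is corrected here to the standard SpeechRecognition name):

const SpeechRecognitionImpl = (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

let recognition: any = null;
try {
    // On unsupported browsers (e.g. Firefox) SpeechRecognitionImpl is undefined,
    // so the constructor call throws; catch it instead of breaking the page.
    recognition = new SpeechRecognitionImpl();
    recognition.lang = "en-US";
    recognition.interimResults = false;
} catch (err) {
    console.warn("Speech recognition is not supported in this browser.", err);
}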
Hi, could you help review the PR? Thanks a lot.
@sowu880
This code's integration of speech increases the time it takes to process the request. A simple change would be to generate and display the text result first and let speech synthesis complete in the background, without adding further delay.
integration of speech increases the time it takes to process the request
Updated. For now, the text displays without waiting for speech generation.
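A rough sketch of that non-blocking flow (assuming the getSpeechApi helper discussed later in this thread; the state setters are illustrative, not the PR's exact code):

// Display the answer text right away, then fill in the audio whenever synthesis finishes.
function showAnswerThenSpeak(
    answer: string,
    setAnswer: (text: string) => void,
    setSpeechUrl: (url: string) => void
): void {
    setAnswer(answer);

    // Fire-and-forget: speech synthesis no longer blocks rendering of the answer.
    getSpeechApi(answer)
        .then(url => {
            if (url) {
                setSpeechUrl(url);
            }
        })
        .catch(err => console.error("Speech synthesis failed.", err));
}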
@sowu880 It seems that this PR doesn't include the creation of the speech resource. Can that be included as an optional resource in the Bicep files? Also, instead of using a key, can it use the ManagedIdentity credential? We are trying to avoid the use of API keys for security reasons.
Please also look at CONTRIBUTING.md to see how you can run linters and write tests on your code.
It seems that this PR doesn't include the creation of the speech resource. Can that be included as an optional resource in the Bicep files?
Hi, we designed this so that users would use their existing speech resource rather than creating a new one, since it's not required. I'm not sure it's necessary to add resource creation to the Bicep files.
Also, instead of using a key, can it use the ManagedIdentity credential? We are trying to avoid the use of API keys for security reasons.
Our SDK only supports key auth for now.
@sowu880 I see "auth_token" on https://learn.microsoft.com/en-us/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.speechconfig?view=azure-python , is that a different kind of auth token than the kind you can get from AzureDefaultCredential? I found this snippet that seemed to use it like that: https://github.com/csiebler/azure-cognitive-services-snippets/blob/a60a9a8c06c00ea52e0eccb702cba456f3547e07/aad-authentication/speech.py#L12
For this repo, we aim to have all resources created by Bicep so that deploys are replicable. We can still support optional features in Bicep; you can take a look at how Application Insights was added, via a bool parameter and conditionals in Bicep.
@sowu880 I see "auth_token" on https://learn.microsoft.com/en-us/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.speechconfig?view=azure-python , is that a different kind of auth token than the kind you can get from AzureDefaultCredential? I found this snippet that seemed to use it like that: https://github.com/csiebler/azure-cognitive-services-snippets/blob/a60a9a8c06c00ea52e0eccb702cba456f3547e07/aad-authentication/speech.py#L12
Hi @pamelafox, these comments have been addressed. Here are the updates:
- The speech resource will be created from main.bicep by default.
- Customers can set 'useSpeechResource' to false if they don't need speech.
- Customers can still use their own speech resource by setting speechServiceName and speechResourceGroupName, the same way as for the OpenAI resource.
- AAD auth is used and key auth has been removed. The AAD token is refreshed when it is no longer valid, the same way as the OpenAI token (see the token-refresh sketch after this list).
- Tests have been added in test_app.py.
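As a rough illustration of that token-refresh pattern (a TypeScript sketch for consistency with the frontend code in this thread; the PR's actual implementation lives in the Python backend, and the names here are illustrative):

import { DefaultAzureCredential, AccessToken } from "@azure/identity";

const credential = new DefaultAzureCredential();
let cachedToken: AccessToken | null = null;

// Return a valid AAD token for the Cognitive Services scope, refreshing shortly before expiry.
async function getSpeechToken(): Promise<string> {
    const expiringSoon = cachedToken !== null && cachedToken.expiresOnTimestamp - Date.now() < 60 * 1000;
    if (cachedToken === null || expiringSoon) {
        const refreshed = await credential.getToken("https://cognitiveservices.azure.com/.default");
        if (!refreshed) {
            throw new Error("Failed to acquire an AAD token for the Speech service.");
        }
        cachedToken = refreshed;
    }
    return cachedToken.token;
}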
This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed.
I'm currently working on implementing text-to-speech functionality, and I'm encountering an error in the Chat.tsx and Ask.tsx files. Specifically, the issue arises at the line:
speechUrl = await getSpeechApi(parsedResponse.answer);
I've reviewed the code, but I'm unable to identify the root cause of the error. Your assistance in resolving this matter would be greatly appreciated.
Thank you in advance.
@arsalanmubeen Hi, could you share more error details or logs?
@sowu880 The error is like this: TypeError: Cannot read properties of undefined (reading '0'), when reading properties from the chat completion response.
In the new repo, the chat request is sent to the backend with an ID token in the request headers, but the speech request to the backend is made without passing the id_token.
It's like this:
export async function getSpeechApi(text: string): Promise<string | null> {
    return await fetch("/speech", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text: text })
    })
        .then(response => {
            if (response.status == 200) {
                return response.blob();
            } else if (response.status == 400) {
                console.log("Speech synthesis is not enabled.");
                return null;
            } else {
                console.error("Unable to get speech synthesis.");
                return null;
            }
        })
        .then(blob => (blob ? URL.createObjectURL(blob) : null));
}
It should be like this:
export async function getSpeechApi(text: string, idToken: string | undefined): Promise<string | null> {
    return await fetch(`${BACKEND_URI}/speech`, {
        method: "POST",
        headers: getHeaders(idToken),
        body: JSON.stringify({
            text: text
        })
    })
        .then(response => {
            if (response.status == 200) {
                return response.blob();
            } else if (response.status == 400) {
                console.log("Speech synthesis is not enabled.");
                return null;
            } else {
                console.error("Unable to get speech synthesis.");
                return null;
            }
        })
        .then(blob => (blob ? URL.createObjectURL(blob) : null));
}
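For context, a caller might use the updated function roughly like this (a hedged sketch; the idToken plumbing and playback are illustrative, not the PR's exact code):

// Hypothetical caller: synthesize the answer on the backend and play it in the browser.
async function playAnswer(answerText: string, idToken: string | undefined): Promise<void> {
    const speechUrl = await getSpeechApi(answerText, idToken);
    if (speechUrl) {
        await new Audio(speechUrl).play();
    }
}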
Hi @pamelafox, can you help me set up speech-to-text functionality in the current working Chat/Ask bot? Thanks!
This is a very handy feature and I would very much appreciate it if this Pull Request was revisited. I notice that the feature is available in the https://github.com/Azure-Samples/chat-with-your-data-solution-accelerator repository if someone wants inspiration for how to do it!
@daptatea Will be merged soon!
This can be tried out here: https://app-backend-5hhse4yls5chk.azurewebsites.net/
@pamelafox - This is just awesome 😁 I just tried it and checked out the YouTube video!!!! I have two (hopefully simple) requests:
- Ability to choose voice as a configuration variable (Aussie accent 😁).
- Ability to split audio input from audio output. I have used Chrome and iPhone, and sometimes the browser audio SDK doesn't pick up all my questions. So, I might want to enable audio output but not enable input via configuration. Yes, it's a bit limited, but it makes the feature more robust from a customer experience perspective. I hope we can do this.
This is just awesome :)
@zedhaque Is #2 something that you think should be configured as a per-app setting or a per-user setting? There was another setting originally in this PR for "speak all answers" that really felt like it should be a user setting, so I removed it to simplify the PR and defer the decision on user settings UI.
For #1, I agree, I'll do that; we need to make it easy to opt out of the default of en-US.
@pamelafox - IMHO, I would prefer it to be an app setting. We are relying on browser SDKs, and they all work differently. I just tried MS Edge on a MacBook (it shows a pop-up - see attached). I have also noticed that in Safari, my audio-to-text gets autocorrected (sometimes autocorrect works, and sometimes it puts spaces between words, which end up being incorrect). I think this will create additional support tickets/calls in an enterprise setting. So, it’s best if the enterprise admin decides whether to enable it or not (for example, where all browsers/operating systems are the same and the feature works really well).
I agree with you that "speak all answers" is definitely for the future as user settings.
Check Broken URLs
We have automatically detected the following broken URLs in your files. Review and fix the paths to resolve this issue.
Check the file paths and associated broken URLs inside them. For more details, check our Contributing Guide.
File Full Path | Issues
---|---
./README.md | 1. https://learn.microsoft.com/azure/cognitive-services/manage-resources?tabs=azure-portal#purge-a-deleted-resource
@zedhaque I've made your suggested changes to split input/output and add a voice option. I also named the input setting INPUT_BROWSER and the output setting OUTPUT_AZURE, as I could imagine us adding INPUT_AZURE or OUTPUT_BROWSER in the future.
@pamelafox - Thank you very much for incorporating my suggestions 👯 I will give it a test run and report back if there are any issues. Many thanks :)
@pamelafox I deployed a version that depends solely on the Web Speech API for speech recognition and synthesis, so it's free. You can test it here: https://dfbsfb-lh4hrrtgs4a42-appservice.azurewebsites.net/
This is the PR where I added it: https://github.com/khelanmodi/build-24-langchain-vcore/pull/47
It's based on the same changes you have here for the speech recognition part, but it uses the same tool (the Web Speech API) for speech synthesis instead of the Azure Speech API.
You might ask why the synthesized voice sounds bad. It's the default en-US voice, called David, which is available on most browsers. You can try better voices from the list available here: https://mdn.github.io/dom-examples/web-speech-api/speak-easy-synthesis/ But each browser has its own set of available voices; if you open that URL in a different browser, the list of available voices changes. That's why I settled for the default, since it's available on most browsers, but I think with some extra work this could be customized or even added as a drop-down in the developer settings.
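For reference, a minimal sketch of that browser-only synthesis (a hedged example of the Web Speech API usage described above, not the code from the linked PR):

// Speak the given text with the browser's built-in speech synthesis (Web Speech API).
function speakWithBrowser(text: string, preferredVoiceName = ""): void {
    if (!("speechSynthesis" in window)) {
        console.warn("Speech synthesis is not supported in this browser.");
        return;
    }
    const utterance = new SpeechSynthesisUtterance(text);
    // getVoices() can be empty until the browser has fired the 'voiceschanged' event.
    const voice = preferredVoiceName
        ? window.speechSynthesis.getVoices().find(v => v.name.includes(preferredVoiceName))
        : undefined;
    if (voice) {
        utterance.voice = voice;
    }
    window.speechSynthesis.speak(utterance);
}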
@john0isaac @pamelafox @szhaomsft It seems requesting Microsoft voices through the Web Speech API is not a standard approach, and that is not the full list of our voices.
The reason we use the Azure Speech API is its great voice quality and prosody, with more than 100 locales supported. Our speech team has released many conversational voices recently; try the new voices. Many of our new voices can beat all competitors currently on the market. That's why we highly recommend using an Azure Speech resource; we have a big team supporting and maintaining these product voices.
My suggestion is to merge this "speak out" feature first, and then we can keep upgrading it as other requirements come up.
@john0isaac Thank you for sharing that, super helpful. I just tried it out and it even works in Edge on Mac (where the browser Speech Recognition does not work yet, sadly). I do agree with @sowu880 that the Azure voices are much more fluid, and I also selected a default for this PR that has the broadest language support possible, since developers use this repo across many languages.
So I think we should get this PR merged, and then could you send a PR to add a USE_SPEECH_OUTPUT_BROWSER option? That should be fairly compatible with the way I've modularized this PR, I think. Either the SpeechOutput component could take an additional answer prop and an enableBrowserOutput bool, or there could be a separate SpeechOutputBrowser vs. SpeechOutputAzure component.
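For example, the component route might look roughly like this (a hypothetical sketch of a SpeechOutputBrowser component, not code from this PR):

import { useEffect } from "react";

interface Props {
    answer: string;
}

// Speaks the answer with the browser's built-in synthesis instead of the Azure Speech API.
export const SpeechOutputBrowser = ({ answer }: Props) => {
    useEffect(() => {
        if (!answer || !("speechSynthesis" in window)) {
            return;
        }
        const utterance = new SpeechSynthesisUtterance(answer);
        window.speechSynthesis.speak(utterance);
        // Stop speaking when the answer changes or the component unmounts.
        return () => window.speechSynthesis.cancel();
    }, [answer]);

    return null; // A play/mute control could be rendered here instead.
};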
I've asked @mattgotteiner to take a look at this PR now, since it's a large change and large changes can use multiple eyes.
@sowu880 The only advantage is that it's free, so that's the value you get from it, and of course it won't be as good as using a paid service. I do agree with you that using the Azure Speech API is better; I just wanted to demonstrate other options for implementing this.
@pamelafox Sure, I will create a PR once this is merged to add it as an optional low-cost feature.