Clicking the microphone button does not start recording
Following the readme file, the webpage opens without problems, but when I click the microphone button, the prompt "Click to start transcription" stays displayed. I am unable to start recording or transcribing. What can I do to fix it?
Hi @Aurora-6
According to the logs, your server is running on port 8080. However, the WebSocket URL on the front page is still set to its default value, which uses port 8000. You can either deploy the server on port 8000 or update the WebSocket URL to use port 8080.
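A quick way to catch this kind of mismatch is to compare the port in the WebSocket URL with the port the server was started on. The sketch below assumes the default URL has the form `ws://localhost:8000/ws`; the function name is hypothetical and not part of the project.

```python
from urllib.parse import urlsplit


def websocket_port_matches(ws_url: str, server_port: int) -> bool:
    """Return True if ws_url points at server_port (hypothetical helper)."""
    parts = urlsplit(ws_url)
    # Fall back to the scheme's default port when the URL omits one.
    default_port = {"ws": 80, "wss": 443}.get(parts.scheme)
    port = parts.port if parts.port is not None else default_port
    return port == server_port


# Assumed default URL from the front page, checked against a server on 8080.
print(websocket_port_matches("ws://localhost:8000/ws", 8080))  # False
print(websocket_port_matches("ws://localhost:8080/ws", 8080))  # True
```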
It also seems that you are using Safari, which does not work in this case. Safari does not support the WebM/Opus open format. I recommend using a Chromium/Blink-based browser (e.g., Chrome) or a Gecko-based browser (e.g., Firefox) instead.
Best regards, Quentin
I opened it in Chrome and changed the "WebSocket URL" to match the server host and port, but it still doesn't work. The WebSocket connection opens ("WebSocket connection opened."), but clicking the button still gets no response.
When you click the button, you should see a pop-up asking for microphone permissions. The button will turn red when permissions are granted.
The web page never asked me for permission to use the microphone, and I can't change the microphone permission manually for the page.
This is not a whisper_streaming_web issue. You likely have restrictions on your machine or browser regarding the permissions that can be granted to webpages. Let me know if you’re able to resolve this problem on your side.
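One common cause of a missing permission prompt is that browsers only expose microphone capture (`getUserMedia`) in a secure context, which generally means HTTPS or a loopback address such as `http://localhost`. The sketch below is a rough approximation of that rule for checking the address the page is opened from; the function name is hypothetical, and browser policies or blocked pop-ups can still prevent the prompt.

```python
import ipaddress
from urllib.parse import urlsplit


def is_probably_secure_context(page_url: str) -> bool:
    """Approximate the browser rule: HTTPS or a loopback host over HTTP."""
    parts = urlsplit(page_url)
    host = parts.hostname or ""
    if parts.scheme == "https":
        return True
    if parts.scheme != "http":
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        # Covers 127.0.0.1 and ::1.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, e.g. a LAN hostname served over plain HTTP.
        return False


print(is_probably_secure_context("http://localhost:8000"))     # True
print(is_probably_secure_context("http://192.168.1.20:8000"))  # False
```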
Change http to https; uvicorn.run() then needs to be given an SSL certificate.
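A minimal sketch of what passing a certificate to `uvicorn.run()` can look like, using its `ssl_certfile` and `ssl_keyfile` parameters. The file names `cert.pem` and `key.pem` and the import string `whisper_fastapi_online_server:app` are assumptions for illustration; adjust them to the actual certificate files and application object.

```python
def build_uvicorn_kwargs(host="localhost", port=8000,
                         certfile="cert.pem", keyfile="key.pem"):
    """Collect keyword arguments for uvicorn.run() with TLS enabled.

    certfile and keyfile are placeholder paths (assumptions).
    """
    return {
        "host": host,
        "port": port,
        "ssl_certfile": certfile,  # PEM certificate
        "ssl_keyfile": keyfile,    # matching private key
    }


def serve(app_path="whisper_fastapi_online_server:app"):
    """Start the server over HTTPS; app_path is an assumed import string."""
    import uvicorn  # third-party; imported here so the helper above stays standalone

    uvicorn.run(app_path, **build_uvicorn_kwargs())
```

Calling `serve()` would then serve the page over `https://localhost:8000`. With a self-signed certificate the browser will show a warning that has to be accepted first.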
I set the port to 8000 and used "wss" instead of "ws" in the WebSocket URL. Now I can click the button to start recording, but there is no transcription, and nothing shows up on the server.
The URL should be wss://localhost:8000/ws. wss:// is the protocol, and /ws is the endpoint, which does not change.
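The relationship between the page address and the WebSocket URL can be sketched as follows: a page served over https needs `wss`, a page served over http uses `ws`, and the `/ws` endpoint stays the same. The helper name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def websocket_url_for_page(page_url: str, endpoint: str = "/ws") -> str:
    """Derive the WebSocket URL from the page URL (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Pages loaded over https must use the TLS variant of the protocol.
    scheme = "wss" if parts.scheme == "https" else "ws"
    return urlunsplit((scheme, parts.netloc, endpoint, "", ""))


print(websocket_url_for_page("https://localhost:8000"))  # wss://localhost:8000/ws
print(websocket_url_for_page("http://localhost:8000"))   # ws://localhost:8000/ws
```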
Thank you! Now it can output the transcription, but it will be stuck during decoding.
Can you check in Chrome, in dev tools and network tabs, if the browser sends data to /ws endpoint, and if the server sends results back to the browser?
So I ran whisper_fastapi_online_server.py and this is the terminal result. It freezes at the last "VAD: voice". I am attaching the network data too.
(Terminal) web output
WebSocket connection opened.
INFO: connection open
VAD: None
VAD: None
VAD: None
VAD: voice
Transcribing
Groq API processed accumulated 2 seconds
Transcription(text=' 1,2,3,4', task='transcribe', language='English', duration=1.74, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.76,
'text': ' 1,2,3,4', 'tokens': [50365, 502, 11, 17, 11, 18, 11, 19, 50453], 'temperature': 0, 'avg_logprob': -0.21730661, 'compression_ratio':
0.46666667, 'no_speech_prob': 0.20641193, 'words': [{'word': '1,2,3,4', 'start': 0, 'end': 1.76}]}], x_groq={'id': 'req_01jg5qgj06edjtn7c54krsy2wd'})
VAD: nonvoice
VAD: nonvoice
VAD: voice
Transcribing
Groq API processed accumulated 4 seconds
Transcription(text=' 5, 6, 7,', task='transcribe', language='English', duration=1.11, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.12,
'text': ' 5, 6, 7,', 'tokens': [50365, 1025, 11, 1386, 11, 1614, 11, 50421], 'temperature': 0, 'avg_logprob': -0.30156884, 'compression_rati
o': 0.5, 'no_speech_prob': 0.13851482, 'words': [{'word': '5, ', 'start': 0, 'end': 1.12}, {'word': '6, ', 'start': 0, 'end': 1.12}, {'word': '7,', 'start': 0, 'end': 1.12}]}], x_groq={'id': 'req_01jg5qgpwcesabf1a4spy0qkch'})
VAD: nonvoice
VAD: nonvoice
VAD: voice
Transcribing
Groq API processed accumulated 6 seconds
Transcription(text=' 11 12 13 14', task='transcribe', language='English', duration=1.95, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.
96, 'text': ' 11 12 13 14', 'tokens': [50365, 2975, 2272, 3705, 3499, 50463], 'temperature': 0, 'avg_logprob': -0.27616283, 'compression_rati
o': 0.57894737, 'no_speech_prob': 0.30244395, 'words': [{'word': '11 ', 'start': 0, 'end': 1.96}, {'word': '12 ', 'start': 0, 'end': 1.96}, {'word': '13 ', 'start': 0, 'end': 1.96}, {'word': '14', 'start': 0, 'end': 1.96}]}], x_groq={'id': 'req_01jg5qgtq9edkv4ebdxn1nqefr'})
VAD: nonvoice
VAD: voice
Transcribing
Groq API processed accumulated 8 seconds
Transcription(text=' Oh, hey, can you tell me', task='transcribe', language='English', duration=1.34, segments=[{'id': 0, 'seek': 0, 'start':
0, 'end': 1.34, 'text': ' Oh, hey, can you tell me', 'tokens': [50365, 876, 11, 4177, 11, 393, 291, 980, 385, 50432], 'temperature': 0, 'avg
_logprob': -0.3739299, 'compression_ratio': 0.75, 'no_speech_prob': 3.5308383e-06, 'words': [{'word': 'Oh, ', 'start': 0, 'end': 1.34}, {'wor
d': 'hey, ', 'start': 0, 'end': 1.34}, {'word': 'can ', 'start': 0, 'end': 1.34}, {'word': 'you ', 'start': 0, 'end': 1.34}, {'word': 'tell ', 'start': 0, 'end': 1.34}, {'word': 'me', 'start': 0, 'end': 1.34}]}], x_groq={'id': 'req_01jg5qgywafy9r91770dsejehp'})
VAD: voice
WebSocket connection opened.
INFO: connection open
VAD: None
VAD: None
VAD: None
VAD: voice
Transcribing
Groq API processed accumulated 2 seconds
Transcription(text=' 1,2,3,4', task='transcribe', language='English', duration=1.74, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.76,
'text': ' 1,2,3,4', 'tokens': [50365, 502, 11, 17, 11, 18, 11, 19, 50453], 'temperature': 0, 'avg_logprob': -0.21730661, 'compression_ratio':
0.46666667, 'no_speech_prob': 0.20641193, 'words': [{'word': '1,2,3,4', 'start': 0, 'end': 1.76}]}], x_groq={'id': 'req_01jg5qgj06edjtn7c54krsy2wd'})
Transcription(text=' 1,2,3,4', task='transcribe', language='English', duration=1.74, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.76,
'text': ' 1,2,3,4', 'tokens': [50365, 502, 11, 17, 11, 18, 11, 19, 50453], 'temperature': 0, 'avg_logprob': -0.21730661, 'compression_ratio':
0.46666667, 'no_speech_prob': 0.20641193, 'words': [{'word': '1,2,3,4', 'start': 0, 'end': 1.76}]}], x_groq={'id': 'req_01jg5qgj06edjtn7c54krsy2wd'})
VAD: nonvoice
VAD: nonvoice
VAD: voice
Transcribing
Groq API processed accumulated 4 seconds
Transcription(text=' 5, 6, 7,', task='transcribe', language='English', duration=1.11, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.12,
'text': ' 5, 6, 7,', 'tokens': [50365, 1025, 11, 1386, 11, 1614, 11, 50421], 'temperature': 0, 'avg_logprob': -0.30156884, 'compression_rati
o': 0.5, 'no_speech_prob': 0.13851482, 'words': [{'word': '5, ', 'start': 0, 'end': 1.12}, {'word': '6, ', 'start': 0, 'end': 1.12}, {'word': '7,', 'start': 0, 'end': 1.12}]}], x_groq={'id': 'req_01jg5qgpwcesabf1a4spy0qkch'})
Transcription(text=' 5, 6, 7,', task='transcribe', language='English', duration=1.11, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.12,
'text': ' 5, 6, 7,', 'tokens': [50365, 1025, 11, 1386, 11, 1614, 11, 50421], 'temperature': 0, 'avg_logprob': -0.30156884, 'compression_rati
o': 0.5, 'no_speech_prob': 0.13851482, 'words': [{'word': '5, ', 'start': 0, 'end': 1.12}, {'word': '6, ', 'start': 0, 'end': 1.12}, {'word': '7,', 'start': 0, 'end': 1.12}]}], x_groq={'id': 'req_01jg5qgpwcesabf1a4spy0qkch'})
VAD: nonvoice
VAD: nonvoice
VAD: voice
Transcribing
Groq API processed accumulated 6 seconds
Transcription(text=' 11 12 13 14', task='transcribe', language='English', duration=1.95, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.
96, 'text': ' 11 12 13 14', 'tokens': [50365, 2975, 2272, 3705, 3499, 50463], 'temperature': 0, 'avg_logprob': -0.27616283, 'compression_rati
o': 0.57894737, 'no_speech_prob': 0.30244395, 'words': [{'word': '11 ', 'start': 0, 'end': 1.96}, {'word': '12 ', 'start': 0, 'end': 1.96}, {'word': '13 ', 'start': 0, 'end': 1.96}, {'word': '14', 'start': 0, 'end': 1.96}]}], x_groq={'id': 'req_01jg5qgtq9edkv4ebdxn1nqefr'})
Transcription(text=' 11 12 13 14', task='transcribe', language='English', duration=1.95, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.
96, 'text': ' 11 12 13 14', 'tokens': [50365, 2975, 2272, 3705, 3499, 50463], 'temperature': 0, 'avg_logprob': -0.27616283, 'compression_rati
o': 0.57894737, 'no_speech_prob': 0.30244395, 'words': [{'word': '11 ', 'start': 0, 'end': 1.96}, {'word': '12 ', 'start': 0, 'end': 1.96}, {'word': '13 ', 'start': 0, 'end': 1.96}, {'word': '14', 'start': 0, 'end': 1.96}]}], x_groq={'id': 'req_01jg5qgtq9edkv4ebdxn1nqefr'})
VAD: nonvoice
VAD: voice
Transcribing
Groq API processed accumulated 8 seconds
Transcription(text=' Oh, hey, can you tell me', task='transcribe', language='English', duration=1.34, segments=[{'id': 0, 'seek': 0, 'start':
0, 'end': 1.34, 'text': ' Oh, hey, can you tell me', 'tokens': [50365, 876, 11, 4177, 11, 393, 291, 980, 385, 50432], 'temperature': 0, 'avg
_logprob': -0.3739299, 'compression_ratio': 0.75, 'no_speech_prob': 3.5308383e-06, 'words': [{'word': 'Oh, ', 'start': 0, 'end': 1.34}, {'wor
d': 'hey, ', 'start': 0, 'end': 1.34}, {'word': 'can ', 'start': 0, 'end': 1.34}, {'word': 'you ', 'start': 0, 'end': 1.34}, {'word': 'tell ', 'start': 0, 'end': 1.34}, {'word': 'me', 'start': 0, 'end': 1.34}]}], x_groq={'id': 'req_01jg5qgywafy9r91770dsejehp'})
Transcription(text=' Oh, hey, can you tell me', task='transcribe', language='English', duration=1.34, segments=[{'id': 0, 'seek': 0, 'start':
0, 'end': 1.34, 'text': ' Oh, hey, can you tell me', 'tokens': [50365, 876, 11, 4177, 11, 393, 291, 980, 385, 50432], 'temperature': 0, 'avg
_logprob': -0.3739299, 'compression_ratio': 0.75, 'no_speech_prob': 3.5308383e-06, 'words': [{'word': 'Oh, ', 'start': 0, 'end': 1.34}, {'wor
d': 'hey, ', 'start': 0, 'end': 1.34}, {'word': 'can ', 'start': 0, 'end': 1.34}, {'word': 'you ', 'start': 0, 'end': 1.34}, {'word': 'tell ', 'start': 0, 'end': 1.34}, {'word': 'me', 'start': 0, 'end': 1.34}]}], x_groq={'id': 'req_01jg5qgywafy9r91770dsejehp'})
VAD: voice
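The log above ends on a `VAD: voice` line with no `Transcribing` line after it. A small sketch for spotting that pattern in a saved log is shown below. It only relies on the line prefixes visible in this output; the function name and the returned keys are hypothetical.

```python
def summarize_server_log(lines):
    """Count voice and transcription events and report where the log ends."""
    voice_events = 0
    transcribe_events = 0
    last_marker = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("VAD: voice"):
            voice_events += 1
            last_marker = "voice"
        elif line.startswith("VAD:"):
            # "VAD: None" and "VAD: nonvoice"
            last_marker = "other_vad"
        elif line.startswith("Transcribing"):
            transcribe_events += 1
            last_marker = "transcribing"
    return {
        "voice_events": voice_events,
        "transcribe_events": transcribe_events,
        # True when the last recognized line is a voice event that was
        # never followed by a transcription.
        "ends_on_voice": last_marker == "voice",
    }


sample = [
    "VAD: None",
    "VAD: voice",
    "Transcribing",
    "Groq API processed accumulated 2 seconds",
    "VAD: nonvoice",
    "VAD: voice",
]
print(summarize_server_log(sample))
```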
I also noticed that the mic is continuously sending binary messages (the mic audio, perhaps), but they are not being processed. So my initial thought that the VACOnlineASRProcessor is buggy may be true. Here's the screenshot:
Yes, I’ll take a look at it. I ran some tests, and it seems the VAC is indeed not stable. (The binary messages correspond to the mic audio. It’s normal to have more binary messages than responses, as this depends on the compute speed of your server)
Yep that's what I thought! Thanks a lot
@QuentinFuxa By the way, would it be easier to use the Picovoice Cobra VAD instead? From what I saw, it is more accurate, faster, and more lightweight. If you want to look into it, see the links:
- Demo: https://picovoice.ai/platform/cobra/
- Docs: https://picovoice.ai/docs/cobra/
@QuentinFuxa Any updates??
@rupnil-codes Once I got Chrome to show the pop-ups, which were blocked by default, I switched back from wss to ws; now I get the pop-up and it runs.
I also had the problem of a blocked microphone and managed to fix it. However, I am not 100% sure what I did to fix it.
For me, on a Mac using Chrome (and also the Arc browser), the problem was that localhost was classified as insecure, so pop-ups and the microphone were blocked. There is an error message, but it disappears too quickly (#5).
Allowing pop-ups and then restarting the browser and the server is, I think, what made the pop-ups appear; they let me allow access to the mic and so run the streaming pipeline.
I also made the change from ws to wss (https://github.com/QuentinFuxa/whisper_streaming_web/issues/1#issuecomment-2560981888). I don't know if this was necessary to make the pop-up appear. Anyway, I undid the changes and am now using ws as before. @QuentinFuxa Thank you for the great pipeline, and especially for whisper-mlx!