Clicking the microphone button does not start recording

Open Aurora-6 opened this issue 11 months ago • 17 comments

Following the README, I can open the web page without problems, but when I click the microphone button, the prompt "Click to start transcription" stays displayed. Recording never starts, so nothing is transcribed. What can I do to fix it?

image

image

Aurora-6 avatar Dec 24 '24 08:12 Aurora-6

Hi @Aurora-6

According to the logs, your server is running on port 8080. However, the WebSocket URL on the front page is still set to its default value, which uses port 8000. You can either deploy the server on port 8000 or update the WebSocket URL to use port 8080.

It also seems that you are using Safari, which does not work in this case. Safari does not support the WebM/Opus open format. I recommend using a Chromium/Blink-based browser (e.g., Chrome) or a Gecko-based browser (e.g., Firefox) instead.

Best regards, Quentin

QuentinFuxa avatar Dec 24 '24 08:12 QuentinFuxa

I opened the page in Chrome and set the "WebSocket URL" to match the server's host and port, but it still doesn't work. The WebSocket connection opens ("WebSocket connection opened."), but clicking the button still does nothing.

image

image

Aurora-6 avatar Dec 24 '24 09:12 Aurora-6

image

When you click the button, you should see a pop-up asking for microphone permissions. The button will turn red when permissions are granted.

image

QuentinFuxa avatar Dec 24 '24 09:12 QuentinFuxa

The web page never asked me for permission to use the microphone, and I can't change the microphone permission for the page manually.

image

Aurora-6 avatar Dec 24 '24 09:12 Aurora-6

This is not a whisper_streaming_web issue. You likely have restrictions on your machine or browser regarding the permissions that can be granted to webpages. Let me know if you’re able to resolve this problem on your side.
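
One restriction that commonly produces this symptom (an assumption here, it was not confirmed for this report) is that browsers only expose microphone access to pages loaded in a secure context: HTTPS, or a loopback address such as localhost. The sketch below approximates that rule in Python so you can check the address typed into the browser. The function name is hypothetical and the rule is simplified; real browsers apply more cases and policies.

```python
import ipaddress
from urllib.parse import urlsplit


def is_probably_secure_context(page_url: str) -> bool:
    """Rough approximation of the browser's 'secure context' rule.

    Returns True for https/wss URLs and for loopback hosts
    (localhost, *.localhost, 127.0.0.0/8, ::1). Everything else,
    including LAN addresses and 0.0.0.0 over plain http, returns False.
    """
    parts = urlsplit(page_url)
    if parts.scheme in ("https", "wss"):
        return True

    host = (parts.hostname or "").lower()
    if host == "localhost" or host.endswith(".localhost"):
        return True

    try:
        # Literal IP address: only loopback addresses count as trustworthy.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (an ordinary hostname over plain http).
        return False


if __name__ == "__main__":
    for url in ("http://localhost:8000", "http://0.0.0.0:8000", "https://192.168.1.20:8000"):
        print(url, is_probably_secure_context(url))
```

If the page address fails this check, the browser may never show the microphone prompt, whatever the server does.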

QuentinFuxa avatar Dec 24 '24 09:12 QuentinFuxa

I changed http to https; for this, uvicorn.run() needs to be given an SSL certificate.
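
For anyone trying the same thing, here is a minimal sketch of passing a certificate to uvicorn.run(). The `ssl_certfile` and `ssl_keyfile` parameters are real uvicorn options; the helper names, the certificate file names, and the `whisper_fastapi_online_server:app` import string are assumptions for illustration, so adjust them to your setup.

```python
def uvicorn_tls_kwargs(certfile: str, keyfile: str,
                       host: str = "localhost", port: int = 8000) -> dict:
    """Collect the keyword arguments that make uvicorn serve HTTPS/WSS."""
    return {
        "host": host,
        "port": port,
        "ssl_certfile": certfile,  # path to the PEM certificate
        "ssl_keyfile": keyfile,    # path to the matching private key
    }


def serve_with_tls(app_path: str, certfile: str, keyfile: str) -> None:
    """Start the server over TLS. `app_path` is a 'module:attribute' string."""
    import uvicorn  # third-party; imported lazily so the helper above stays testable

    uvicorn.run(app_path, **uvicorn_tls_kwargs(certfile, keyfile))


# Example call (assumed module/app name and hypothetical certificate paths):
# serve_with_tls("whisper_fastapi_online_server:app", "cert.pem", "key.pem")
```

With a self-signed certificate the browser will still show a warning that has to be accepted before the page loads.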

I set the port to 8000 and used "wss" instead of "ws" in the WebSocket URL. Now I can click the button to start recording, but there is no transcription, and nothing shows up on the server.

image

image

Aurora-6 avatar Dec 24 '24 10:12 Aurora-6

The URL should be wss://localhost:8000/ws. wss:// is the protocol; /ws is the endpoint, and it does not change.
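
The rule can be sketched as a small Python helper that derives the WebSocket URL from the address the page is served on: https maps to wss, http maps to ws, the host and port are kept, and the endpoint is always /ws. The function name is hypothetical; it only illustrates how the pieces of the URL fit together.

```python
from urllib.parse import urlsplit


def websocket_url_for(page_url: str, endpoint: str = "/ws") -> str:
    """Build the WebSocket URL that matches the page's scheme, host and port."""
    parts = urlsplit(page_url)
    # A page served over https must use wss; plain http uses ws.
    scheme = "wss" if parts.scheme == "https" else "ws"
    # netloc keeps "host:port" exactly as given in the page URL.
    return f"{scheme}://{parts.netloc}{endpoint}"


if __name__ == "__main__":
    print(websocket_url_for("https://localhost:8000/"))
    print(websocket_url_for("http://localhost:8080/"))
```

This also covers the earlier port mismatch: a server started on port 8080 needs a WebSocket URL ending in :8080/ws, not :8000/ws.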

QuentinFuxa avatar Dec 24 '24 10:12 QuentinFuxa

Thank you! Now it outputs the transcription, but it gets stuck during decoding.

image

Aurora-6 avatar Dec 24 '24 10:12 Aurora-6

Can you check in Chrome's dev tools, in the Network tab, whether the browser sends data to the /ws endpoint and whether the server sends results back to the browser?

image

QuentinFuxa avatar Dec 24 '24 11:12 QuentinFuxa

So I ran whisper_fastapi_online_server.py, and this is the terminal result. It freezes at the last "VAD: voice". I am attaching the network data too.

image

(Terminal) web output

WebSocket connection opened.
INFO:     connection open
VAD: None
VAD: None
VAD: None
VAD: voice
Transcribing
Groq API processed accumulated 2 seconds
Transcription(text=' 1,2,3,4', task='transcribe', language='English', duration=1.74, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.76, 
'text': ' 1,2,3,4', 'tokens': [50365, 502, 11, 17, 11, 18, 11, 19, 50453], 'temperature': 0, 'avg_logprob': -0.21730661, 'compression_ratio':
 0.46666667, 'no_speech_prob': 0.20641193, 'words': [{'word': '1,2,3,4', 'start': 0, 'end': 1.76}]}], x_groq={'id': 'req_01jg5qgj06edjtn7c54krsy2wd'})
VAD: nonvoice
VAD: nonvoice
VAD: voice
Transcribing
Groq API processed accumulated 4 seconds
Transcription(text=' 5, 6, 7,', task='transcribe', language='English', duration=1.11, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.12,
 'text': ' 5, 6, 7,', 'tokens': [50365, 1025, 11, 1386, 11, 1614, 11, 50421], 'temperature': 0, 'avg_logprob': -0.30156884, 'compression_rati
o': 0.5, 'no_speech_prob': 0.13851482, 'words': [{'word': '5, ', 'start': 0, 'end': 1.12}, {'word': '6, ', 'start': 0, 'end': 1.12}, {'word': '7,', 'start': 0, 'end': 1.12}]}], x_groq={'id': 'req_01jg5qgpwcesabf1a4spy0qkch'})
VAD: nonvoice
VAD: nonvoice
VAD: voice
Transcribing
Groq API processed accumulated 6 seconds    
Transcription(text=' 11 12 13 14', task='transcribe', language='English', duration=1.95, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.
96, 'text': ' 11 12 13 14', 'tokens': [50365, 2975, 2272, 3705, 3499, 50463], 'temperature': 0, 'avg_logprob': -0.27616283, 'compression_rati
o': 0.57894737, 'no_speech_prob': 0.30244395, 'words': [{'word': '11 ', 'start': 0, 'end': 1.96}, {'word': '12 ', 'start': 0, 'end': 1.96}, {'word': '13 ', 'start': 0, 'end': 1.96}, {'word': '14', 'start': 0, 'end': 1.96}]}], x_groq={'id': 'req_01jg5qgtq9edkv4ebdxn1nqefr'})        
VAD: nonvoice
VAD: voice
Transcribing
Groq API processed accumulated 8 seconds
Transcription(text=' Oh, hey, can you tell me', task='transcribe', language='English', duration=1.34, segments=[{'id': 0, 'seek': 0, 'start':
 0, 'end': 1.34, 'text': ' Oh, hey, can you tell me', 'tokens': [50365, 876, 11, 4177, 11, 393, 291, 980, 385, 50432], 'temperature': 0, 'avg
_logprob': -0.3739299, 'compression_ratio': 0.75, 'no_speech_prob': 3.5308383e-06, 'words': [{'word': 'Oh, ', 'start': 0, 'end': 1.34}, {'wor
d': 'hey, ', 'start': 0, 'end': 1.34}, {'word': 'can ', 'start': 0, 'end': 1.34}, {'word': 'you ', 'start': 0, 'end': 1.34}, {'word': 'tell ', 'start': 0, 'end': 1.34}, {'word': 'me', 'start': 0, 'end': 1.34}]}], x_groq={'id': 'req_01jg5qgywafy9r91770dsejehp'})
VAD: voice

WebSocket connection opened.
INFO:     connection open
VAD: None
VAD: None
VAD: None
VAD: voice
Transcribing
Groq API processed accumulated 2 seconds
Transcription(text=' 1,2,3,4', task='transcribe', language='English', duration=1.74, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.76, 
'text': ' 1,2,3,4', 'tokens': [50365, 502, 11, 17, 11, 18, 11, 19, 50453], 'temperature': 0, 'avg_logprob': -0.21730661, 'compression_ratio':
 0.46666667, 'no_speech_prob': 0.20641193, 'words': [{'word': '1,2,3,4', 'start': 0, 'end': 1.76}]}], x_groq={'id': 'req_01jg5qgj06edjtn7c54krsy2wd'})
Transcription(text=' 1,2,3,4', task='transcribe', language='English', duration=1.74, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.76, 
'text': ' 1,2,3,4', 'tokens': [50365, 502, 11, 17, 11, 18, 11, 19, 50453], 'temperature': 0, 'avg_logprob': -0.21730661, 'compression_ratio':
 0.46666667, 'no_speech_prob': 0.20641193, 'words': [{'word': '1,2,3,4', 'start': 0, 'end': 1.76}]}], x_groq={'id': 'req_01jg5qgj06edjtn7c54krsy2wd'})
VAD: nonvoice
VAD: nonvoice
VAD: voice
Transcribing
Groq API processed accumulated 4 seconds
Transcription(text=' 5, 6, 7,', task='transcribe', language='English', duration=1.11, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.12,
 'text': ' 5, 6, 7,', 'tokens': [50365, 1025, 11, 1386, 11, 1614, 11, 50421], 'temperature': 0, 'avg_logprob': -0.30156884, 'compression_rati
o': 0.5, 'no_speech_prob': 0.13851482, 'words': [{'word': '5, ', 'start': 0, 'end': 1.12}, {'word': '6, ', 'start': 0, 'end': 1.12}, {'word': '7,', 'start': 0, 'end': 1.12}]}], x_groq={'id': 'req_01jg5qgpwcesabf1a4spy0qkch'})
Transcription(text=' 5, 6, 7,', task='transcribe', language='English', duration=1.11, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.12,
 'text': ' 5, 6, 7,', 'tokens': [50365, 1025, 11, 1386, 11, 1614, 11, 50421], 'temperature': 0, 'avg_logprob': -0.30156884, 'compression_rati
o': 0.5, 'no_speech_prob': 0.13851482, 'words': [{'word': '5, ', 'start': 0, 'end': 1.12}, {'word': '6, ', 'start': 0, 'end': 1.12}, {'word': '7,', 'start': 0, 'end': 1.12}]}], x_groq={'id': 'req_01jg5qgpwcesabf1a4spy0qkch'})
VAD: nonvoice
VAD: nonvoice
VAD: voice
Transcribing
Groq API processed accumulated 6 seconds
Transcription(text=' 11 12 13 14', task='transcribe', language='English', duration=1.95, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.
96, 'text': ' 11 12 13 14', 'tokens': [50365, 2975, 2272, 3705, 3499, 50463], 'temperature': 0, 'avg_logprob': -0.27616283, 'compression_rati
o': 0.57894737, 'no_speech_prob': 0.30244395, 'words': [{'word': '11 ', 'start': 0, 'end': 1.96}, {'word': '12 ', 'start': 0, 'end': 1.96}, {'word': '13 ', 'start': 0, 'end': 1.96}, {'word': '14', 'start': 0, 'end': 1.96}]}], x_groq={'id': 'req_01jg5qgtq9edkv4ebdxn1nqefr'})        
Transcription(text=' 11 12 13 14', task='transcribe', language='English', duration=1.95, segments=[{'id': 0, 'seek': 0, 'start': 0, 'end': 1.
96, 'text': ' 11 12 13 14', 'tokens': [50365, 2975, 2272, 3705, 3499, 50463], 'temperature': 0, 'avg_logprob': -0.27616283, 'compression_rati
o': 0.57894737, 'no_speech_prob': 0.30244395, 'words': [{'word': '11 ', 'start': 0, 'end': 1.96}, {'word': '12 ', 'start': 0, 'end': 1.96}, {'word': '13 ', 'start': 0, 'end': 1.96}, {'word': '14', 'start': 0, 'end': 1.96}]}], x_groq={'id': 'req_01jg5qgtq9edkv4ebdxn1nqefr'})        
VAD: nonvoice
VAD: voice
Transcribing
Groq API processed accumulated 8 seconds
Transcription(text=' Oh, hey, can you tell me', task='transcribe', language='English', duration=1.34, segments=[{'id': 0, 'seek': 0, 'start':
 0, 'end': 1.34, 'text': ' Oh, hey, can you tell me', 'tokens': [50365, 876, 11, 4177, 11, 393, 291, 980, 385, 50432], 'temperature': 0, 'avg
_logprob': -0.3739299, 'compression_ratio': 0.75, 'no_speech_prob': 3.5308383e-06, 'words': [{'word': 'Oh, ', 'start': 0, 'end': 1.34}, {'wor
d': 'hey, ', 'start': 0, 'end': 1.34}, {'word': 'can ', 'start': 0, 'end': 1.34}, {'word': 'you ', 'start': 0, 'end': 1.34}, {'word': 'tell ', 'start': 0, 'end': 1.34}, {'word': 'me', 'start': 0, 'end': 1.34}]}], x_groq={'id': 'req_01jg5qgywafy9r91770dsejehp'})
Transcription(text=' Oh, hey, can you tell me', task='transcribe', language='English', duration=1.34, segments=[{'id': 0, 'seek': 0, 'start':
 0, 'end': 1.34, 'text': ' Oh, hey, can you tell me', 'tokens': [50365, 876, 11, 4177, 11, 393, 291, 980, 385, 50432], 'temperature': 0, 'avg
_logprob': -0.3739299, 'compression_ratio': 0.75, 'no_speech_prob': 3.5308383e-06, 'words': [{'word': 'Oh, ', 'start': 0, 'end': 1.34}, {'wor
d': 'hey, ', 'start': 0, 'end': 1.34}, {'word': 'can ', 'start': 0, 'end': 1.34}, {'word': 'you ', 'start': 0, 'end': 1.34}, {'word': 'tell ', 'start': 0, 'end': 1.34}, {'word': 'me', 'start': 0, 'end': 1.34}]}], x_groq={'id': 'req_01jg5qgywafy9r91770dsejehp'})
VAD: voice

rupnil-codes avatar Dec 28 '24 04:12 rupnil-codes

I also noticed that the mic is continuously sending binary messages (the mic audio, perhaps), but they are not being processed. So my initial thought that VACOnlineASRProcessor is buggy may be true. Here's the screenshot:

image

rupnil-codes avatar Dec 28 '24 04:12 rupnil-codes

Yes, I’ll take a look at it. I ran some tests, and it seems the VAC is indeed not stable. (The binary messages correspond to the mic audio. It’s normal to have more binary messages than responses, as this depends on the compute speed of your server)
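
To make the point about message counts concrete, here is a toy model in Python. It is not the project's processing loop, and all names and numbers are made up: the browser sends one audio chunk at a fixed interval, while the server handles everything that has accumulated as one batch and returns one response per batch.

```python
def simulate_stream(duration_ms: int, chunk_interval_ms: int,
                    processing_ms: int) -> tuple[int, int]:
    """Return (binary messages sent, responses received) for a toy stream."""
    sent = responses = pending = 0
    busy_until = None  # time at which the batch in progress finishes

    for now in range(duration_ms + 1):
        # The server finishes its current batch and sends one response.
        if busy_until is not None and now >= busy_until:
            responses += 1
            busy_until = None
        # The browser sends a new audio chunk.
        if now > 0 and now % chunk_interval_ms == 0:
            sent += 1
            pending += 1
        # An idle server picks up all pending chunks as a single batch.
        if busy_until is None and pending:
            pending = 0
            busy_until = now + processing_ms

    return sent, responses


if __name__ == "__main__":
    # 10 s of audio, one chunk every 250 ms, 1 s of compute per batch.
    print(simulate_stream(10_000, 250, 1_000))
```

The slower the compute per batch, the fewer responses come back for the same number of binary messages, so the ratio alone does not indicate a bug. A stream that stops producing responses altogether is a different matter.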

QuentinFuxa avatar Dec 28 '24 08:12 QuentinFuxa

Yep that's what I thought! Thanks a lot

rupnil-codes avatar Dec 28 '24 08:12 rupnil-codes

@QuentinFuxa By the way, would it be easier to use the Picovoice Cobra VAD instead? From what I saw, it is more accurate, faster, and more lightweight. If you want to look into it, see these links:

  • Demo: https://picovoice.ai/platform/cobra/
  • Docs: https://picovoice.ai/docs/cobra/

rupnil-codes avatar Dec 28 '24 08:12 rupnil-codes

@QuentinFuxa Any updates??

rupnil-codes avatar Dec 29 '24 16:12 rupnil-codes

@rupnil-codes Chrome blocked popups by default. Once I got Chrome to show popups, I switched back from wss to ws; now I get the popup and it runs.

SilasK avatar Dec 30 '24 16:12 SilasK

I also had the problem of a blocked microphone and managed to fix it. However, I am not 100% sure what I did to fix it, heh.

For me, on a Mac using Chrome (and also the Arc browser), the problem was that localhost is treated as insecure, so popups and the microphone are blocked. There is an error message, but it disappears too quickly (#5).

I think that allowing popups in the browser options and then restarting the browser and the server made the popups appear that let me allow access to the mic and so run the streaming pipeline.

I also made the change from ws to wss (https://github.com/QuentinFuxa/whisper_streaming_web/issues/1#issuecomment-2560981888), but I don't know if this was necessary to make the popup appear. Anyway, I undid the changes and am now using ws as before. @QuentinFuxa Thank you for the great pipeline, and especially for whisper-mlx!

SilasK avatar Dec 30 '24 21:12 SilasK