
feat: add Whispering support


This PR adds support for Whispering, a streaming transcription server based on OpenAI's Whisper.

Whispering's advantage over VOSK is that it supports detection and transcription of multiple languages.

The Whispering Transcription service uses WebSockets to communicate with the Whispering server.

This is still a WIP, as we still need to fix a sample rate incompatibility between Whispering and Jigasi. Right now, we have to set `EXPECTED_AUDIO_LENGTH` to 25600. We also have to change https://github.com/shirayu/whispering/blob/256bf38b4d3d751e1eac8116f0f7da07e1b9652f/whispering/serve.py#L69 to `audio = np.frombuffer(message, dtype=np.int64)`.
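For illustration, here is a minimal Python sketch of what the client side of this exchange looks like (the actual Jigasi service is written in Java; the server URL, the file-based input, and the one-result-per-chunk framing are assumptions for the example, not the PR's protocol):

```python
# Minimal sketch of streaming PCM audio to a Whispering server over a
# WebSocket. Illustrative only: the real Jigasi client is Java, and the
# URL, input source, and one-result-per-chunk framing are assumptions.
import asyncio

import websockets  # pip install websockets

EXPECTED_AUDIO_LENGTH = 25600  # bytes per frame, per the workaround above


async def stream_audio(pcm_path: str, url: str = "ws://localhost:8000") -> None:
    async with websockets.connect(url) as ws:
        with open(pcm_path, "rb") as f:
            while chunk := f.read(EXPECTED_AUDIO_LENGTH):
                await ws.send(chunk)      # raw PCM frame
                result = await ws.recv()  # transcription result (JSON text)
                print(result)


asyncio.run(stream_audio("meeting.pcm"))
```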

charles-zablit · Oct 14 '22

Some questions:

  1. Have you tested if real-time transcription is feasible?
  2. What model (tiny/base/etc) are you planning on running?
  3. On what GPU are you planning to run this?
  4. Any thoughts on serving transcriptions from one machine to multiple meetings?

nikvaessen · Oct 14 '22

> Have you tested if real-time transcription is feasible?

It works just as fast as VOSK; however, it only starts transcribing after the sentence ends. It does not produce partial results, which might make it look slow.
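To make the difference concrete, here are representative result messages (illustrative, not captured output): a VOSK-style engine emits interim partial results while audio streams in, whereas this Whispering setup returns one final result per utterance.

```python
# Representative transcription messages (illustrative, not captured output).
vosk_results = [
    {"partial": "hello"},        # interim result, updated as audio arrives
    {"partial": "hello every"},
    {"text": "hello everyone"},  # final result at the end of the utterance
]

whispering_results = [
    {"text": "hello everyone"},  # only a final result, after the pause
]
```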

> What model (tiny/base/etc) are you planning on running?

Currently we have tested both medium and large, with very good performance.

> On what GPU are you planning to run this?

We have run our tests on an OVH t1-45 instance, so an NVIDIA Tesla V100.

> Any thoughts on serving transcriptions from one machine to multiple meetings?

We have not tested that yet, but it seems that Whispering supports multiple connections. GPU usage is around 30% on our OVH instance for one connection, so roughly three concurrent connections per GPU should be doable.

If you are interested, we plan to present our findings at today's Jitsi community call.

charles-zablit · Oct 17 '22

Codecov Report

Merging #454 (e645d49) into master (dda0721) will decrease coverage by 0.76%. The diff coverage is 0.00%.

:exclamation: The current head e645d49 differs from the pull request's most recent head d8f88ae. Consider uploading reports for commit d8f88ae to get more accurate results.

Additional details and impacted files


```diff
@@             Coverage Diff              @@
##             master     #454      +/-   ##
============================================
- Coverage     23.15%   22.39%   -0.77%
  Complexity      304      304
============================================
  Files            69       70       +1
  Lines          5812     6006     +194
  Branches        790      804      +14
============================================
- Hits           1346     1345       -1
- Misses         4235     4430     +195
  Partials        231      231
```
| Impacted Files | Coverage Δ |
|---|---|
| ...rc/main/java/org/jitsi/jigasi/AbstractGateway.java | 68.60% <0.00%> (-11.13%) :arrow_down: |
| .../java/org/jitsi/jigasi/AbstractGatewaySession.java | 63.49% <0.00%> (-4.31%) :arrow_down: |
| src/main/java/org/jitsi/jigasi/JvbConference.java | 44.28% <0.00%> (-1.39%) :arrow_down: |
| src/main/java/org/jitsi/jigasi/Main.java | 22.09% <0.00%> (-1.66%) :arrow_down: |
| ...c/main/java/org/jitsi/jigasi/rest/HandlerImpl.java | 0.00% <0.00%> (ø) |
| ...jigasi/transcription/VoskTranscriptionService.java | 0.00% <0.00%> (ø) |
| .../transcription/WhisperingTranscriptionService.java | 0.00% <0.00%> (ø) |
| ...in/java/org/jitsi/jigasi/sounds/PlaybackQueue.java | 54.38% <0.00%> (-1.76%) :arrow_down: |
| .../jitsi/jigasi/sounds/SoundNotificationManager.java | 29.62% <0.00%> (+0.41%) :arrow_up: |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 4964b52...d8f88ae.

codecov[bot] · Oct 17 '22

@charles-zablit do you have a plan to finish this?

It would be a great feature, as I think Whisper is currently the best open-source STT. I would like to use it for meeting notes.

davidak · Mar 22 '23

Hi @charles-zablit @nikvaessen, just wondering what happened to this particular Whisper-related Jigasi integration (which is about a year old)?

Rummaging through the current codebase, I see a file called https://github.com/jitsi/jigasi/blob/master/src/main/java/org/jitsi/jigasi/transcription/WhisperTranscriptionService.java which appears to be connected to PR #491,

and I see that, although it's not mentioned in the README (which makes reference to Google Cloud, Vosk, and LibreTranslate), there is now some recent code to link transcription to some sort of Whisper system. In contrast to what Charles was doing, the current code says it targets 'a custom Whisper server' without any details, and there doesn't seem to be any documentation about what it is or how to set it up. Charles, over a year ago, was just about ready with something that would use Whispering (which is MIT licensed): https://github.com/shirayu/whispering/. Unfortunately the PR now has conflicts, and the Whispering project has been archived by its original author, given the availability of newer Whisper systems such as whisper.cpp, which works with CPU inference as well as GPU.

Is there any chance we could still have the Whispering PR integrated, since it uses Whisper from an open service as opposed to whatever is now in the codebase? If we had an example, it might be possible to adapt it to suit one of the newer Whisper implementations available these days. I've also seen some scripts which, given multiple channels, will do some rough diarization so that the transcript incorporates multiple named speakers.
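(For anyone wanting to try the PR: Jigasi selects its transcription backend via a property in sip-communicator.properties, as the README documents for Vosk. A sketch, using the WhisperingTranscriptionService class name that appears in this PR's diff; verify against the README before relying on it:)

```properties
# sip-communicator.properties sketch: point Jigasi's transcriber at the
# Whispering service from this PR. The class name is taken from the PR
# diff; the property key is the one the README documents for Vosk.
org.jitsi.jigasi.transcription.customService=org.jitsi.jigasi.transcription.WhisperingTranscriptionService
```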

Many thanks for your work on all of this.

Best, M.

cryolite-ai · Oct 23 '23

> in the current code it says 'a custom Whisper server' without any details, and there doesn't seem to be any documentation about what/how to set it up...

Where do you see this?

damencho · Oct 23 '23

The link to the source file was in my last post; here it is again:

> Rummaging through the current codebase, I see a file called https://github.com/jitsi/jigasi/blob/master/src/main/java/org/jitsi/jigasi/transcription/WhisperTranscriptionService.java which appears to be connected to PR https://github.com/jitsi/jigasi/pull/491

See line 27...

[screenshot of WhisperTranscriptionService.java, line 27]

cryolite-ai · Oct 24 '23

> Is there any chance we could still have the Whispering PR integrated, since it uses Whisper from an open service as opposed to whatever is now in the codebase? [...]

Hi,

We are still at a very early stage with our own Whisper live transcription implementation. We plan to make it open source in the not-so-distant future.

Cheers, Razvan

rpurdel · Oct 25 '23

@charles-zablit @nikvaessen @damencho

The Whisper live transcription server is now open source under the jitsi/skynet project. It should work out of the box with Jigasi.

rpurdel · Feb 08 '24