
Live transcription followup

julien-nc opened this issue 8 months ago • 12 comments

Hey @nickvergessen, we are halfway there implementing the live transcription exApp.

The app is connecting to the signaling server, authenticating, receiving the list of participants, getting audio streams, transcoding the streams to feed them to the transcription engine, producing transcriptions and sending them as Talk chat messages (for now). We are still in the process of making the app react to people joining and leaving calls, managing all the parallel transcription subprocesses, etc.

We now have a clearer idea of how Talk and the live transcription app can interact.

When a participant joins a call

On the UI side, users could be given a way to choose whether they want to receive/see the transcriptions, with a checkbox somewhere. Maybe in the "media settings" modal, like the call recording consent. As you wish.

If the user wants to see transcriptions, Talk can send a request to the live_transcription exApp on its /transcribeCall endpoint with those params:

  • roomToken
  • sessionId

Here is an example of how to make a request to an exApp endpoint from NC's backend: https://github.com/nextcloud/context_chat/blob/2ea9768bec56d0ea3dbe1551d3680b77f6ea48f4/lib/Service/LangRopeService.php#L123-L130
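
For illustration, here is a minimal sketch of what the receiving side of that endpoint could look like in the exApp (assuming a FastAPI-based Python ex-app; the request model and the commented-out helper are hypothetical, not the actual implementation):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TranscribeCallRequest(BaseModel):
    # Parameters sent by Talk's backend, as proposed above
    roomToken: str  # token of the Talk conversation to transcribe
    sessionId: str  # signaling session ID of the participant requesting transcriptions

@app.post("/transcribeCall")
async def transcribe_call(request: TranscribeCallRequest):
    # Hypothetical helper: connect to the signaling server for this room (if not
    # done already) and register this session as a transcription subscriber.
    # start_or_attach_transcription(request.roomToken, request.sessionId)
    return {"ok": True}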

The transcription messages

The exApp will send signaling messages to all participants who requested the transcriptions. We are flexible on the format of those messages. The simplest would be:

{
    "sessionId": "blabla",
    "transcriptionMessage": "there you go"
}

We have not fully decided whether to send intermediate transcriptions or only the ones considered "definitive". Let's assume the app will only send definitive transcriptions. If we change that, we can always add a type (intermediate/definitive) attribute to the signaling messages.

Questions

  • Are you fine with all that?
  • Did we forget something?
  • Are there some decisions we should make all together?

julien-nc avatar May 07 '25 12:05 julien-nc

We have not fully decided whether to send intermediate transcriptions or only the ones considered "definitive". Let's assume the app will only send definitive transcriptions. If we change that, we can always add a type (intermediate/definitive) attribute to the signaling messages.

What would be the idea/use of it? Either it's shown to the user or not? :P There is no "maybe show it to the user"?

By "participantId" you mean the sessionId, so you can send signaling messages, right? Otherwise this needs clarifying as the attendee id is currently not meant to be used by normal users.

nickvergessen avatar May 07 '25 14:05 nickvergessen

What would be the idea/use of it? Either it's shown to the user or not? :P There is no "maybe show it to the user"?

It would be to display the words as soon as they are generated by the transcription engine. Forget it. We will only send full and definitive sentences.

By "participantId" you mean the sessionId, so you can send signaling messages, right? Otherwise this needs clarifying as the attendee id is currently not meant to be used by normal users.

Yes, sorry, we need the sessionId and we'll use it to send signaling messages. And in the messages, we will identify who talked with the sessionId as well.

julien-nc avatar May 07 '25 14:05 julien-nc

If the user wants to see transcriptions, Talk can send a request to the live_transcription exApp on its /transcribeCall endpoint with those params:

  • roomToken
  • sessionId

There should be an additional parameter to enable or disable the transcription, as the user may want to turn it on or off during the call.

Independently of that, a little detail to keep in mind: although extremely unlikely, due to the asynchronous nature of the messages it could happen that a participant joins the call and enables transcriptions, the transcription service starts, and the first list of participants in the call that it receives from the signaling server does not yet contain that participant. If the transcription service is shut down when none of the participants that requested a transcription are in the call (which from my point of view would be the expected behaviour), you may want to add a delay or something like that to ensure that the transcription service is not shut down too early.

Similarly, the opposite could potentially happen too: a participant enables transcriptions but leaves the call before the transcription service receives a list of participants in the call that includes that participant. I do not know if that could be relevant or not to the shutdown logic that you implement, but I mention it just in case.

We are flexible on the format of those messages. The simplest would be:

{
    "sessionId": "blabla",
    "transcriptionMessage": "there you go"
}

For consistency with other signaling messages it should also include a "type": "transcription" attribute (next to sessionId and transcriptionMessage in that payload, independently of the outer "type": "message").

danxuliu avatar May 08 '25 11:05 danxuliu

There should be an additional parameter to enable or disable the transcription, as the user may want to turn it on or off during the call.

Good point, we missed that this request itself would convey that info too.

Nice of you to mention that, yes, a timeout would definitely make sense here so we don't waste time restarting the service. It will also be useful when participants rejoin, which usually happens within a relatively short period of time. We would have a similar kind of timeout for the transcription server too.
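
To illustrate the kind of grace period we have in mind, a minimal sketch (Python/asyncio; the delay value and helper names are made up for the example):

import asyncio

SHUTDOWN_GRACE_SECONDS = 30  # assumed value, to be tuned


class CallTranscription:
    """Transcription of a single call; shut down only after a grace period."""

    def __init__(self):
        self._shutdown_task: asyncio.Task | None = None

    def on_no_subscribers_left(self):
        # Nobody who requested transcriptions is in the call any more:
        # schedule a delayed shutdown instead of stopping immediately.
        if self._shutdown_task is None:
            self._shutdown_task = asyncio.create_task(self._delayed_shutdown())

    def on_subscriber_joined(self):
        # A (re)joining participant cancels the pending shutdown.
        if self._shutdown_task is not None:
            self._shutdown_task.cancel()
            self._shutdown_task = None

    async def _delayed_shutdown(self):
        await asyncio.sleep(SHUTDOWN_GRACE_SECONDS)
        self.stop()

    def stop(self):
        # Hypothetical: release the recognizers and leave the call.
        pass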

Similarly, the opposite could potentially happen too: a participant enables transcriptions but leaves the call before the transcription service receives a list of participants in the call that includes that participant. I do not know if that could be relevant or not to the shutdown logic that you implement, but I mention it just in case.

Maybe we can consider this a rare occurrence so as not to impact performance/efficiency on most systems. Also, the timeout for leaving the room can be made shorter to help with this.

For consistency with other signaling messages it should also include a "type": "transcription" attribute (next to sessionId and transcriptionMessage in that payload, independently of the outer "type": "message").

Thanks for pointing that out!

kyteinsky avatar May 12 '25 10:05 kyteinsky

It would be nice if the Talk team could start implementing the feature in parallel in the Talk UI as well. We have a working prototype with all the basic functions in place.

This PHP snippet can be used to call the transcription endpoint in the "live_transcription" ex-app, where sessionId is the session ID of the participant requesting the transcriptions and enable is the option to enable/disable transcriptions, as discussed here. An example with checks can be found here: https://github.com/nextcloud/context_chat/blob/2ea9768bec56d0ea3dbe1551d3680b77f6ea48f4/lib/Service/LangRopeService.php#L46

use Psr\Container\ContainerExceptionInterface;
use Psr\Container\NotFoundExceptionInterface;

try {
	$appApiFunctions = \OCP\Server::get(\OCA\AppAPI\PublicFunctions::class);
} catch (ContainerExceptionInterface|NotFoundExceptionInterface $e) {
	throw new \RuntimeException('Could not get AppAPI public functions');
}

$params = [
	'roomToken' => $roomToken, // string: token of the Talk conversation
	'sessionId' => $sessionId, // string: session ID of the participant requesting the transcriptions
	'enable' => $enable,       // bool: enable or disable the transcription
];
$response = $appApiFunctions->exAppRequest(
	'live_transcription', // $appId
	'/transcribeCall',    // $route
	null,                 // $userId
	'POST',               // $method
	$params,              // $params
	                      // $options (omitted)
);

The transcriptions can then be received through WebSockets as signaling messages in this format:

{
    "type": "message",
    "message": {
        "sender": {
            "type": "session",
            "sessionid": "7clbmRBx1h3SRsEaN0JOYyzmn49FW5aCwTG1wzaRsg98PT13bDM1Vnp0ZExxZDJfNnllMV9qVW1qdWRDa1BJUi1GTmxENlVUV3JhT2RIQnEwS05FMkR0aFhPSzZyfDUyNjgzODc0NzE="
        },
        "data": {
            "message": "can you see my screen",
            "type": "transcript",
            "speakerSessionId": "LbUfM8FAKjSLi_YtURITJXzJeViF9wKBfae0lKQ4Luh8PT13TXdhanZCRGVGanJoOEhmcHk4dHVvV0NhTk5tRi1IZHVsSkc1b2RtZXRVNElmTlE4ajA5UlN5bmczfDMwNjgzODc0NzE="
        }
    }
}

The message does not contain any punctuation, so in this state you might want to take that into account while crafting the UI.
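
For illustration, the data part of such a message could be built on the exApp side roughly like this (a sketch only; how the payload is wrapped and delivered through the signaling connection, and whether the sender block is added by the signaling server, is outside the scope of this snippet):

def build_transcript_data(speaker_session_id: str, text: str) -> dict:
    # The "data" object shown above; the surrounding "message"/"sender" envelope
    # is presumably added when the message is relayed by the signaling server.
    return {
        "message": text,                         # the transcribed sentence (no punctuation yet)
        "type": "transcript",
        "speakerSessionId": speaker_session_id,  # session of the participant who spoke
    }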

kyteinsky avatar May 21 '25 15:05 kyteinsky

Hello, a small status update. The performance of transcript generation with multiple audio streams on CPU is not ideal. The Vosk app does not allow running the batch model on CPU. For GPU (tested with NVIDIA), it seems performant in initial tests outside of the live transcription app. We'll be integrating the GPU mode soon to test it formally.

kyteinsky avatar May 29 '25 08:05 kyteinsky

Hello again, the performance issues have been fixed; it now works well on both CPU and GPU, with a real-time factor of ~4x (11 seconds of audio transcribed in 2.66 seconds).

There is an issue where we would like the Talk and Design teams' feedback. The Vosk server, our transcription engine, supports multiple languages (https://alphacephei.com/vosk/models), but each language has a different model file. How do we decide which model to load?

  1. It can be picked up from the Nextcloud UI's language - A person might choose to speak one language in one call and another in a different call without changing the UI's language or the language set in the settings.
  2. We can use a settings option for the speaker to select the language in the dialog before the call
  3. We can detect it from the initial few seconds of the audio - If this is not accurate, the whole transcript from that speaker is broken

So far, (2) seems like the best option to us. What do you think?

cc @marcoambrosini

kyteinsky avatar Jul 03 '25 12:07 kyteinsky

Regarding options 1 and 2, although it would make sense that the speakers themselves are the ones specifying the language that they are speaking, it would also be a bit strange, because they would specify it so others can see transcriptions of what they say, while it provides no "benefit" for them (of course it could be seen just as a courtesy toward other participants, but still :-P ).

For example, in our company cloud I would set English as the default language, but in a 1-1 with a Spanish colleague I would need to manually change it to Spanish, even if I do not know whether the other participant will enable the transcription or not.

In that sense it might make more sense if the participants that enable the transcription set the language, but then it would need to be decided what happens if one participant enables the transcription in one language and another participant, by mistake, enables it in a different language (that is not even being spoken in the conversation). Of course they might fix it and choose the right language later on, but what to do in the meantime?

Moreover, what would happen in multilingual calls? In some cases there could be some participants speaking one language, some participants speaking another language, and then some participants speaking both and acting as translators. For the participants speaking just one language, enabling the transcription in the language that they speak would be fine, even if the transcriptions are totally wrong when the other language is being spoken. But for the participants that speak both languages it would be messy, as they would get the right transcriptions only in a single language. In any case, in multilingual calls, for participants that may speak more than one language in the same call it would be problematic even if the speakers themselves set the language rather than the participants enabling the transcription :shrug:

But this should be a less common scenario, so maybe it would be something just to keep in mind and address in a future version. Also because transcribing participants that may speak two different languages in the same call seems to be problematic no matter the approach taken :-) (unless there is a cool model that can dynamically and automatically switch between languages ;-) )

danxuliu avatar Jul 04 '25 11:07 danxuliu

It can be picked up from the Nextcloud UI's language - A person might choose to speak one language in one call and another in a different call without changing the UI's language or the language set in the settings.

Reality here: I'm in calls where people speak German until a person joins who does not speak German, and then they switch mid-sentence to English 🙈

We can use a settings option for the speaker to select the language in the dialog before the call

Would need to check with @nimishavijay but generally this sounds like a bad idea.

nickvergessen avatar Jul 04 '25 11:07 nickvergessen

Agreed with @nickvergessen that many times people switch from German to English after I join (thank you 🙈)

  1. It can be picked up from the Nextcloud UI's language - A person might choose to speak one language in one call and another in a different call without changing the UI's language or the language set in the settings.
  2. We can use a settings option for the speaker to select the language in the dialog before the call
  3. We can detect it from the initial few seconds of the audio - If this is not accurate, the whole transcript from that speaker is broken

The best would always be (3), auto-detecting reliably, with a smart default to fall back to if the auto detection is not reliable.

Even with any of these solutions, it seems like there is still an issue with multilingual calls, as it is not possible to switch models while a call is going on, if I understand correctly?

Ideally it auto-detects the change and switches models, but if it is not at all possible to switch models mid-call, as a last resort we can offer a user setting to switch the language while a call is going on, so that people are not stuck with the wrong model for a 1+ hour call. Not ideal, but better than forcing people to speak only a certain language for the sake of a computer.

nimishavijay avatar Jul 04 '25 12:07 nimishavijay

I just communicated 1:1 to Anupam that I would be fine with a first version of live transcription that limits the application to one language, configurable by the administrator. We would make sure to document this limitation and work on resolving it in future releases. If you can all agree on an easy solution to support more languages, that is of course fine to implement directly. I suppose auto detection is not an easy solution, however.

DaphneMuller avatar Jul 08 '25 13:07 DaphneMuller

We discussed this issue and the concerns with the integrations team in our AI call. The solution seems to address most of the points while keeping the implementation simple. For some context on the Vosk transcription engine: after some more testing we discovered that loading multiple language models at the same time does not consume more GPU memory, just RAM, which is relatively much cheaper. GPU memory is initially filled with the transcription engine and then with the concurrent recognizers for each audio stream. An individual recognizer is connected to one language model and one audio stream and can be reused, but two concurrently running recognizers each take their own space.

The big benefit is that we can load multiple models without affecting GPU memory, so switching languages mid-call would be seamless if we use a smaller model. Load time for the large English model is 7.3 s, and for the small models it is < 0.5 s. The same language model can even be shared between different calls and different speaking participants. The number of concurrent recognizers/audio streams being transcribed can be limited to a maximum so that VRAM/RAM usage is kept in check.
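
As a rough illustration of the model sharing and concurrency limiting described here, a sketch using the Vosk Python bindings (model paths, the sample rate and the concurrency limit are assumptions; GPU initialization is omitted):

from concurrent.futures import ThreadPoolExecutor
from vosk import KaldiRecognizer, Model

MAX_CONCURRENT_STREAMS = 10  # caps the number of recognizers and thus VRAM/RAM usage (assumed)
SAMPLE_RATE = 16000          # sample rate the call audio is resampled to (assumed)

# One model per language, shared between calls and speaking participants.
_models: dict[str, Model] = {}

def get_model(language: str, model_path: str) -> Model:
    # Loading only costs RAM, not VRAM; the small models load in < 0.5 s.
    if language not in _models:
        _models[language] = Model(model_path)
    return _models[language]

# The thread pool bounds how many audio streams are transcribed concurrently.
_pool = ThreadPoolExecutor(max_workers=MAX_CONCURRENT_STREAMS)

def transcribe_stream(language: str, model_path: str, audio_chunks) -> list[str]:
    # One recognizer per audio stream; recognizers are what take extra GPU memory.
    recognizer = KaldiRecognizer(get_model(language, model_path), SAMPLE_RATE)
    results = []
    for chunk in audio_chunks:  # chunks of 16-bit mono PCM bytes
        if recognizer.AcceptWaveform(chunk):
            results.append(recognizer.Result())  # a "definitive" sentence (JSON string)
    results.append(recognizer.FinalResult())
    return results

# Example: future = _pool.submit(transcribe_stream, "en", "/models/vosk-model-en-us-0.22", chunks)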

The main issue of model switching when one or more people change their language can be solved by having a room-wide language setting, since it is rarely the case that two people talk in two different languages in the same room. We can look into supporting that case if there are more instances of it happening. The option to change this can be in the top bar, near the mic icon and the other options, controllable by the moderators. All other users would see the option as read-only information. This should address Daniel's concern about handing control to the speaker/participant, and Joas' too, since the option can be changed when the spoken language changes in the room, with a moderator changing the setting for the whole room. Individual participants can choose to receive transcriptions via a different button, or the same button could be toggleable.

As Daphne said and we concluded in our AI call, no matter how we slice it, auto detection invites complex implementations and would still fail to work in many cases. If the user has to reach for the manual option often, then making that the only option makes more sense.


Some technical details about Vosk in case someone is interested:

Specific language model    Uncompressed size
English (large)            2.7 GiB
English (small/lgraph)     205 MiB
Hindi (small)              79 MiB
German (small)             92 MiB
French (small)             66 MiB
Spanish (small)            58 MiB

GPU: NVIDIA 4060 Ti 16 GiB
CPU: i5 10400 with 6 threads
RAM: 14 GiB
Environment: the program runs in Docker in an LXC in Proxmox

VRAM remains constant when loading any number of models; it only holds the main logic and the recognizers for each audio stream, up to the maximum number of threads. All the loaded models take up space in RAM. Stress testing with 200 requests, (1) sequentially and (2) in batches of 20 concurrent requests, revealed that there is absolutely no memory leak in VRAM; for RAM the results were inconclusive, and we might need real-world testing to confirm this.

We use a thread pool to run the transcription; it limits the concurrency of the transcription process and the maximum VRAM/RAM used. The initial load takes up 8066 MiB of VRAM regardless of the number and sizes of the models loaded. The first connection takes an additional 44 MiB of VRAM, 4 connections take an additional 180 MiB (45 MiB per thread on average), and 10 connections take an additional 542 MiB (54.2 MiB per thread on average).

^^ These values sometimes vary slightly with concurrent connections but don't fluctuate after that, unless more concurrent connections are served.

RAM is generally plentiful in servers, but to give a general idea: with GPU, RAM usage when

  1. English (large) is loaded: 2.9 GiB; with one connection using the model: 2.95 GiB
  2. all the above listed models are loaded: 3.5 GiB; with one connection using all the models: 3.9 GiB

Without GPU, RAM usage when

  1. English (large) is loaded: 5.0 GiB; with one connection using the model: 5.1 GiB
  2. all the above listed models are loaded: 5.7 GiB; with one connection using all the models: 6.05 GiB

Official values for 1000 concurrent sessions: "10 CPU servers of 48 cores each or 5 GPU servers with RTX4090" https://github.com/alphacep/vosk-server/issues/259#issuecomment-2275002718

kyteinsky avatar Jul 11 '25 06:07 kyteinsky