
Meeting translation pipeline suggestion

Open ILG2021 opened this issue 8 months ago • 4 comments

Question

I have an app for meeting translation: one participant speaks French and the other speaks English. When the agent hears French, it should translate it to English and speak the translation aloud; when it hears English, it should translate to French and speak that aloud. How can I implement this? I found it hard to implement with Gladia speech translation (to text) + TTS, or with Gemini Flash speech translation (to text) + TTS. With STT + LLM + TTS it is easy, because the STT can auto-detect the spoken language and report it on the TranscriptionFrame, but that costs time and tokens. I think the two-stage implementation is better, but it is hard to pin down the source and target languages. Any suggestions for this situation?

ILG2021 avatar May 05 '25 23:05 ILG2021

The trickiest part of the problem is detecting which language is being spoken in the first place. The best solution would be to select an STT service that can transcribe either French or English. Then, in your LLM step, you could have the LLM do the language swapping: when it hears English, output French and vice versa. It may also be helpful to have the LLM encode which language is being spoken into the text, e.g. with a language tag prefix. You can use that encoded text to determine which language and voice the TTS should use.

There's some complexity here, but it's very doable with Pipecat.
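The tagging idea above can be sketched in plain Python. This assumes the LLM is prompted to prefix every reply with a tag like `[en]` or `[fr]`; the tag format, voice IDs, and function name here are illustrative, not a Pipecat API:

```python
import re

# Hypothetical voice IDs per language; real IDs depend on the TTS provider.
VOICES = {"en": "english-voice-id", "fr": "french-voice-id"}

# The LLM is instructed (via its system prompt) to prefix each reply
# with a language tag such as "[en]" or "[fr]".
TAG_RE = re.compile(r"^\[(en|fr)\]\s*(.*)$", re.DOTALL)

def route_llm_output(text: str) -> tuple[str, str, str]:
    """Split a tagged LLM reply into (language, voice_id, clean_text)."""
    match = TAG_RE.match(text.strip())
    if not match:
        # Fall back to English if the LLM omitted the tag (an assumption).
        return "en", VOICES["en"], text.strip()
    lang, clean = match.group(1), match.group(2)
    return lang, VOICES[lang], clean
```

In a pipeline, a custom frame processor between the LLM and the TTS could apply this parsing to each text frame and switch the TTS voice before synthesis.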

markbackman avatar May 06 '25 01:05 markbackman

OK, it seems a three-stage architecture can't be avoided. If I know the user id of the person who wants English-to-French and the user id of the person who wants French-to-English, can I design two pipelines, one tracking the English speaker and the other tracking the French speaker?

ILG2021 avatar May 06 '25 08:05 ILG2021

One pipeline should suffice. You could also use a multimodal LLM like Gemini Live for this to take audio in and translate. I would imagine it will be up for the task.
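If each participant's language is known up front, a single pipeline can derive the translation direction per utterance instead of running two pipelines. A minimal sketch, where the user ids are placeholders:

```python
# Each participant's spoken language, keyed by user id (illustrative ids).
USER_LANG = {"user_fr": "fr", "user_en": "en"}

def translation_direction(speaker_id: str) -> tuple[str, str]:
    """Return (source_language, target_language) for the current speaker."""
    source = USER_LANG[speaker_id]
    # Flip the direction: French speakers are translated to English
    # and vice versa.
    target = "en" if source == "fr" else "fr"
    return source, target
```

The same lookup also tells the TTS which voice to use, so no per-user pipeline is needed.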

markbackman avatar May 06 '25 12:05 markbackman

Reluctantly, I'll grant that Gemini Live's TTS does support multiple languages. What's worse, though, is its forced websocket cutoff at 15 minutes (maybe 10 minutes in real-world testing), which is a pain to work with.

ILG2021 avatar May 07 '25 22:05 ILG2021

@ILG2021 we introduced translation support in the latest release

jqueguiner avatar Jul 08 '25 14:07 jqueguiner

Yes! GladiaSTTService has a really nice implementation that makes translation easy. Pipecat now yields a TranslationFrame in addition to the TranscriptionFrame when translation is enabled for that service. Check out the following example: https://github.com/pipecat-ai/pipecat/blob/main/examples/foundational/13c-gladia-translation.py
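To illustrate the idea, here is a simplified sketch of routing on the two frame types. These dataclasses are stand-ins, not the real Pipecat classes (which live in `pipecat.frames.frames` and carry additional fields); the point is that downstream processors can speak only translated text and ignore raw transcriptions:

```python
from dataclasses import dataclass

# Simplified stand-ins for Pipecat's frames (illustrative, not the real API).
@dataclass
class TranscriptionFrame:
    text: str
    language: str  # detected language of the original speech

@dataclass
class TranslationFrame:
    text: str
    language: str  # language of the translated text

def frames_to_speak(frames: list) -> list[tuple[str, str]]:
    """Collect (language, text) pairs that should go to TTS,
    keeping translations and skipping raw transcriptions."""
    return [
        (f.language, f.text)
        for f in frames
        if isinstance(f, TranslationFrame)
    ]
```

The language on each pair can then drive the voice selection, as discussed earlier in the thread.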

I'm going to close out this issue.

markbackman avatar Jul 08 '25 14:07 markbackman