agents Long time interval when synthesizing Chinese text-to-speech

I have encountered an issue with the voice assistant when synthesizing Chinese text. The time interval between LLM and synthesized speech outputs is noticeably longer when the output is in Chinese compared to English. This issue does not occur when the output is in English, where the speech synthesis proceeds without any delay.

I noticed that the TTS speech synthesis almost always starts only after the LLM output has fully completed. I think there is something wrong with tokenizers.

2024-10-16 21:01:46,321 - DEBUG livekit.agents.pipeline - synthesizing agent reply {"speech_id": "ed7b151599e7", "elapsed": 1.509}
2024-10-16 21:01:47,311 - DEBUG livekit.agents.pipeline - received first LLM token {"speech_id": "ed7b151599e7", "elapsed": 0.988}
2024-10-16 21:02:03,955 - DEBUG livekit.agents.pipeline - received first TTS frame {"speech_id": "ed7b151599e7", "elapsed": 16.644, "streamed": true}

That's my code,

def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()

async def entrypoint(ctx: JobContext):
    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            """你是一个语音助手。你与用户的交互将通过语音进行。
你应该使用简短明了的回答，注意使用正确的标点符号断句。"""
        ),
    )

    logger.info(f"connecting to room {ctx.room.name}")
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # wait for the first participant to connect
    participant = await ctx.wait_for_participant()
    logger.info(f"starting voice assistant for participant {participant.identity}")

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=openai.STT(base_url=OPENAI_BASEURL, language="auto"),
        llm=openai.LLM(base_url=OPENAI_BASEURL, model=MODEL_NAME),
        tts=openai.TTS(base_url=OPENAI_BASEURL),
        transcription=AgentTranscriptionOptions(sentence_tokenizer=nltk.SentenceTokenizer(min_sentence_len=5)),
        chat_ctx=initial_ctx,
    )

    agent.start(ctx.room, participant)
    chat = rtc.ChatManager(ctx.room)

    async def answer_from_text(txt: str):
        chat_ctx = agent.chat_ctx.copy()
        chat_ctx.append(role="user", text=txt)
        stream = agent.llm.chat(chat_ctx=chat_ctx)
        await agent.say(stream)

    @chat.on("message_received")
    def on_chat_received(msg: rtc.ChatMessage):
        logger.info(msg)
        if msg.message:
            asyncio.create_task(answer_from_text(msg.message))

    await agent.say("你好，需要我的帮助吗？", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))

Oct 16 '24 13:10 zhanghx0905

I found https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/tokenize/_basic_sent.py. Maybe I need a chinese version of it

Oct 16 '24 14:10 zhanghx0905

Hey yes, _basic_sent would need to be edited. I think in the best world _basic_sent also work for Chinese. Is ， the main character to look for when splitting sentences?

Oct 16 '24 18:10 theomonnom

@zhanghx0905 are you interested in helping to make this better for Chinese? I think we'd need chinese period 。

Oct 16 '24 18:10 davidzhao

@zhanghx0905 are you interested in helping to make this better for Chinese? I think we'd need chinese period 。

I will make some attempts and see what I can do for this issue.

Oct 17 '24 03:10 zhanghx0905

TEN-Agent is founded by a Chinese team. I checked their implementation, and it looks not complicated, calling TTS when special symbol is matched.

    self.sentence_expr = re.compile(r".+?[,，.。!！?？:：]", re.DOTALL)

https://github.com/TEN-framework/TEN-Agent/blob/41d1a263f910916930b43cecb5278d26883c6a71/agents/ten_packages/extension/qwen_llm_python/qwen_llm_extension.py#L39C8-L39C71

I implemented similar logic for the livekit agent:

import functools
import re
from dataclasses import dataclass
from typing import List, Tuple

from livekit.agents.tokenize import token_stream, tokenizer

_sentence_pattern = re.compile(r".+?[,，.。!！?？:：]", re.DOTALL)


@dataclass
class _TokenizerOptions:
    language: str
    min_sentence_len: int
    stream_context_len: int


class ChineseSentenceTokenizer(tokenizer.SentenceTokenizer):
    def __init__(
        self,
        *,
        language: str = "chinese",
        min_sentence_len: int = 10,
        stream_context_len: int = 10,
    ) -> None:
        self._config = _TokenizerOptions(
            language=language,
            min_sentence_len=min_sentence_len,
            stream_context_len=stream_context_len,
        )

    def tokenize(self, text: str, *, language: str | None = None) -> List[str]:
        sentences = self.chinese_sentence_segmentation(text)
        return [sentence[0] for sentence in sentences]

    def stream(self, *, language: str | None = None) -> tokenizer.SentenceStream:
        return token_stream.BufferedSentenceStream(
            tokenizer=functools.partial(self.chinese_sentence_segmentation),
            min_token_len=self._config.min_sentence_len,
            min_ctx_len=self._config.stream_context_len,
        )

    def chinese_sentence_segmentation(self, text: str) -> List[Tuple[str, int, int]]:
        result = []
        start_pos = 0

        for match in _sentence_pattern.finditer(text):
            sentence = match.group(0)
            end_pos = match.end()
            sentence = sentence.strip()
            if sentence:
                result.append((sentence, start_pos, end_pos))
            start_pos = end_pos

        if start_pos < len(text):
            sentence = text[start_pos:].strip()
            if sentence:
                result.append((sentence, start_pos, len(text)))

        return result

You can use this class as following,

    agent = VoicePipelineAgent(
        # ...
        transcription=AgentTranscriptionOptions(
            sentence_tokenizer=ChineseSentenceTokenizer(),
        ),
    )

Oct 26 '24 14:10 zhanghx0905

I have more questions. How does WordTokenizer work? Should I implement a Chinese version？ What does preemptive_synthesis=True mean? @davidzhao

Oct 26 '24 16:10 zhanghx0905

WordTokenizer is used to sync realtime transcriptions (so we'd emit it word by word). For Chinese, it's really just splitting each unicode char by itself.

Oct 28 '24 01:10 davidzhao

I think I've finally figured out what most of the parameters really mean, and here's my final solution

from livekit.agents import tts as _tts

    tts = _tts.StreamAdapter(
        tts=openai.TTS(base_url=OPENAI_BASEURL),
        sentence_tokenizer=ChineseSentenceTokenizer(min_sentence_len=10),
    )
    agent = VoicePipelineAgent(
       #...
        tts=tts,
    )

Oct 28 '24 13:10 zhanghx0905

agents agents copied to clipboard

Long time interval when synthesizing Chinese text-to-speech

agents
agents copied to clipboard