Long time interval when synthesizing Chinese text-to-speech
I have encountered an issue with the voice assistant when synthesizing Chinese text: the interval between the LLM output and the synthesized speech is noticeably longer for Chinese than for English. With English output, synthesis proceeds without any delay.
I noticed that TTS synthesis almost always starts only after the LLM output has fully completed, so I suspect the sentence tokenizer is the problem (see the quick check after the log excerpt below).
2024-10-16 21:01:46,321 - DEBUG livekit.agents.pipeline - synthesizing agent reply {"speech_id": "ed7b151599e7", "elapsed": 1.509}
2024-10-16 21:01:47,311 - DEBUG livekit.agents.pipeline - received first LLM token {"speech_id": "ed7b151599e7", "elapsed": 0.988}
2024-10-16 21:02:03,955 - DEBUG livekit.agents.pipeline - received first TTS frame {"speech_id": "ed7b151599e7", "elapsed": 16.644, "streamed": true}
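To show why I suspect the tokenizer, here is a quick check. This is a minimal sketch of my reasoning, assuming the default basic.SentenceTokenizer (the public wrapper around _basic_sent); the sample string is my own.

from livekit.agents.tokenize import basic

# The default splitter keys on English-style punctuation, so a Chinese
# reply likely comes back as one giant "sentence" -- and streaming TTS
# cannot start until that whole chunk is available.
tok = basic.SentenceTokenizer()
print(tok.tokenize("你好。今天天气很好。我们出去走走吧。"))
# English input splits into several sentences; the Chinese input above
# likely stays in a single chunk.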
Here's my code:
import asyncio
import logging

from livekit import rtc
from livekit.agents import (
    AutoSubscribe,
    JobContext,
    JobProcess,
    WorkerOptions,
    cli,
    llm,
)
from livekit.agents.pipeline import AgentTranscriptionOptions, VoicePipelineAgent
from livekit.plugins import nltk, openai, silero

logger = logging.getLogger("voice-assistant")
# OPENAI_BASEURL and MODEL_NAME point at my OpenAI-compatible backend (defined elsewhere)


def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    # System prompt: "You are a voice assistant. You interact with the user by
    # voice. Keep your answers short and clear, and use correct punctuation to
    # break text into sentences."
    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            """你是一个语音助手。你与用户的交互将通过语音进行。
你应该使用简短明了的回答,注意使用正确的标点符号断句。"""
        ),
    )

    logger.info(f"connecting to room {ctx.room.name}")
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # wait for the first participant to connect
    participant = await ctx.wait_for_participant()
    logger.info(f"starting voice assistant for participant {participant.identity}")

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=openai.STT(base_url=OPENAI_BASEURL, language="auto"),
        llm=openai.LLM(base_url=OPENAI_BASEURL, model=MODEL_NAME),
        tts=openai.TTS(base_url=OPENAI_BASEURL),
        transcription=AgentTranscriptionOptions(
            sentence_tokenizer=nltk.SentenceTokenizer(min_sentence_len=5)
        ),
        chat_ctx=initial_ctx,
    )
    agent.start(ctx.room, participant)

    chat = rtc.ChatManager(ctx.room)

    async def answer_from_text(txt: str):
        chat_ctx = agent.chat_ctx.copy()
        chat_ctx.append(role="user", text=txt)
        stream = agent.llm.chat(chat_ctx=chat_ctx)
        await agent.say(stream)

    @chat.on("message_received")
    def on_chat_received(msg: rtc.ChatMessage):
        logger.info(msg)
        if msg.message:
            asyncio.create_task(answer_from_text(msg.message))

    # "Hello, do you need my help?"
    await agent.say("你好,需要我的帮助吗?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))
I found https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/tokenize/_basic_sent.py. Maybe I need a Chinese version of it.
Hey, yes, _basic_sent would need to be edited. Ideally, _basic_sent would also work for Chinese.
Is ， the main character to look for when splitting sentences?
@zhanghx0905 are you interested in helping to make this better for Chinese? I think we'd need the Chinese period 。
I will make some attempts and see what I can do for this issue.
TEN-Agent was founded by a Chinese team. I checked their implementation, and it's not complicated: it calls TTS whenever a special punctuation symbol is matched.
self.sentence_expr = re.compile(r".+?[,，.。!！?？:：]", re.DOTALL)
https://github.com/TEN-framework/TEN-Agent/blob/41d1a263f910916930b43cecb5278d26883c6a71/agents/ten_packages/extension/qwen_llm_python/qwen_llm_extension.py#L39C8-L39C71
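For illustration, here is what that pattern does on a short mixed-language string (the sample text is my own):

import re

# Each lazy match consumes text up to the next half- or full-width
# clause-ending symbol, yielding one speakable chunk at a time.
sentence_expr = re.compile(r".+?[,，.。!！?？:：]", re.DOTALL)
print(sentence_expr.findall("你好，今天天气很好。Let's go!"))
# ['你好，', '今天天气很好。', "Let's go!"]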
I implemented similar logic for the livekit agent:
import re
from dataclasses import dataclass
from typing import List, Tuple

from livekit.agents.tokenize import token_stream, tokenizer

# split on half- and full-width clause-ending punctuation
_sentence_pattern = re.compile(r".+?[,，.。!！?？:：]", re.DOTALL)


@dataclass
class _TokenizerOptions:
    language: str
    min_sentence_len: int
    stream_context_len: int


class ChineseSentenceTokenizer(tokenizer.SentenceTokenizer):
    def __init__(
        self,
        *,
        language: str = "chinese",
        min_sentence_len: int = 10,
        stream_context_len: int = 10,
    ) -> None:
        self._config = _TokenizerOptions(
            language=language,
            min_sentence_len=min_sentence_len,
            stream_context_len=stream_context_len,
        )

    def tokenize(self, text: str, *, language: str | None = None) -> List[str]:
        return [sentence[0] for sentence in self.chinese_sentence_segmentation(text)]

    def stream(self, *, language: str | None = None) -> tokenizer.SentenceStream:
        return token_stream.BufferedSentenceStream(
            tokenizer=self.chinese_sentence_segmentation,
            min_token_len=self._config.min_sentence_len,
            min_ctx_len=self._config.stream_context_len,
        )

    def chinese_sentence_segmentation(self, text: str) -> List[Tuple[str, int, int]]:
        # yields (sentence, start_pos, end_pos) tuples, the shape
        # BufferedSentenceStream expects from its tokenizer callable
        result = []
        start_pos = 0
        for match in _sentence_pattern.finditer(text):
            sentence = match.group(0).strip()
            end_pos = match.end()
            if sentence:
                result.append((sentence, start_pos, end_pos))
            start_pos = end_pos
        # keep trailing text that has no terminating punctuation yet
        if start_pos < len(text):
            sentence = text[start_pos:].strip()
            if sentence:
                result.append((sentence, start_pos, len(text)))
        return result
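A quick sanity check (the sample string is my own):

tok = ChineseSentenceTokenizer()
print(tok.tokenize("你好，今天天气很好。我们出去走走吧！"))
# ['你好，', '今天天气很好。', '我们出去走走吧！']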
You can use this class as follows:
agent = VoicePipelineAgent(
    # ...
    transcription=AgentTranscriptionOptions(
        sentence_tokenizer=ChineseSentenceTokenizer(),
    ),
)
I have more questions.
How does WordTokenizer work? Should I implement a Chinese version?
What does preemptive_synthesis=True mean? @davidzhao
WordTokenizer is used to sync realtime transcriptions (so we'd emit them word by word). For Chinese, it's really just splitting out each unicode character by itself; see the sketch below.
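Here is a minimal sketch of that, assuming the same tokenizer.WordTokenizer base class and token_stream.BufferedWordStream helper that back the sentence tokenizer above (the ChineseWordTokenizer name and its _segment helper are my own, not part of the library):

from typing import List, Tuple

from livekit.agents.tokenize import token_stream, tokenizer


class ChineseWordTokenizer(tokenizer.WordTokenizer):
    """Emit each non-space character as its own "word" so realtime
    transcription can be forwarded character by character."""

    def tokenize(self, text: str, *, language: str | None = None) -> List[str]:
        return [seg[0] for seg in self._segment(text)]

    def stream(self, *, language: str | None = None) -> tokenizer.WordStream:
        return token_stream.BufferedWordStream(
            tokenizer=self._segment,
            min_token_len=1,
            min_ctx_len=1,
        )

    def _segment(self, text: str) -> List[Tuple[str, int, int]]:
        # (token, start, end) tuples, mirroring chinese_sentence_segmentation
        return [(ch, i, i + 1) for i, ch in enumerate(text) if not ch.isspace()]

If I'm reading AgentTranscriptionOptions right, this would plug in through its word_tokenizer field, the same way sentence_tokenizer does above.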
I think I've finally figured out what most of the parameters really mean, and here's my final solution: wrap the TTS in a StreamAdapter, so each sentence produced by the tokenizer is synthesized as soon as it completes rather than after the full LLM output.
from livekit.agents import tts as _tts

tts = _tts.StreamAdapter(
    tts=openai.TTS(base_url=OPENAI_BASEURL),
    sentence_tokenizer=ChineseSentenceTokenizer(min_sentence_len=10),
)
agent = VoicePipelineAgent(
    # ...
    tts=tts,
)