TTS
Error when passing a custom list with strings as text when `split_sentences=False`.
Describe the bug
When passing a list of pre-split sentences (produced by a custom split function) as `text`, the TTS model (`tts_models/multilingual/multi-dataset/xtts_v2`, to be specific) with `split_sentences=False` throws the following error:
sent = sent.strip().lower()
^^^^^^^^^^
AttributeError: 'list' object has no attribute 'strip'
After some fiddling around in the TTS code, I noticed that the `synthesizer.tts` function (in `TTS.utils`) always assumes the input is a string, not a list of strings (which is essential when a custom split function needs to be used). This is the case whether `split_sentences` is `False` or `True`, even though a list of strings is not expected with `split_sentences=True`, since the splitting is then done internally.
To Reproduce
from TTS.api import TTS
import torch
# This example list of strings is normally generated by a custom splitting function.
example = [ "This is a sample sentence.", "Another sample sentence." ]
dev = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(dev)
tts.tts_to_file(text=example, speaker_wav="any/sample/wav/here", language="en", file_path="test.wav", split_sentences=False)
Expected behavior
I expect a list of strings passed as `text` to be accepted when `split_sentences=False` is used.
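Until lists are accepted, one possible workaround is to synthesize each pre-split sentence on its own and concatenate the results. The sketch below is only an illustration: `synthesize_sentences`, `synthesize` and `pause_samples` are hypothetical names, and it assumes the synthesis call returns a flat sequence of audio samples for a single string.

```python
from typing import Callable, Iterable, List, Sequence


def synthesize_sentences(
    sentences: Iterable[str],
    synthesize: Callable[[str], Sequence[float]],
    pause_samples: int = 0,
) -> List[float]:
    """Call `synthesize` once per sentence and join the waveforms.

    `synthesize` is a hypothetical callable that maps one string to a
    sequence of samples; `pause_samples` zeros are appended after each
    sentence as a short silence.
    """
    wav: List[float] = []
    for sentence in sentences:
        wav.extend(synthesize(sentence))
        wav.extend([0.0] * pause_samples)
    return wav
```

With the reproduction script above, `synthesize` could be something like `lambda s: tts.tts(text=s, speaker_wav="any/sample/wav/here", language="en", split_sentences=False)`, assuming `tts.tts` returns the samples for one string; the joined waveform would then still have to be written to a file separately.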
Logs
No response
Environment
- TTS version: 0.22.0
- Pytorch version: 2.3.0
- Python version: 3.11.9
- OS: Win 11
- CUDA version: 12.1
- installed pytorch using `python -m pip install torch==2.3.0 torchaudio==2.3.0 -i https://download.pytorch.org/whl/cu121`
Additional context
An idea on how to solve this issue:
1- Use the tokenizer to check how many tokens the text contains (assuming the `text` argument is a string). If the text doesn't fit within the maximum context accepted by the model, split it using a custom provided function (in case of `split_sentences=False`) or using the existing internal sentence-splitting function (in case of `split_sentences=True`). So `tts_to_file`, or any other function used to synthesize speech, should accept a parameter called e.g. `custom_split_fn` for the `split_sentences=False` case. With this approach, `text` can always stay a string.
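A rough sketch of idea 1, written as a standalone function rather than as a patch to the library. Everything here is hypothetical: `plan_chunks`, `count_tokens`, `max_tokens`, `default_split_fn` and `custom_split_fn` are illustrative names, and the token counter and splitters would have to be wired to the model's real tokenizer and the existing internal splitter.

```python
from typing import Callable, List, Optional


def plan_chunks(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int,
    split_sentences: bool,
    default_split_fn: Callable[[str], List[str]],
    custom_split_fn: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Return the pieces of `text` that should be synthesized one by one."""
    # The text fits into the model's context: no splitting needed.
    if count_tokens(text) <= max_tokens:
        return [text]
    # Too long and split_sentences=True: use the internal splitter.
    if split_sentences:
        return default_split_fn(text)
    # Too long and split_sentences=False: the caller must supply a splitter.
    if custom_split_fn is None:
        raise ValueError(
            "text exceeds max_tokens; pass custom_split_fn or set split_sentences=True"
        )
    return custom_split_fn(text)
```

Note that under this sketch `split_sentences=True` only splits when the text is too long, which follows the idea as described but differs from always splitting.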
2- Until a cleaner solution (the idea noted above or similar) is implemented, this block in `TTS/utils/synthesizer.py`:
    if text:
        sens = [text]
        if split_sentences:
            print(" > Text splitted to sentences.")
            sens = self.split_into_sentences(text)
        print(sens)
could temporarily be changed to:
    if text:
        if isinstance(text, str):
            sens = [text]
        elif isinstance(text, list):
            sens = text
        else:
            raise ValueError(f"{text} is not of type string or list")
        if split_sentences:
            print(" > Text splitted to sentences.")
            sens = self.split_into_sentences(text)
        print(sens)
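The same normalization can be tried outside the library as a small standalone function. This is only a sketch: `to_sentences` and `split_fn` are hypothetical names, and `split_fn` stands in for the synthesizer's internal splitter. It also makes one case explicit that is otherwise left open, namely a list passed together with `split_sentences=True`, which is rejected here as an assumption about the desired behavior.

```python
from typing import Callable, List, Sequence, Union


def to_sentences(
    text: Union[str, Sequence[str]],
    split_sentences: bool,
    split_fn: Callable[[str], List[str]],
) -> List[str]:
    """Normalize `text` into the list of sentences to synthesize."""
    if isinstance(text, str):
        # A single string is split internally only when requested.
        return split_fn(text) if split_sentences else [text]
    if isinstance(text, (list, tuple)) and all(isinstance(s, str) for s in text):
        if split_sentences:
            # Assumption: a pre-split list and internal splitting are exclusive.
            raise ValueError("pass a single string when split_sentences=True")
        return list(text)
    raise TypeError(f"text must be a str or a list of str, got {type(text).__name__}")
```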
Also, is there a reason to use `print` instead of a logger? Why is `sens` printed here? IMO this should only be acceptable during debugging.
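For illustration, the two `print` calls could be replaced with the standard `logging` module roughly as follows. `log_split` is a hypothetical helper, not something that exists in the library; the point is only that the sentence list goes to the DEBUG level while the short notice stays at INFO.

```python
import logging

logger = logging.getLogger(__name__)


def log_split(sens):
    """Report a sentence split without dumping the text at INFO level."""
    logger.info("Text split into %d sentences.", len(sens))
    # The full sentence list is only emitted when DEBUG logging is enabled.
    logger.debug("Sentences: %s", sens)
```

Applications using the library could then choose the verbosity themselves, e.g. with `logging.basicConfig(level=logging.DEBUG)` while debugging.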