
Error when passing a custom list with strings as text when `split_sentences=False`.

Open Mo-MR-123 opened this issue 7 months ago • 9 comments

Describe the bug

When I pass a list of sentences produced by a custom split function, the TTS model (tts_models/multilingual/multi-dataset/xtts_v2, to be specific) with split_sentences=False throws the following error:

sent = sent.strip().lower()
           ^^^^^^^^^^
AttributeError: 'list' object has no attribute 'strip'

After some digging in the TTS code, I noticed that the synthesizer.tts function (in TTS.utils) always assumes the input is a string rather than a list of strings, which is essential when a custom split function is used. This holds regardless of whether split_sentences is False or True, although with split_sentences=True a list of strings is not expected, since the splitting is done internally.

To Reproduce

from TTS.api import TTS
import torch

# This example list of strings is normally generated by a custom splitting function.
example = [ "This is a sample sentence.", "Another sample sentence." ]

dev = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(dev)
tts.tts_to_file(text=example, speaker_wav="any/sample/wav/here", language="en", file_path="test.wav", split_sentences=False)

Expected behavior

I expect a list of strings to be accepted as text when split_sentences=False is used.

Logs

No response

Environment

- TTS version: 0.22.0
- Pytorch version: 2.3.0
- Python version: 3.11.9
- OS: Win 11
- CUDA version: 12.1
- installed pytorch using `python -m pip install torch==2.3.0 torchaudio==2.3.0 -i https://download.pytorch.org/whl/cu121`

Additional context

Ideas on how to solve this issue:

1- Use the tokenizer to check how many tokens the model accepts at once (assuming the text argument is a string). If the text does not fit into the model's maximum context, split it with a user-provided function (when split_sentences=False) or with the existing internal sentence splitter (when split_sentences=True). To support this, tts_to_file and any other synthesis function should accept a parameter, e.g. "custom_split_fn", for the split_sentences=False case. That way text can always remain a string.
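
A minimal sketch of idea 1, not the library's actual implementation. The names `chunk_text`, `count_tokens`, `max_tokens`, `default_split` and `custom_split_fn` are hypothetical; the word-count token counter and the regex splitter only stand in for the model's tokenizer and the internal sentence splitter.

```python
import re
from typing import Callable, List, Optional


def default_split(text: str) -> List[str]:
    """Stand-in for the internal splitter: break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_text(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int,
    split_sentences: bool = True,
    custom_split_fn: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Return the text as one chunk if it fits, otherwise split it."""
    # Text fits into the (assumed) context limit: no splitting needed.
    if count_tokens(text) <= max_tokens:
        return [text]
    # Too long: use the internal splitter when split_sentences is True.
    if split_sentences:
        return default_split(text)
    # Otherwise the caller must supply their own splitting function.
    if custom_split_fn is None:
        raise ValueError("text exceeds max_tokens and no custom_split_fn was given")
    return custom_split_fn(text)


if __name__ == "__main__":
    # Hypothetical token counter: one token per whitespace-separated word.
    count_words = lambda s: len(s.split())
    print(chunk_text("This is a sample sentence. Another sample sentence.", count_words, 5))
```

With this shape, `text` stays a string at the public API and the list of chunks only exists internally.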

2- As a temporary measure, until a cleaner solution (the idea above or similar) is implemented, this code in TTS/utils/synthesizer.py:

if text:
    sens = [text]
    if split_sentences:
        print(" > Text splitted to sentences.")
        sens = self.split_into_sentences(text)
    print(sens)

should be changed to:

if text:
    if isinstance(text, str):
        sens = [text]
    elif isinstance(text, list):
        sens = text
    else:
        raise ValueError(f"{text} is not of type string or list")

    if split_sentences:
        print(" > Text splitted to sentences.")
        sens = self.split_into_sentences(text)
    print(sens)
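
The same normalization can be sketched as a standalone helper, which also covers a case the snippet above leaves open: a list passed together with split_sentences=True. Here `to_sentences` and `split_fn` are hypothetical names, and splitting each list item individually in that case is an assumption about the desired behavior, not something the library does.

```python
from typing import Callable, List, Sequence, Union


def to_sentences(
    text: Union[str, Sequence[str]],
    split_sentences: bool,
    split_fn: Callable[[str], List[str]],
) -> List[str]:
    """Normalize a string or a list of strings into a list of sentences."""
    if isinstance(text, str):
        # A plain string is split only when requested.
        return split_fn(text) if split_sentences else [text]
    if isinstance(text, (list, tuple)) and all(isinstance(t, str) for t in text):
        if split_sentences:
            # Assumed behavior: split every pre-split item again.
            return [s for item in text for s in split_fn(item)]
        # Pre-split input is passed through unchanged.
        return list(text)
    raise TypeError(f"text must be a str or a list of str, got {type(text).__name__}")
```

Passing `split_fn` explicitly keeps the helper independent of the synthesizer class, so it can be tested without loading a model.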

Also, is there a reason to use print instead of a logger? Why is sens printed here? IMO this should only be acceptable during debugging.
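
For illustration, the two print calls could be replaced with the standard logging module roughly as follows. This is only a sketch: the logger name and the `log_sentences` helper are hypothetical and not part of the TTS code base.

```python
import logging
from typing import List

# Hypothetical logger name; a real module would typically use __name__.
logger = logging.getLogger("tts.synthesizer.sketch")


def log_sentences(sens: List[str], was_split: bool) -> None:
    """Report sentence splitting through logging instead of print."""
    if was_split:
        logger.info("Text split into %d sentences.", len(sens))
    # The full sentence list is emitted only at DEBUG level.
    logger.debug("Sentences: %s", sens)
```

Users would then control the verbosity through the usual logging configuration, and the sentence dump would stay hidden unless DEBUG output is enabled.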

Mo-MR-123, Jul 15 '24 13:07