
Benchmarks of parler-tts, the emergence of TTS!

Open BBC-Esq opened this issue 4 months ago • 7 comments

Hey @sanchit-gandhi, I like the repo. Excited to see this being worked on. For comparison against WhisperSpeech, I ran your sample script on the exact same text snippet, and it finished processing in 16.04 seconds. However, this repo runs in float32, while I believe WhisperSpeech is being run in float16. Can you provide the modification to run in float16, or even bfloat16? I'm going to do a comparison of this repo, Bark, and WhisperSpeech:

[image: screenshot of benchmark timings]

I want to add that this says nothing about the quality, only speed. I'll evaluate quality next after I ensure comparable testing procedures regarding compute time. Here's the script I used:

import time
import sounddevice as sd
import torch
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

# Setup device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

# Prepare input
prompt = "This script processes a body of text one sentence at a time and plays them consecutively. This enables the audio playback to begin sooner instead of waiting for the entire body of text to be processed. The script uses the threading and queue modules that are part of the standard Python library. It also uses the sound device library, which is fairly reliable across different platforms. I hope you enjoy, and feel free to modify or distribute at your pleasure."
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

# Start timer
start_time = time.time()

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

# End timer
end_time = time.time()
processing_time = end_time - start_time

# Print processing time in green
print(f"\033[92mProcessing time: {processing_time:.2f} seconds\033[0m")

sampling_rate = model.config.sampling_rate
sd.play(audio_arr, samplerate=sampling_rate)
sd.wait()
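In case it helps, here's what I'd guess the half-precision change looks like, based on the standard `torch_dtype` argument to `from_pretrained` and torch's `Module.half()` cast. This is a sketch, not tested against parler-tts itself; whether generation stays numerically stable in fp16 is exactly what I'm asking about:

```python
import torch

# Assumption: parler-tts tolerates fp16 weights. Two standard options:
# 1) Load directly in half precision:
#    model = ParlerTTSForConditionalGeneration.from_pretrained(
#        "parler-tts/parler_tts_mini_v0.1", torch_dtype=torch.float16,
#    ).to(device)
# 2) Or cast after loading: model = model.half()
#
# The cast itself is plain torch; demonstrated on a toy module:
toy = torch.nn.Linear(4, 4)
toy = toy.half()  # all float parameters become torch.float16
print(toy.weight.dtype)  # torch.float16
```

Swapping `torch.float16` for `torch.bfloat16` would give the bfloat16 variant, which tends to be more forgiving numerically on Ampere-class GPUs.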

Lastly, let me know what other "speedups" I can use, such as BetterTransformer (though I think its scaled dot-product attention kernels are part of torch itself now, unless I'm mistaken). I can't test Flash Attention 2 (FA2) unless you help me install it; I've tried.
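For what it's worth, the attention fast path I mean is `torch.nn.functional.scaled_dot_product_attention`, which ships with PyTorch 2.0+; whether parler-tts can be pointed at it (e.g. via an `attn_implementation="sdpa"` flag like other transformers models) is an assumption on my part:

```python
import torch
import torch.nn.functional as F

# SDPA is built into PyTorch >= 2.0: it dispatches to a fused kernel
# (flash / memory-efficient / math) depending on hardware and dtypes.
q = torch.randn(1, 2, 8, 16)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 2, 8, 16)
v = torch.randn(1, 2, 8, 16)
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 2, 8, 16])
```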

BBC-Esq avatar Apr 12 '24 23:04 BBC-Esq