
add speed option to inference

Open mrciolino opened this issue 1 year ago • 1 comments

Add a speed option to the infer function https://github.com/sidharthrajaram/StyleTTS2/blob/350b8889d75a52c2c695dafa7ba4682386be0b96/src/styletts2/tts.py#L186-L195

I have tested this by dividing the predicted duration by a value: https://github.com/sidharthrajaram/StyleTTS2/blob/350b8889d75a52c2c695dafa7ba4682386be0b96/src/styletts2/tts.py#L267-L267

Adding speed=1 to the function signature and dividing the predicted duration by speed gives the following durations of the .wav files for various speeds. Speeds between 0.75 and 1.75 sound good; outside that range the output is rough.

duration = torch.sigmoid(duration).sum(axis=-1) / speed

    def inference(self,
                  text: str,
                  target_voice_path=None,
                  output_wav_file=None,
                  output_sample_rate=24000,
                  alpha=0.3,
                  beta=0.7,
                  diffusion_steps=5,
                  embedding_scale=1,
                  speed=1,
                  ref_s=None):

[image] The orange line is the duration of the original clip divided by the speed parameter. The blue line is the duration of the clip produced when the speed parameter was used.
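One possible reason the two lines do not overlap exactly, sketched below in plain Python: if the scaled per-token durations are rounded to whole frames and clamped to at least one frame afterwards (I believe the upstream inference code does this right after the line changed above, but treat that as an assumption), then short tokens cannot shrink any further at high speeds. The function name scale_durations and the sample durations are hypothetical and only for illustration.

```python
def scale_durations(frame_durations, speed=1.0, min_frames=1):
    """Divide per-token durations (in frames) by speed, then round and clamp.

    Assumption: this mirrors a round-then-clamp step applied after the
    predicted duration has been divided by speed.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [max(min_frames, round(d / speed)) for d in frame_durations]


# Hypothetical per-token durations in frames (not taken from the model).
example_durations = [3.2, 1.4, 6.8, 2.1, 1.0]

for speed in (0.5, 1.0, 2.0):
    scaled = scale_durations(example_durations, speed)
    print(f"speed={speed}: frames per token={scaled}, total={sum(scaled)}")
```

With these made-up numbers the total at speed 2 ends up slightly above half of the total at speed 1, because tokens that would drop below one frame are held at one frame.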

Had to convert to mp4 to play on here:

https://github.com/sidharthrajaram/StyleTTS2/assets/39249797/fd021f78-2861-4357-8d9d-f129914ec99e

https://github.com/sidharthrajaram/StyleTTS2/assets/39249797/0cd0c426-82c3-468c-a870-f1f7722c3bba

https://github.com/sidharthrajaram/StyleTTS2/assets/39249797/e2c1a315-792d-41d4-b597-45116359ce7d

https://github.com/sidharthrajaram/StyleTTS2/assets/39249797/82531bfd-824d-42e4-a6a1-5b869439cd08

https://github.com/sidharthrajaram/StyleTTS2/assets/39249797/c1458065-be2c-4dc5-952d-166ed2e5114a

https://github.com/sidharthrajaram/StyleTTS2/assets/39249797/3863b98e-1d8d-4ca6-8afa-e9cb075bc6f9

https://github.com/sidharthrajaram/StyleTTS2/assets/39249797/bc878ebb-0607-4c22-9711-0ad9644a3ecb

https://github.com/sidharthrajaram/StyleTTS2/assets/39249797/37107bc5-d19f-41b8-8aa8-3e420a70832b

https://github.com/sidharthrajaram/StyleTTS2/assets/39249797/94c15d23-057b-4299-b830-287f015e026f

https://github.com/sidharthrajaram/StyleTTS2/assets/39249797/482e385b-db15-4534-a78f-432a4af49a70

And here is the code I ran to test that after adding in those changes:

import matplotlib.pyplot as plt
from styletts2 import tts
import numpy as np
import librosa

# No paths provided means default checkpoints/configs will be downloaded/cached.
my_tts = tts.StyleTTS2()

# Synthesize the same sentence at 10 speeds from 0.5 to 2 and write each result to a WAV file.
speed_range = np.linspace(0.5, 2, 10)
for speed in speed_range:
    out = my_tts.inference(
        "Hello there, I am now a python package.",
        output_wav_file=f"test_{speed:.2f}.wav",
        speed=speed,
    )

# measure the duration of each generated file
durations = {}
for speed in speed_range:
    duration = librosa.get_duration(path=f"test_{speed:.2f}.wav")
    print(f"test_{speed:.2f}.wav: {duration:.2f}s")
    durations[speed] = duration


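# Using the speed=1 clip as the baseline, plot the ideal line (baseline / speed).
# Look the baseline up by nearest key so float rounding in linspace cannot cause a KeyError.
baseline = durations[min(durations, key=lambda s: abs(s - 1))]
expected_durations = [baseline / speed for speed in speed_range]
plt.plot(speed_range, list(durations.values()), label="Actual")
plt.plot(speed_range, expected_durations, label="Expected")
plt.xlabel("Speed")
plt.ylabel("Duration (s)")
plt.legend()  # show the "Actual"/"Expected" labels
plt.show()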
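
To put a number on how far the actual durations drift from the ideal line, something like the following could be run on the durations dict from the script above. This is only a sketch: relative_deviation is a hypothetical helper, and the measurements in the example are made up, not the ones from my test.

```python
def relative_deviation(durations, baseline_speed=1.0):
    """Return {speed: (actual - expected) / expected}.

    expected is the duration at the speed closest to baseline_speed,
    divided by the speed being checked.
    """
    baseline_key = min(durations, key=lambda s: abs(s - baseline_speed))
    baseline = durations[baseline_key]
    result = {}
    for speed, actual in durations.items():
        expected = baseline / speed
        result[speed] = (actual - expected) / expected
    return result


# Hypothetical measurements in seconds, for illustration only.
example = {0.5: 4.9, 1.0: 2.5, 2.0: 1.4}
for speed, dev in relative_deviation(example).items():
    print(f"speed={speed}: {dev:+.1%} vs. expected")
```

A positive value means the clip came out longer than baseline / speed would predict.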

mrciolino, Jan 12 '24 17:01