[show and tell] apple mps support

Open bghira opened this issue 4 months ago • 7 comments

With newer PyTorch (2.4 nightly) we get bfloat16 support on MPS.

I tested this:

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "mps:0"

# bfloat16 on MPS needs a recent PyTorch (2.4 nightly at the time of writing)
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device=device, dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "welcome to huggingface"
description = "An old man."

# the voice description drives input_ids, the spoken text drives prompt_input_ids
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device=device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device=device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
# NumPy has no bfloat16 dtype, so cast to float32 before moving to CPU
audio_arr = generation.to(torch.float32).cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

bghira avatar Apr 10 '24 19:04 bghira

That's awesome, thanks for sharing @bghira! How fast was inference on your local machine?

sanchit-gandhi avatar Apr 11 '24 11:04 sanchit-gandhi

It gets slower as the sample size increases, but this test script takes about 10 seconds to run on an M3 Max.
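
For anyone reproducing these timings, here is a minimal sketch of how the generation step could be measured, reusing the model and inputs from the script above (torch.mps.synchronize() makes sure queued MPS work is included in the wall-clock time):

import time
import torch

# assumes `model`, `input_ids` and `prompt_input_ids` from the script above
torch.mps.synchronize()  # flush pending MPS work before starting the clock
start = time.perf_counter()
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
torch.mps.synchronize()  # wait for generation to finish on the GPU
elapsed = time.perf_counter() - start

audio_seconds = generation.shape[-1] / model.config.sampling_rate
print(f"{audio_seconds:.1f} s of audio generated in {elapsed:.1f} s")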

bghira avatar Apr 11 '24 12:04 bghira

I got this working as well! Inference time seems to increase more than linearly with prompt size:

  • 3 seconds of audio: 10 seconds of generation
  • 8 seconds of audio: ~90 seconds of generation
  • 10 seconds of audio: ~3 minutes of generation

I think the reason is that inference itself takes a surprising amount of memory: loading the model takes the expected ~3 GB, but inference then takes 15 GB on top of that, which is probably what's slowing it down on my machine (16 GB M2).
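
A quick way to check whether memory (and swapping) is the bottleneck is to read the MPS allocator counters around generation. A rough sketch, assuming the model and inputs from the script at the top of the thread:

import torch

def mps_mem_gb():
    # tensor memory currently allocated vs. total memory held by the Metal driver
    return (torch.mps.current_allocated_memory() / 1e9,
            torch.mps.driver_allocated_memory() / 1e9)

print("before generate:", mps_mem_gb())
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
print("after generate:", mps_mem_gb())

If driver_allocated_memory climbs towards the machine's total unified memory, macOS will start compressing and swapping, which would match the slowdown described above.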

maxtheman avatar Apr 12 '24 02:04 maxtheman

Is swapping kicking in? I will try on a Mac mini M2 (24 GB). Do we know the performance on CUDA on a similar machine?

QueryType avatar Apr 12 '24 02:04 QueryType

On the 128 GB M3 Max I can get pretty far into the output window before the time increases to 3 minutes.

It takes about a minute for 30 seconds of audio.

bghira avatar Apr 12 '24 02:04 bghira

I am getting 2 seconds of audio in 11 seconds of generation and 6 seconds of audio in 36 seconds.

QueryType avatar Apr 12 '24 13:04 QueryType

My data, on a 64 GB M2 Max:

seconds of audio    CPU (seconds of generation)    MPS (seconds of generation)
1                   7                              10
3                   13                             17
7                   30                             44
9                   41                             194
18                  71                             308
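
For comparison with the table above, a rough harness along these lines could run the same prompt on both backends; this is only a sketch, with float32 assumed for the CPU run and bfloat16 for MPS as in the first script:

import time
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

repo = "parler-tts/parler_tts_mini_v0.1"
tokenizer = AutoTokenizer.from_pretrained(repo)
description = "An old man."
prompt = "welcome to huggingface"

for device, dtype in [("cpu", torch.float32), ("mps:0", torch.bfloat16)]:
    model = ParlerTTSForConditionalGeneration.from_pretrained(repo).to(device=device, dtype=dtype)
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    start = time.perf_counter()
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    if device.startswith("mps"):
        torch.mps.synchronize()  # include queued GPU work in the timing
    elapsed = time.perf_counter() - start

    audio_s = generation.shape[-1] / model.config.sampling_rate
    print(f"{device}: {audio_s:.1f} s of audio in {elapsed:.1f} s")

Longer rows like the ones in the table would come from longer prompts.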

janewu77 avatar Apr 15 '24 09:04 janewu77