
Error when using sdpa and flash_attention_2


Hello, thanks for this great work! I followed the INFERENCE instructions, but I ran into some difficulties.

from parler_tts import ParlerTTSForConditionalGeneration
import torch
from transformers import AutoTokenizer
import soundfile as sf


torch_device = "cuda:0" # use "mps" for Mac
torch_dtype = torch.float32  # note: flash_attention_2 expects torch.float16 or torch.bfloat16
model_name = "parler-tts/parler-tts-mini-v1"

attn_implementation = "sdpa" # "sdpa" or "flash_attention_2"

model = ParlerTTSForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch_dtype, attn_implementation=attn_implementation).to(torch_device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Hey, how are you doing today?"
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(torch_device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(torch_device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().to(torch.float32).numpy().squeeze()

sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

When I set attn_implementation="sdpa", I get this error:

ValueError: T5EncoderModel does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. Please request the support for this architecture: https://github.com/huggingface/transformers/issues/28005. If you believe this error is a bug, please open an issue in Transformers GitHub repository and load your model with the argument `attn_implementation="eager"` meanwhile. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="eager")`

and when I set attn_implementation="flash_attention_2", I get this error:

ValueError: T5EncoderModel does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co/google/flan-t5-large/discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new

I am using an A100 GPU; my environment is:

transformers                4.46.1
torch                       2.3.0
flash-attn                  2.5.8

Am I missing some important configuration information?

aixingxy, Nov 22 '24
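
Both errors come from the T5 text encoder (google/flan-t5-large, per the second traceback), which in this transformers version accepts neither sdpa nor flash_attention_2. As the first traceback itself suggests, a temporary fallback is to load the model with eager attention; a minimal sketch reusing the names from the snippet above:

# Workaround sketch, as suggested by the ValueError itself: fall back to eager
# attention until the installed transformers version supports sdpa /
# flash_attention_2 for T5EncoderModel. Reuses model_name, torch_dtype and
# torch_device from the snippet above.
model = ParlerTTSForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch_dtype,
    attn_implementation="eager",
).to(torch_device)

The rest of the generation code is unchanged; eager attention may be slower for long sequences, but it avoids the encoder check entirely.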

I am encountering this error too. I'd appreciate any help.

remichu-ai, Nov 29 '24

Your message has been duly received.

aixingxy, Nov 29 '24

I am getting the same issue; I also followed the tutorial exactly.

jack-richards, Nov 29 '24

@aixingxy I am getting the same issue. It was working before on an old Conda installation I have; it seems some update caused this, because I set up a fresh environment this week and hit the error. I'd advise using the default implementation: when I tested all of them on a 3090 and an L40S, I didn't see much difference in speed.

lukaLLM, Dec 11 '24

Bumping the transformers version to 4.48.0 solved the problem.

bzikst, Jan 12 '25
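
A quick runtime check can confirm whether the installed transformers is new enough; a small sketch, assuming the 4.48.0 threshold reported above:

import transformers
from packaging import version  # packaging is already a transformers dependency

# Per the comment above, sdpa / flash_attention_2 start working for this setup
# with transformers >= 4.48.0; warn if the installed version is older.
if version.parse(transformers.__version__) < version.parse("4.48.0"):
    print(f"transformers {transformers.__version__} is older than 4.48.0; "
          "expect the T5EncoderModel attention errors above.")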

Thanks @bzikst. It works for me as well.

farzanehnakhaee70, Feb 27 '25

@bzikst bumping the transformers version to 4.48.0 solved the problem, but I see no difference in inference speed across the three implementations ("eager", "sdpa", "flash_attention_2"). I installed Flash Attention 2 with "pip install flash-attn==2.0.2". Am I missing something? The README says changing the attention implementation significantly increases inference speed.

AbhisarJ, Apr 24 '25
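
One way to check whether the attention implementation actually changes generation speed is to time each one under identical settings. The helper below is only an illustrative sketch (it reuses model_name, torch_device, input_ids and prompt_input_ids from the issue snippet, and switches to float16 because flash_attention_2 does not support float32):

import time
import torch

def time_generation(attn_implementation, n_runs=3):
    # Illustrative helper, not part of the repo: load the model with the given
    # attention implementation and return the average generation time in seconds.
    model = ParlerTTSForConditionalGeneration.from_pretrained(
        model_name,
        torch_dtype=torch.float16,  # flash_attention_2 requires fp16/bf16
        attn_implementation=attn_implementation,
    ).to(torch_device)
    model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)  # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

for impl in ("eager", "sdpa", "flash_attention_2"):
    print(impl, f"{time_generation(impl):.2f} s")

With a single short prompt the gap can be small; faster attention kernels generally pay off more with longer sequences and larger batches, so a modest difference here is not necessarily a misconfiguration.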

Your message has been duly received.

aixingxy, Apr 24 '25