WhisperSpeech
How to load a pretrained model locally?
Hello, this project is awesome! But I have a small problem using it.
I wanted to download the pretrained model ("s2a-q4-small-en+pl.model") and use it in an existing project.
An error occurs when I download the pretrained model you provided and pass it to Pipeline as a dict with a 'local_filename' key.
[code]
from whisperspeech.pipeline import Pipeline

en_text_prompt = "Hello? I'm calling to reserve a room, but is there a room left?"
file_path = "output.wav"  # destination for the generated audio

pipe = Pipeline(s2a_ref={'local_filename': "s2a-q4-tiny-en+pl.model"})
pipe.generate_to_file(file_path, en_text_prompt)
print("WhisperSpeech Test Done!")
[error]
AttributeError: 'dict' object has no attribute 'seek'. You can only torch.load from a file that is seekable.
Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.
Do you have any instructions or a guide for using the pretrained models you provide? I looked through the code, but only the load_model methods of the 't2s' and 's2a' classes worked for me.
Thank you for your research and contribution!
Hey, you can pass the file name as a string, like this:
pipe = Pipeline(s2a_ref="s2a-q4-tiny-en+pl.model")
If you want to avoid downloading anything automatically, you'll need to download t2s_ref and pass it in the same way. You may also need to download Vocos and Encodec. We have an example script for Docker here:
https://github.com/collabora/WhisperFusion/blob/main/docker/scripts/setup-whisperfusion.sh#L19-L23
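For reference, a minimal fully-local setup might look like the sketch below. The exact filenames are examples and assume you've already downloaded both checkpoints next to the script; Vocos and Encodec weights are still fetched into their own caches unless you pre-populate them as the Docker script above does.

from whisperspeech.pipeline import Pipeline

# Assumed local copies of the checkpoints; adjust the paths/filenames
# to wherever you downloaded the models.
pipe = Pipeline(
    t2s_ref="t2s-small-en+pl.model",    # text-to-semantic checkpoint (example filename)
    s2a_ref="s2a-q4-tiny-en+pl.model",  # semantic-to-acoustic checkpoint
)

pipe.generate_to_file("hello.wav", "Hello? I'm calling to reserve a room.")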
Here's another option, a script I made which works alright. The text to be spoken is hardcoded into the script for testing purposes, but I believe it has the structure you're looking for, enough to get started:
from pydub import AudioSegment
import numpy as np
from whisperspeech.pipeline import Pipeline
# pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-small-en+pl.model')
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')
# pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-base-en+pl.model')
audio_tensor = pipe.generate("""
According to the provided context from Georgia Juvenile Practice and Procedure with Forms the preliminary protective hearing in a dependency case must be held promptly after a child is removed from the home and no later than 72 hours after the child is placed in foster care. If this 72-hour time frame expires on a weekend or legal holiday, the hearing should be scheduled for no later than the next day that is not a weekend or legal holiday.
""")
# generate uses CUDA if available; therefore, it's necessary to move to CPU before converting to NumPy array
audio_np = (audio_tensor.cpu().numpy() * 32767).astype(np.int16)
if len(audio_np.shape) == 1:
    # mono output: add a leading channel axis so the layout is (channels, samples)
    audio_np = np.expand_dims(audio_np, axis=0)
else:
    # multi-channel output: transpose so tobytes() interleaves the samples
    audio_np = audio_np.T

print("Array shape:", audio_np.shape)
print("Array dtype:", audio_np.dtype)

try:
    audio_segment = AudioSegment(
        audio_np.tobytes(),
        frame_rate=24000,  # WhisperSpeech outputs 24 kHz audio
        sample_width=2,    # 16-bit samples (int16)
        channels=1,
    )
    audio_segment.export('output_audio.wav', format='wav')
    print("Audio file generated: output_audio.wav")
except Exception as e:
    print(f"Error writing audio file: {e}")
Just FYI, it relies on pydub for writing the audio instead of the pipeline's built-in file output, but it works.
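If you'd rather skip pydub entirely, the pipeline's own generate_to_file (the method used in the original question) writes the WAV directly. A minimal sketch, assuming the same checkpoint reference as above:

from whisperspeech.pipeline import Pipeline

# generate_to_file handles the tensor-to-WAV conversion internally,
# so no manual int16 conversion or channel reshaping is needed.
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')
pipe.generate_to_file('output_audio.wav', "Hello from WhisperSpeech!")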