
How to load a pretrained model locally?

Melona-BS opened this issue 1 year ago · 2 comments

Hello! This project is awesome, but I have a small problem using it.

I wanted to download the pretrained model ("s2a-q4-small-en+pl.model") and use it in an existing project.

An error occurs when I download the pretrained model you provided and pass it to Pipeline as the ref parameter via a 'local_filename' dict.

[code]

from whisperspeech.pipeline import Pipeline

en_text_prompt = "Hello? I'm calling to reserve a room, but is there a room left?"
file_path = "whisperspeech_test.wav"  # output path (defined here so the snippet is self-contained)

# s2a_ref is passed as a dict here, which leads to the error below
pipe = Pipeline(s2a_ref={'local_filename': "s2a-q4-tiny-en+pl.model"})
pipe.generate_to_file(file_path, en_text_prompt)

print("WhisperSpeech Test Done!")

[error]

AttributeError: 'dict' object has no attribute 'seek'. You can only torch.load from a file that is seekable. 
Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.

Do you have any instructions or a guide for using the pretrained models you provide? I looked through the code, but only the load_model methods of the 't2s' and 's2a' classes worked for me.

Thank you for your research and contribution!

Melona-BS · Feb 01 '24

Hey, you can pass the file name as a string, like this:

pipe = Pipeline(s2a_ref="s2a-q4-tiny-en+pl.model")

If you want to avoid downloading anything automatically, you'll need to download the t2s checkpoint as well and pass it in as t2s_ref the same way. You may also need to download Vocos and EnCodec. We have an example script for Docker here: https://github.com/collabora/WhisperFusion/blob/main/docker/scripts/setup-whisperfusion.sh#L19-L23
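If you already have both checkpoints on disk, a minimal fully-local sketch looks like this (the t2s filename below is an assumption; substitute whichever t2s checkpoint you actually downloaded):

from whisperspeech.pipeline import Pipeline

# Both refs are plain local paths, so Pipeline loads them from disk
# instead of fetching them from the Hugging Face Hub.
pipe = Pipeline(
    t2s_ref="t2s-small-en+pl.model",   # assumed local t2s checkpoint filename
    s2a_ref="s2a-q4-tiny-en+pl.model",
)
pipe.generate_to_file("output.wav", "Hello, world!")

Vocos and EnCodec are pulled in by their own libraries, so they may still be downloaded on first use unless you pre-populate the cache the way the linked script does.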

jpc · Feb 01 '24

Here's another option: a script I made that works alright. The text to be spoken is hardcoded into the script for testing purposes, but I believe it has the structure you're looking for, enough to get started:

from pydub import AudioSegment
import numpy as np
from whisperspeech.pipeline import Pipeline

# pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-small-en+pl.model')
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')
# pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-base-en+pl.model')

audio_tensor = pipe.generate("""
 According to the provided context from Georgia Juvenile Practice and Procedure with Forms the preliminary protective hearing in a dependency case must be held promptly after a child is removed from the home and no later than 72 hours after the child is placed in foster care. If this 72-hour time frame expires on a weekend or legal holiday, the hearing should be scheduled for no later than the next day that is not a weekend or legal holiday.
""")

# generate uses CUDA if available; therefore, it's necessary to move to CPU before converting to NumPy array
audio_np = (audio_tensor.cpu().numpy() * 32767).astype(np.int16)

if len(audio_np.shape) == 1:
    audio_np = np.expand_dims(audio_np, axis=0)
else:
    audio_np = audio_np.T

print("Array shape:", audio_np.shape)
print("Array dtype:", audio_np.dtype)

try:
    audio_segment = AudioSegment(
        audio_np.tobytes(),
        frame_rate=24000,  # WhisperSpeech's vocoder outputs 24 kHz audio
        sample_width=2,    # 2 bytes per sample for int16
        channels=1
    )
    audio_segment.export('output_audio.wav', format='wav')
    print("Audio file generated: output_audio.wav")
except Exception as e:
    print(f"Error writing audio file: {e}")

Just an FYI: it relies on pydub for writing the audio file rather than the library's own generate_to_file path.
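If you'd rather not depend on pydub, here's a minimal alternative sketch using torchaudio (assuming, as the shape check above suggests, that generate returns a 1-D float tensor at 24 kHz):

import torchaudio

audio = pipe.generate("Hello, world!").cpu()
if audio.dim() == 1:
    audio = audio.unsqueeze(0)  # torchaudio.save expects (channels, frames)
torchaudio.save("output_audio.wav", audio, 24000)

Or simply call pipe.generate_to_file("output_audio.wav", text), which handles the file writing for you.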

BBC-Esq · Feb 02 '24