
how to use sesame labs?

GaleiqTesting opened this issue 11 months ago • 20 comments

how to use sesame labs?

GaleiqTesting avatar Mar 19 '25 02:03 GaleiqTesting

Someone converted it to MLX: https://github.com/senstella/csm-mlx. Note that the CSM model Sesame released as open source is only good for generating a short sentence at a time; beyond that, it produces garbage.

chigkim avatar Mar 19 '25 10:03 chigkim

It's in the repo already, but there's no release for it yet, so you need to install it from the main branch:

pip install git+https://github.com/Blaizzy/mlx-audio.git@main

then:

python -m mlx_audio.tts.generate --model mlx-community/csm-1b --text "Hello from Sesame." --play

You can provide the --ref_audio and --ref_text parameters to use voice matching as well.
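Putting those flags together, a voice-matching invocation might look like the following. The flags themselves come from this thread, but the reference file name and its transcript are placeholders:

```shell
# Voice matching: condition generation on a reference clip.
# reference.wav and its transcript are placeholders -- substitute your own.
python -m mlx_audio.tts.generate \
  --model mlx-community/csm-1b \
  --text "Hello from Sesame." \
  --ref_audio reference.wav \
  --ref_text "Transcript of the reference clip." \
  --play
```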

lucasnewman avatar Mar 19 '25 17:03 lucasnewman

Exactly what @lucasnewman said ✅

Release is coming later today.

I'm just finishing porting suno-bark :)

Blaizzy avatar Mar 19 '25 18:03 Blaizzy

How do we get just an array of audio data directly in Python, without saving to a file or using the CLI?

model = load_model(model_id)
Fetching 2 files: 100%|█████████████████████████| 2/2 [00:00<00:00, 2720.92it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/cgk/Desktop/coding/llm/mlx/mlx-audio/mlx_audio/tts/utils.py", line 142, in load_model
    model_config = model_class.ModelConfig.from_dict(config)
                   ^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'mlx_audio.tts.models.sesame' has no attribute 'ModelConfig'

Thanks!

chigkim avatar Mar 20 '25 10:03 chigkim

Fixed in #50

Blaizzy avatar Mar 20 '25 11:03 Blaizzy
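For the in-Python question above, a minimal sketch might look like this. `load_model` and its import path come from the traceback; the `generate` call and its return shape are assumptions and may not match the actual mlx-audio interface:

```python
def synthesize(text, model_id="mlx-community/csm-1b"):
    """Hypothetical sketch: get raw audio arrays without touching disk.

    `load_model` appears in the traceback above; the generate() call and
    its return value are assumptions, not the confirmed mlx-audio API.
    """
    # Deferred import so the helper can be defined without mlx-audio installed.
    from mlx_audio.tts.utils import load_model

    model = load_model(model_id)
    # Assumed: generate() yields segments carrying an `audio` array.
    return [segment.audio for segment in model.generate(text=text)]
```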

Awesome, thank you for the fix at lightning speed!!! Quick question: is there a way to construct a context with two speakers and feed it to model.generate for Sesame CSM?

Here's a snippet from SesameAILabs/csm.

import torchaudio

from generator import Segment, load_csm_1b

# Load the CSM generator (see the upstream repo for device options).
generator = load_csm_1b(device="cpu")

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Load the clip and resample it to the generator's sample rate.
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)

Thanks always!

chigkim avatar Mar 20 '25 12:03 chigkim

I saw the same example!

I'm not sure it's functional here yet, but it's possible, and it's at the top of the list after I finish the Orpheus port.

Blaizzy avatar Mar 20 '25 12:03 Blaizzy

Thank you! Would you be interested in a PR for automatically generating ref_text and resampling to 24 kHz, so people can supply any wav file as ref_audio?

On a related note, are you going to implement training/finetuning for Orpheus at some point? That would be amazing for training/finetuning TTS models with MLX! Then I guess we'd need a way to convert MLX models back to torch to distribute them to a wider audience? lol NO idea how you keep up with so many things to do! :)

chigkim avatar Mar 20 '25 12:03 chigkim
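The resampling half of that idea doesn't even need torch. As a rough illustration of the operation (not mlx-audio code), linear interpolation to CSM's 24 kHz rate can be sketched in plain Python; real code would use a proper resampler like torchaudio's:

```python
TARGET_RATE = 24_000  # CSM's expected sample rate


def resample_linear(samples, orig_rate, new_rate=TARGET_RATE):
    """Crude linear-interpolation resampler (illustration only)."""
    if orig_rate == new_rate:
        return list(samples)
    n_out = int(round(len(samples) * new_rate / orig_rate))
    out = []
    for i in range(n_out):
        # Position of this output sample in the input's time base.
        pos = i * orig_rate / new_rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

One second of 16 kHz audio (16,000 samples) comes back as 24,000 samples at the target rate.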

That would be awesome!

We can achieve it with STT

Blaizzy avatar Mar 20 '25 12:03 Blaizzy

Yes, training Orpheus is a possibility.

I believe it can be done with the help of MLX-LM and some high-level utils for processing the audio output.

In general, I'm thinking about a trainer, but I'm waiting on a few new models so I can get a good idea of a design that fits them, since each model varies drastically.

Blaizzy avatar Mar 20 '25 12:03 Blaizzy

NO idea how you keep up with so many things to do! :)

Neither do I 🤣🙌🏽

Blaizzy avatar Mar 20 '25 12:03 Blaizzy

Literally so many new things coming out!

On a sidenote does mlx-audio implement anything like this: https://github.com/freddyaboulton/fastrtc

Specifically @reach_vb mentioned this post on X: https://github.com/sofi444/realtime-transcription-fastrtc

I’m thinking it would enable something like MLX realtime audio chat.

I don’t know much about coding so have no idea if it’s beneficial in our case or just adds unnecessary complexity.

GaleiqTesting avatar Mar 20 '25 13:03 GaleiqTesting

Just saw that one of the examples is also real-time object detection. I wonder if it could be implemented in MLX-VLM, specifically for on-screen bounding boxes 🤔 Allowing for a proper macOS Computer Use agent.

GaleiqTesting avatar Mar 20 '25 13:03 GaleiqTesting

Literally so many new things coming out!

On a sidenote does mlx-audio implement anything like this: https://github.com/freddyaboulton/fastrtc

Specifically @reach_vb mentioned this post on X: https://github.com/sofi444/realtime-transcription-fastrtc

I’m thinking it would enable something like MLX realtime audio chat.

I don’t know much about coding so have no idea if it’s beneficial in our case or just adds unnecessary complexity.

Yes, I'm aware of it.

This fits into Speech-To-Speech side. I did some basic prototyping but I haven't started working on STS yet.

Blaizzy avatar Mar 20 '25 13:03 Blaizzy

Just saw that one of the examples is also real-time object detection. I wonder if it could be implemented in MLX-VLM, specifically for on-screen bounding boxes 🤔 Allowing for a proper macOS Computer Use agent.

Cool idea, but that's not how it will work in practice.

The current GUI models are big, and you don't want to run them like that. Even if it were possible, you don't need real time to have good Computer Use, since it would introduce noise.

I have a talk I gave last year about it that I will upload to YT.

Blaizzy avatar Mar 20 '25 13:03 Blaizzy

Brilliant! Thanks a lot for your help. If you need help testing from a layperson's perspective, I'd be more than happy to do so, for:

  • Computer Use
  • TTS or STS

I have a MacBook M1 Max with 64GB.

GaleiqTesting avatar Mar 20 '25 13:03 GaleiqTesting

Thanks, much needed!

Will keep that in mind :)

Blaizzy avatar Mar 20 '25 13:03 Blaizzy

That would be awesome!

We can achieve it with STT

Exactly, I already have working code using mlx-whisper. I'll create a PR shortly.

chigkim avatar Mar 20 '25 15:03 chigkim
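Auto-generating `ref_text` along those lines might look like this sketch. It leans on mlx-whisper's top-level `transcribe()`; the model repo name is just one common choice, and the import is deferred so the helper can be defined without the package installed:

```python
def make_ref_text(ref_audio_path,
                  repo="mlx-community/whisper-large-v3-turbo"):
    """Transcribe a reference clip so its text can serve as --ref_text.

    Sketch only: the repo name is an assumption -- any mlx-community
    Whisper conversion should work.
    """
    import mlx_whisper  # deferred: only needed when actually transcribing

    result = mlx_whisper.transcribe(ref_audio_path, path_or_hf_repo=repo)
    return result["text"].strip()
```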

Adding instructions to the README, including the --ref_audio and --ref_text parameters, would be great. I'll try it later.

ivanfioravanti avatar Mar 22 '25 10:03 ivanfioravanti

@charmaineem here's a great section for the docs: inference examples for each model.

Blaizzy avatar Mar 29 '25 13:03 Blaizzy