How to use Sesame Labs?
Someone converted it to MLX: https://github.com/senstella/csm-mlx. The CSM that Sesame released as open source is only good for generating a short sentence at a time, though; beyond that, it produces garbage.
It's in the repo already, but there's no release for it yet, so you need to install it from the main branch:
pip install git+https://github.com/Blaizzy/mlx-audio.git@main
then:
python -m mlx_audio.tts.generate --model mlx-community/csm-1b --text "Hello from Sesame." --play
You can provide the --ref_audio and --ref_text parameters to use voice matching as well.
Exactly what @lucasnewman said ✅
Release is coming later today.
I'm just finishing porting suno-bark :)
How do we get just the array of audio data directly in Python, without saving to a file or using the CLI?
model = load_model(model_id)
Fetching 2 files: 100%|█████████████████████████| 2/2 [00:00<00:00, 2720.92it/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/cgk/Desktop/coding/llm/mlx/mlx-audio/mlx_audio/tts/utils.py", line 142, in load_model
model_config = model_class.ModelConfig.from_dict(config)
^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'mlx_audio.tts.models.sesame' has no attribute 'ModelConfig'
Thanks!
Fixed in #50
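For the original question about getting raw samples without the CLI, here is a minimal sketch. The `load_model` path is taken from the traceback above; the `generate()` signature and the per-segment `.audio` attribute are assumptions, so check the sesame model class in mlx-audio for the real API. The `to_int16_pcm` helper just shows how to turn float samples into 16-bit PCM once you have the array.

```python
import numpy as np

def to_int16_pcm(samples):
    """Clip float samples to [-1, 1] and scale to 16-bit PCM values."""
    clipped = np.clip(np.asarray(samples, dtype=np.float32), -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)

def generate_to_array(text, model_id="mlx-community/csm-1b"):
    """Hedged sketch: load the model and return raw samples in memory.

    NOTE: generate()'s keyword arguments and return shape are assumptions
    based on this thread; verify against the mlx-audio source.
    """
    from mlx_audio.tts.utils import load_model  # path seen in the traceback above

    model = load_model(model_id)
    segments = model.generate(text=text)
    return np.concatenate([np.asarray(s.audio) for s in segments])
```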
Awesome, thank you for the lightning-fast fix! Quick question: is there a way to construct context with two speakers and feed it to model.generate for Sesame CSM?
Here's a snippet from SesameAILabs/csm.
import torchaudio
from generator import Segment, load_csm_1b

generator = load_csm_1b()  # loads the CSM-1B checkpoint

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Resample each reference clip to the model's sample rate.
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

# One Segment per prior utterance, tagging each with its speaker id.
segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]

audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
Thanks always!
I saw the same example!
Not sure it's functional here yet. But this is possible, and it's at the top of the list after I finish the Orpheus port.
Thank you! Would you be interested in a PR for automatically generating ref_text and resampling to 24 kHz, so people can just supply any WAV file as ref_audio?
On a related note, are you going to implement training/finetuning for Orpheus at some point? It would be amazing to train/finetune TTS models with MLX! Then I guess we would need a way to convert the MLX models back to torch for distribution to a wider audience? lol NO idea how you keep up with so many things to do! :)
That would be awesome!
We can achieve it with STT
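On the resampling half of that PR idea: torchaudio's resampler is what you'd actually use (as in the two-speaker snippet above), but the core idea can be sketched with a naive linear-interpolation resampler in plain NumPy. The function name and 24 kHz default are just illustrative.

```python
import numpy as np

def resample_linear(samples, orig_sr, target_sr=24_000):
    """Naive linear-interpolation resampler.

    In practice torchaudio.functional.resample is the better choice; this
    just shows the idea of mapping any input rate onto 24 kHz.
    """
    samples = np.asarray(samples, dtype=np.float32)
    if orig_sr == target_sr:
        return samples
    duration = len(samples) / orig_sr          # clip length in seconds
    n_out = int(round(duration * target_sr))   # output sample count
    old_t = np.arange(len(samples)) / orig_sr  # input sample times
    new_t = np.arange(n_out) / target_sr       # output sample times
    return np.interp(new_t, old_t, samples).astype(np.float32)
```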
Yes, training Orpheus is a possibility.
I believe it can be done with the help of MLX-LM and some high-level utils for processing the audio output.
In general, I'm thinking about a trainer, but I'm waiting on a few new models so I can get a good idea of a design that fits them, since each model varies drastically.
> NO idea how you keep up with so many things to do! :)
Neither do I 🤣🙌🏽
Literally so many new things coming out!
On a sidenote does mlx-audio implement anything like this: https://github.com/freddyaboulton/fastrtc
Specifically @reach_vb mentioned this post on X: https://github.com/sofi444/realtime-transcription-fastrtc
I’m thinking it would enable something like MLX realtime audio chat.
I don’t know much about coding so have no idea if it’s beneficial in our case or just adds unnecessary complexity.
Just saw one of the examples is also real time object detection. I wonder if it could be implemented in MLX-VLM specifically for on screen bounding boxes 🤔 Allowing for a proper MacOS Computer Use Agent
> Literally so many new things coming out!
> On a sidenote does mlx-audio implement anything like this: https://github.com/freddyaboulton/fastrtc
> Specifically @reach_vb mentioned this post on X: https://github.com/sofi444/realtime-transcription-fastrtc
> I’m thinking it would enable something like MLX realtime audio chat.
> I don’t know much about coding so have no idea if it’s beneficial in our case or just adds unnecessary complexity.
Yes, I'm aware of it.
This fits into Speech-To-Speech side. I did some basic prototyping but I haven't started working on STS yet.
> Just saw one of the examples is also real time object detection. I wonder if it could be implemented in MLX-VLM specifically for on screen bounding boxes 🤔 Allowing for a proper MacOS Computer Use Agent
Cool idea, but that's not how it would work in practice.
The current GUI models are big, and you don't want to run them like that. Even if it were possible, you don't need real-time for a good Computer-Use agent, since it would introduce noise.
I have a talk I gave last year about it that I will upload to YT.
Brilliant! Thanks a lot for your help. If you need help with testing from a layperson's perspective, I'd be more than happy to do so, for:
- Computer Use
- TTS or STS
I have a MacBook M1 Max 64GB.
Thanks, much needed!
Will keep that in mind :)
> That would be awesome!
> We can achieve it with STT
Exactly, I already have working code using mlx-whisper. I'll create a PR shortly.
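A minimal sketch of that STT step, assuming the mlx-whisper package and its `transcribe(path, path_or_hf_repo=...)` entry point; the function names and the model repo default here are hypothetical, not the actual PR code.

```python
def normalize_transcript(text):
    """Collapse whitespace so the reference text is a single clean line."""
    return " ".join(text.split())

def auto_ref_text(ref_audio_path, model_repo="mlx-community/whisper-large-v3-turbo"):
    """Hedged sketch: transcribe the reference clip with mlx-whisper so a
    user only has to supply --ref_audio and ref_text is filled in for them.
    The repo name above is an assumption; any mlx-whisper checkpoint works.
    """
    import mlx_whisper

    result = mlx_whisper.transcribe(ref_audio_path, path_or_hf_repo=model_repo)
    return normalize_transcript(result["text"])
```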
Adding instructions to the README, including the --ref_audio and --ref_text parameters, would be great. I'll try it later.
@charmaineem here is a great section for the docs: inference examples for each model.