Memory Usage in Kokoro
I'm trying to incorporate MLX-Audio into AIChat, which generates a conversation between two LLMs as speech. However, memory consumption goes up and up indefinitely as I generate more. I don't have this problem with kokoro-onnx. I'm not sure if it's my code or mlx-audio, but it doesn't seem to release the previous generation from memory after I append the samples to an array. Here's the minimal code I can use to reproduce it.
from mlx_audio.tts.utils import load_model
import numpy as np
import codecs
import random
import json
import soundfile as sf

def random_pause(min_duration=0.5, max_duration=1.0, sample_rate=24000):
    # Random silence inserted between turns.
    silence_duration = random.uniform(min_duration, max_duration)
    silence = np.zeros(int(silence_duration * sample_rate))
    return silence

def generate_audio(text, voice, l, r, speed=1.0, lang="a", sample_rate=24000):
    res = tts.generate(
        text=text,
        voice=voice,
        speed=speed,
        lang_code=lang,
        temperature=0.7,
        verbose=False,
    )
    samples = list(res)[0].audio
    pause = random_pause(sample_rate=sample_rate)
    samples = np.concatenate([samples, pause])
    # Pan into stereo with per-speaker left/right gains.
    samples = np.column_stack((samples * l, samples * r))
    return samples

tts = load_model("mlx-community/Kokoro-82M-bf16")
a_voice = "af_sky"
b_voice = "af_heart"
chat = json.load(codecs.open("chat.json", "r", "utf-8"))
wavs = []
for i in range(0, len(chat), 2):
    content = chat[i]["content"]
    print(a_voice, content)
    samples = generate_audio(content, a_voice, 0.8, 1.0)
    wavs.append(samples)
    content = chat[i + 1]["content"]
    print(b_voice, content)
    samples = generate_audio(content, b_voice, 1.0, 0.8)
    wavs.append(samples)
wav = np.concatenate(wavs)
sf.write("podcast.wav", wav, 24000)
I'm also attaching the chat history that you can use to reproduce.
Thanks!
If you do del res before returning from generate_audio(), does it fix it? You could also try a direct np.array(samples) cast to force evaluation and make sure you aren't holding a reference to the underlying mx.array.
In theory the function scope would control your references to the MLX graph, but there could be a lingering reference somewhere since you're persisting all the downstream audio segments.
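A minimal sketch of that suggestion applied to the generate_audio above (np.array(...) evaluates and copies into a plain NumPy array, so no reference to the underlying mx.array escapes the function):

def generate_audio(text, voice, l, r, speed=1.0, lang="a", sample_rate=24000):
    res = tts.generate(
        text=text,
        voice=voice,
        speed=speed,
        lang_code=lang,
        temperature=0.7,
        verbose=False,
    )
    # Force evaluation and copy to host memory.
    samples = np.array(list(res)[0].audio)
    del res  # drop the generator and any graph references before returning
    pause = random_pause(sample_rate=sample_rate)
    samples = np.concatenate([samples, pause])
    return np.column_stack((samples * l, samples * r))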
In addition to what Lucas said, you can also call mx.metal.clear_cache() at the end of each generation.
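For example, in the OP's loop (note the import: the function lives under mlx.core, not the top-level mlx package; newer MLX releases expose the same call as mx.clear_cache()):

import mlx.core as mx

for i in range(0, len(chat), 2):
    wavs.append(generate_audio(chat[i]["content"], a_voice, 0.8, 1.0))
    wavs.append(generate_audio(chat[i + 1]["content"], b_voice, 1.0, 0.8))
    mx.metal.clear_cache()  # return MLX's cached GPU buffers after each turn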
Yes, he is accumulating results without releasing resources.
In regard to evaluation, on v0.0.2 we added eval on Kokoro.
I changed to:
audio = list(res)[0].audio
samples = np.array(audio)
del res
del audio
However, it's still going up like crazy. Also, I tried mlx.metal.clear_cache(), but it says mlx doesn't have attribute metal. I do collect the samples returned from generate_audio into a list and assemble all the generated samples into one wav file at the end. However, collecting them doesn't cause a problem in kokoro-onnx.
https://ml-explore.github.io/mlx/build/html/search.html?q=clear+cache#
Can you share the text you are generating?
Also try using generate.py with the --join_audio flag and see if the same thing happens.
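Something like this, reusing the model and voice from the reproduction script (the exact flags depend on the version of generate.py you have; --join_audio is meant to concatenate the generated segments into a single file):

mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello there!" --voice af_sky --join_audio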
Yes, if you just use the chat history JSON file that I attached in the first post, you should be able to reproduce it. Thanks!
Ah, it's mlx.core.metal.clear_cache(). It kind of works: memory consumption goes back down after a generation, but each generation takes more memory. For example: generation 1: 16GB-22GB, generation 2: 16GB-26GB, generation 3: 16GB-30GB, generation 4: 16GB-34GB... I just made up those numbers to illustrate the trend; they're not the actual numbers I saw. Thanks for your help!
Awesome!
We will take a look and address it in future releases
I think this is something you might want to address soon. I looked into it further, and the memory consumption differs significantly compared to kokoro-onnx.
When I use mlx-audio, memory usage increases rather quickly. Even though the peak memory usage reported by model.generate() shows around 4.5GB per generation, Activity Monitor shows total used memory jumping from 10GB to 50GB. Wired memory also climbs to about 40GB after only a few generations.
This is with mlx.core.metal.clear_cache(); without clearing the cache, my computer becomes unusable very quickly after a few generations.
In contrast, running the same script with kokoro-onnx, total memory stays around 16GB and increases very slowly as the generated audio accumulates, by exactly as much as I'd expect the total audio to take. Wired memory stays steady at around 5.5-7GB depending on the generation.
I'm using the largest kokoro-onnx model (about 325MB), so I believe the precision should be comparable.
Thanks again!
FYI, here's a screen recording. https://github.com/user-attachments/assets/0d14a67e-76d0-49dc-a1f2-4bbfe2ac9de2 Hope that helps. Thanks.
Sure, will look into it.
@chigkim Try it in the latest with https://github.com/Blaizzy/mlx-audio/pull/66 -- I didn't see any swapping when running your test script with that change.
Thanks @lucasnewman! I just tried the latest commit, which includes #66. However, my memory consumption still slowly creeps up to 50GB according to Activity Monitor. You don't see that behavior? I think it may go up more slowly than before, but it's still way more than it should be, especially since the peak memory usage reported by model.generate() only shows around 4.5GB.
The peak memory metric doesn't include the cache size -- it's possible it's recompiling too often and growing the cache or something. You can check with mx.get_cache_memory().
On the other hand, utilizing a bunch of otherwise available memory for a cache seems like intended behavior as long as it doesn't push the machine into swapping to disk -- does it eventually fail for you or is it just the memory metrics you find alarming?
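A quick way to watch those numbers per generation, assuming a recent MLX where the memory helpers are exposed at the top level (older releases keep them under mx.metal):

import mlx.core as mx

samples = generate_audio(content, a_voice, 0.8, 1.0)
print(f"active={mx.get_active_memory() / 2**30:.2f} GB  "
      f"cache={mx.get_cache_memory() / 2**30:.2f} GB  "
      f"peak={mx.get_peak_memory() / 2**30:.2f} GB")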
I have the same issue. Installed mlx-audio just now.
Started the server with mlx_audio.server --host 0.0.0.0 --port 9000.
Put a non-trivial amount of text into the text box:
Once upon a time, in a far corner of the universe, there was a small, lonely star named Twinkle. While other stars clustered together in constellations and galaxies, Twinkle floated alone in the vast darkness of space.
Every night, Twinkle would shine as brightly as possible, hoping to catch the attention of passing comets or distant planets. But no one ever seemed to notice the little star's efforts.
One day, a curious young comet named Zip zoomed by. Twinkle, excited at the prospect of company, called out, "Hello! Would you like to be friends?"
Zip, surprised by the star's friendliness, slowed down and replied, "I've never had a star for a friend before. That sounds lovely!"
Hit "Generate Speech", wait till it starts reading, change voice, hit again... Repeat 7-10 times and your memory is gone. Each generation bumps RAM consumption by 5GB. Had to reboot my machine, with 256 GB RAM... I had already other models in RAM, but still... That caught me by surprise.
mlx 0.24.1
mlx-audio 0.0.3
mlx-lm 0.22.2
mlx-vlm 0.1.21
Restarting the server reclaims the memory.
I just tried installing https://github.com/hexgrad/kokoro locally and ran the "demo" folder.
That was not hard to do, and speech generation in the Gradio UI on the "slow" CPU still seems to be faster than with mlx-audio on my machine. Not sure why that is, but I can generate 24 seconds of audio in 2.5 seconds, so slightly less than 10x realtime (maybe 9.5x). The mlx-audio server reported 0.25 realtime speed, so 4x? Also, there are no memory issues with the official demo (although it does not support paragraphs in the input, so one has to remove all the newlines to squeeze in more content).
Maybe this is somehow influenced by the fact that I use a Mac Studio M3 Ultra, but I honestly expected the MLX version to be much more optimized than the plain CPU version of Kokoro. Something seems really fishy...
I'm glad to run some more tests to understand the root cause; just tell me what should be changed. A rough difference of 2x between CPU and GPU (with the CPU being faster) just does not make much sense...
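For reference on those realtime factors: 24 s of audio in 2.5 s is 24 / 2.5 ≈ 9.6x, and if the server's 0.25 means seconds of wall time per second of audio, that is 1 / 0.25 = 4x. A minimal timing sketch against the OP's reproduction script (generate_audio and the 24 kHz rate come from that script):

import time

t0 = time.perf_counter()
samples = generate_audio("Testing generation speed.", "af_sky", 1.0, 1.0)
elapsed = time.perf_counter() - t0
audio_seconds = len(samples) / 24000  # stereo frames at 24 kHz
print(f"{audio_seconds / elapsed:.1f}x realtime")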
Hey @mindreframer and @chigkim
I looked into it and I managed to replicate and somewhat reduce the issue.
But in general I still see 5-7GB bumps for each chunk, as reported. With a small fix I noticed it going back down to 600-800MB after each chunk is generated.
I will look into it and fix it ASAP.
Hey @chigkim
Fixed the issue here #164
The intermediate and final results were being stored in the cache memory, which made it blow up every iteration.
Using mx.clear_cache() after every audio chunk generated resolves this issue by regularly freeing up that cached memory.
Thanks so much! The memory issue is definitely fixed; my memory consumption only goes up to 24GB!!! I'm not sure if it's related to this fix or the sampling-rate fix, but Kokoro now generates very poor audio. It sounds like really bad resampling. Does it now resample the output? It has a very ringing pitch.
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello there! How's it going?" --voice "af_sky"
https://github.com/user-attachments/assets/ac322afd-79bb-4845-9696-fed6337a4638
Could you look into this? Thanks!
My pleasure!
The Kokoro issue was fixed here #166.
It was a tiny bug in our effort to deduplicate and consolidate the API across all modules (codec, stt, tts, and sts).
Just pull the latest changes; they should be available in a release later today.
Haha you guys are so fast! Thanks!!!
Thanks, we move fast!
Btw v0.2.2 is out alongside support for the new Outetts-v1-0.6B :)
pip install -U mlx-audio