mlx-audio
Is it possible to know or highlight which word is being spoken?
For instance, after generating a paragraph, I would like to know when each word starts in time. Take this text: "The boy was there when the sun rose. A rod is used to catch pink salmon." Along with the audio, I would like to get the start time of each word, something like:
The: 0.0s, boy: 0.5s, was: 1.2s, there: 2.0s, etc.
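To make the desired output concrete, here is a rough sketch of the data shape I have in mind. It assumes the model could hand back a duration in seconds for each word, which is purely an assumption on my part; `WordTiming` and `word_timings` are hypothetical names for illustration, not part of mlx-audio or kokoro.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class WordTiming:
    word: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


def word_timings(words, durations):
    """Turn per-word durations (seconds) into start/end times.

    Hypothetical helper: assumes `durations[i]` is how long `words[i]`
    is spoken, with no gaps between words.
    """
    if len(words) != len(durations):
        raise ValueError("words and durations must have the same length")
    ends = list(accumulate(durations))   # running total = end of each word
    starts = [0.0] + ends[:-1]           # each word starts where the previous ended
    return [
        WordTiming(w, round(s, 3), round(e, 3))
        for w, s, e in zip(words, starts, ends)
    ]


# Made-up durations, chosen only to match the example above.
timings = word_timings(["The", "boy", "was", "there"], [0.5, 0.7, 0.8, 0.4])
for t in timings:
    print(f"{t.word}: {t.start}s")
```

A list of records like this would be enough for highlighting the word currently being spoken during playback.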
I'm trying to use the kokoro model. Are there other lightweight models available on MLX that would be able to do this?
I found this issue in the kokoro repo, but I'm not sure whether it's implemented: https://github.com/hexgrad/kokoro/issues/32