Is it possible to know or highlight which word is being spoken?

Open sickerin opened this issue 6 months ago • 1 comments

For instance, after generating a paragraph, I would like to know when each word starts in the audio. Take these sentences: "The boy was there when the sun rose. A rod is used to catch pink salmon." Along with the audio, I would like to get the start time of each word, for example:

The 0.0s, boy 0.5s, was 1.2s, there 2.0s, etc.
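To make the request concrete, here is a minimal sketch of the output shape I have in mind. It assumes the model can report a duration for each word, which I have not confirmed for kokoro. `WordTiming` and `word_start_times` are hypothetical names, and the durations are made-up values chosen to match the example above.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    word: str
    start_s: float  # start time in seconds from the beginning of the audio


def word_start_times(words, durations_s):
    """Turn per-word durations (seconds) into cumulative start times."""
    if len(words) != len(durations_s):
        raise ValueError("words and durations_s must have the same length")
    timings = []
    elapsed = 0.0
    for word, duration in zip(words, durations_s):
        # Round to milliseconds to hide floating-point accumulation noise.
        timings.append(WordTiming(word, round(elapsed, 3)))
        elapsed += duration
    return timings


# Made-up durations for illustration only; a real model would supply these.
words = ["The", "boy", "was", "there"]
durations_s = [0.5, 0.7, 0.8, 0.6]

for timing in word_start_times(words, durations_s):
    print(f"{timing.word} {timing.start_s:.1f}s")
```

With start times in this form, highlighting the spoken word during playback comes down to finding the last entry whose `start_s` is less than or equal to the current playback position.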

I'm trying to use the kokoro model. Are there other lightweight models available on mlx that would be able to do this?

Jun 13 '25 15:06 sickerin

I found this related issue in kokoro, but I'm not sure whether it's implemented here: https://github.com/hexgrad/kokoro/issues/32

Aug 10 '25 03:08 sickerin