MetaVoice?

Open groovybits opened this issue 1 year ago • 11 comments

MetaVoice looks amazing as a TTS: it allows instant one-shot training on any voice (sounds too good to be true).

https://github.com/metavoiceio/metavoice-src/issues/1

It has some issues with MPS, of course, and it would be nice to have it in candle. Is this technically possible?

I haven't looked closely. I could try it myself, but I suspect it's a big job? Putting it on the radar if it isn't already, since I really need this :D and many others certainly do too: a "good" TTS that is fully open/free is a missing piece, especially one that runs in Rust like this!

groovybits avatar Feb 14 '24 15:02 groovybits

We're certainly lacking a good TTS example at the moment, as pointed out in #1428 (we already cover speech to text with whisper, and both image to text and text to image). I started putting up a musicgen example but didn't finish it. It's based on encodec, which metavoice also uses, so I might well resume work on this. I think it's actually a bit of work as it's a new type of modality, but probably not impossible either.

LaurentMazare avatar Feb 14 '24 16:02 LaurentMazare

Nice, yes, I love musicgen too, so that sounds amazing!

groovybits avatar Feb 14 '24 16:02 groovybits

https://github.com/RVC-Boss/GPT-SoVITS also seems very good at voice generation.

tlightsky avatar Feb 22 '24 05:02 tlightsky

An initial version of metavoice is now available (#1717); you can give it a shot with this example. Please let us know how it goes. Note that speaker embeddings are not available at the moment, so no voice cloning, and that quality can probably be improved.

LaurentMazare avatar Mar 02 '24 20:03 LaurentMazare

Just tried it out on my M2 (CPU). It took about a minute, but it works!

phudtran avatar Mar 02 '24 21:03 phudtran

Exciting amazing progress!

I seem to get a failure with both CPU and Metal. With CPU, as you can see below, it prints some information but then shows no CPU/GPU activity and sits there forever not doing anything. With Metal it errors out about a missing function...

MacBook-Pro:candle christi$ cargo run --example metavoice --release -- --prompt "This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model." --out-file out.wav --tracing


    Finished release [optimized] target(s) in 0.32s
     Running `target/release/examples/metavoice --prompt 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.' --out-file out.wav --tracing`
avx: false, neon: false, simd128: false, f16c: false
Running on CPU, to run on GPU, build this example with `--features cuda`
prompt: 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.'
[2133, 2153, 2320, 2388, 2307, 2434, 2158, 2160, 2328, 2305, 2150, 2169, 2165, 2327, 2311, 2456, 2150, 2419, 2452, 2428, 2377, 2146, 2135, 2160, 2355, 2150, 2094, 2098, 2115, 2093, 2399, 2313, 2161, 2325, 2094, 2164, 2483, 2374, 2323, 2514, 2487, 2380, 2307, 2166, 2149, 2154, 2160, 2321, 2160, 2149, 2150, 2157, 2095, 2561]
^C
MacBook-Pro:candle christi$ cargo run --example metavoice --release --features metal -- --prompt "This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model."


    Finished release [optimized] target(s) in 0.92s
     Running `target/release/examples/metavoice --prompt 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.'`
avx: false, neon: false, simd128: false, f16c: false
prompt: 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.'
[2133, 2153, 2320, 2388, 2307, 2434, 2158, 2160, 2328, 2305, 2150, 2169, 2165, 2327, 2311, 2456, 2150, 2419, 2452, 2428, 2377, 2146, 2135, 2160, 2355, 2150, 2094, 2098, 2115, 2093, 2399, 2313, 2161, 2325, 2094, 2164, 2483, 2374, 2323, 2514, 2487, 2380, 2307, 2166, 2149, 2154, 2160, 2321, 2160, 2149, 2150, 2157, 2095, 2561]
Error: Metal error Error while loading function: "Function 'cast_bf16_f32' does not exist"

Caused by:
    Error while loading function: "Function 'cast_bf16_f32' does not exist"
Thank you!

groovybits avatar Mar 02 '24 21:03 groovybits

Metal isn't supported yet, but for CPU it also took a bit of time for me. You just gotta let it run, it will finish eventually.

phudtran avatar Mar 02 '24 21:03 phudtran

Ah yes, I see that now after quite a while. Yes, it works here too, thank you!

Running on CPU, to run on GPU, build this example with `--features cuda`
prompt: 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.'
[2133, 2153, 2320, 2388, 2307, 2434, 2158, 2160, 2328, 2305, 2150, 2169, 2165, 2327, 2311, 2456, 2150, 2419, 2452, 2428, 2377, 2146, 2135, 2160, 2355, 2150, 2094, 2098, 2115, 2093, 2399, 2313, 2161, 2325, 2094, 2164, 2483, 2374, 2323, 2514, 2487, 2380, 2307, 2166, 2149, 2154, 2160, 2321, 2160, 2149, 2150, 2157, 2095, 2561]
text ids len: 55
sampling from logits...
codes: [[[1109, 1129, 1296, ...,  738,  408, 1024],
  [1024, 1024, 1024, ...,  913,  424, 1024],
  [1024, 1024, 1024, ...,  786,   36, 1024],
  ...
  [1024, 1024, 1024, ...,  881, 1011, 1024],
  [1024, 1024, 1024, ..., 1015,  853, 1024],
  [1024, 1024, 1024, ..., 1019,  948, 1024]]]
Tensor[[1, 8, 538], u32]
text_ids len: 54
audio_ids shape: [1, 8, 483]
output pcm shape: [1, 1, 154930]
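
For anyone reading the shapes in the log above: the numbers appear to line up as total sequence length minus the first "text ids len" giving the audio frame count (538 - 55 = 483). Here is a minimal sketch of that arithmetic. The function names are hypothetical, and the 24 kHz sample rate is an assumption that is not shown anywhere in the log, so treat the duration as an estimate only.

```rust
/// ASSUMPTION: output sample rate in Hz. The log does not state it;
/// 24 kHz is a guess used only to estimate the clip duration.
const ASSUMED_SAMPLE_RATE_HZ: f64 = 24_000.0;

/// Hypothetical helper: number of audio frames left after dropping the
/// text-prompt positions from the generated sequence.
/// Returns None if the text length exceeds the total length.
fn audio_frames(total_len: usize, text_ids_len: usize) -> Option<usize> {
    total_len.checked_sub(text_ids_len)
}

/// Hypothetical helper: clip duration in seconds for a mono PCM buffer,
/// under the assumed sample rate above.
fn duration_secs(pcm_samples: usize) -> f64 {
    pcm_samples as f64 / ASSUMED_SAMPLE_RATE_HZ
}
```

With the values from the log, `audio_frames(538, 55)` matches the reported `audio_ids` length of 483, and `duration_secs(154930)` would come out at roughly 6.5 seconds if the 24 kHz assumption holds.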

groovybits avatar Mar 02 '24 22:03 groovybits

Yeah, it takes a bit of time to get the generation back. Maybe we should have a progress bar or some other way to know that the process is not stuck. I'm also looking at getting this to run on Metal, though currently it doesn't seem to bring much speedup on an M2.
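
As a rough illustration of the "some other way to know that the process is not stuck" idea, here is a std-only sketch that prints a progress line to stderr every few steps. This is not the example's actual code: `run_with_progress`, `progress_line`, and the closure standing in for one sampling step are all hypothetical names.

```rust
use std::io::Write;
use std::time::{Duration, Instant};

/// Hypothetical helper: formats one progress line for a generation loop.
fn progress_line(step: usize, max_steps: usize, elapsed: Duration) -> String {
    let secs = elapsed.as_secs_f64();
    // Avoid dividing by zero on the very first report.
    let rate = if secs > 0.0 { step as f64 / secs } else { 0.0 };
    format!("step {step}/{max_steps} ({rate:.1} steps/s)")
}

/// Hypothetical driver: calls `step_fn` up to `max_steps` times and reports
/// progress to stderr every `every` steps. `step_fn` returns false to stop
/// early (for instance when an end-of-sequence token is sampled).
/// Returns the number of steps actually executed.
fn run_with_progress<F: FnMut(usize) -> bool>(
    max_steps: usize,
    every: usize,
    mut step_fn: F,
) -> usize {
    let start = Instant::now();
    let mut done = 0;
    for step in 0..max_steps {
        let keep_going = step_fn(step);
        done = step + 1;
        if done % every.max(1) == 0 || !keep_going || done == max_steps {
            // '\r' rewrites the same terminal line instead of scrolling.
            eprint!("\r{}", progress_line(done, max_steps, start.elapsed()));
            let _ = std::io::stderr().flush();
        }
        if !keep_going {
            break;
        }
    }
    eprintln!();
    done
}
```

A caller would wrap its sampling loop body in the closure, e.g. `run_with_progress(2000, 10, |_step| true)`, so the terminal keeps updating even when each step is slow on CPU.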

LaurentMazare avatar Mar 02 '24 22:03 LaurentMazare

Very fast now with Metal on an M2 Ultra! Amazing job, thank you :)

chris@earth candle % time cargo run --example metavoice --release --features=metal -- --prompt "hi how are you today"
    Finished release [optimized] target(s) in 0.15s
     Running `target/release/examples/metavoice --prompt 'hi how are you today'`
avx: false, neon: true, simd128: false, f16c: false
prompt: 'hi how are you today'
[2153, 2154, 2337, 2352, 2476, 2371, 2327, 2149, 2376, 2561]
text ids len: 11
sampling from logits...
codes: [[[1129, 1130, 1313, 1328, 1452, 1347, 1303, 1125, 1352, 1537,  780,  537,  798,
    499,   91,   70,  112,  949,  949,  945,  945,  344,  561,  770,  182,  784,
    984,  793,  414,  793,  983,  890,   23,  598,  321,  224,  136,  432,  860,
    598,  224,  491,  835, 1019,   25,  619,   25,  904,  321,  224, 1019,  876,
   1019,  420,  751,  813,  368,  683,  683,  495,  402, 1022, 1022,  402, 1001,
    495,  967,  136,  651,  491,  136,  976,  491,  430,  855, 1019,  738,  855,
    106,  106,  738,  106,  106,  106, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,  601,  228,  870,
    870, 1007,  945,  897,  242,  760,  264,  961,  559,  399,  438,  279,  561,
    441,  626,  269,  475,  211,  502,  726,  165,  962,  664,  673,  826,  519,
    588,  897,  265,  974,  928,  860,  144,   81,  460,  579,  259,  941,  765,
    544,  144,  947,   36,  679,  801,  549,  796,  549,  422,   36,  801,   36,
     36,  144,  792,  920,  510,  801,  519,  942,  687,  519,  404,  363,  404,
    942,  913,  518,  913,  424,  363, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,  852,  508,  432,
    159,  728, 1022,  728,  956,  443,  845,  593,   77,  650,  166,  866,  812,
     96,  176,  644,  673,  647,  119,  587,   24,  818,  842,  518,  308,  915,
    675,  818,  653,  879,  710, 1000,  590,  601,  970,  204,  185,  426,  710,
    915,  907,  287,  636,  773,  946,  111,  564,  638,  564,  828,  564,  998,
    853,  775,  237,  518,   93,  859,  832,  406, 1000,  829,  879, 1007,   36,
     36,  710,   36,  982,  653, 1015, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,  866,  830,  318,
    730,  601,  632,   26,  152,  730,   75,  236,  798,  537,  161,  267,  286,
    923,  575,  915,  914,  197,  993,  119,  190,  776,  614,  993,  558,  388,
    364,  255, 1016,   74,  734,  288,  522,  926,  278,   61,  529,  919,   74,
    859,  841,  471,  277,  605,  796,  970,  810,  272,  345,  353,  242,  901,
    589,  933,  878,  853,  557, 1016,  960,  443,  961,  793,  838,  962, 1022,
    866,  838,  741,  956,  673,  956, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,  528, 1007,  944,
    617,  756,  676,  467,  971,  164,  502,  959,  446,  842,  452,  483,  846,
    246,  410,  493,  433,  335,  302,  317,  907, 1003,  838, 1003,  658,  154,
     39,  909,  446,  862,  804,  375,  667,  373,  616,  983,  113,  882,  736,
    454,    8,  163,  893,  899,  993,  872,  866,  551,  108,  615,   78,   63,
    822,  959,  969,  397,   90,  313, 1017,  111,  357,  413,  111,  528,  882,
    622,  606,  375,  882,  904,  528, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,  701,  489,  907,
    975,   89,  960,   21,  112,  751,  905,  372,  634,  805,  112,  932,  868,
    100,  266,  501,  477,  602,   57,  253,  624,  519,  388,  611,  669,  918,
    505,   10,  238,  632,  640,  701,   96,  236,  982,  350,  704,  632,   10,
    851,  606,  880,  448,  147,  907,  658,  805,  278,  982,  621,  956,  690,
    466,  760,  757,  828,  958,  768,  314,  461,  238,  461,  995,  982, 1011,
    929,  435,   41,  986,  435, 1011, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1015,  380,  596,
    493,  685,  519,  728,  630,  581,  685,  770,  164,  152,  173,  786,  435,
    648,  720,  585,  845,  694,  647,  971,  243, 1008,  496,  579,  620,  764,
    444,  188,  994,  390,  786,  983,  632,  866,  365,  586,  928,  291,  782,
   1015,  586,  940,  718,  576,  399,  682,   16,  295,  877,  581,  402,   67,
    383,  820,  360,   28,  416,   45,  496,  675,  480,  887,  853,  291,  291,
    887,  772, 1002,  748,  900,  570, 1024],
  [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,  899,  117,  562,
    544,  766,  647,  339,   23,  125,  639,  758,  810,  636,  638,  191,  366,
    520,  288,  679,   65,  458,  968, 1019,  660,  160,  343,  701,  233,  615,
    204,  884,  562,  818,  835,  468,  529,  878,  429,  429,  472,  828,  475,
    947,  591,  777,  688,  650,  892,  458,  541,  799,  778,  791,  383,  505,
      2,  961,  737,  669,  416,  660,  401,  660,  835,  989, 1019, 1012, 1019,
    475,  975,  931,  383,  475,  975, 1024]]]
Tensor[[1, 8, 85], u32, metal:4294968481]
text_ids len: 10
audio_ids shape: [1, 8, 74]
output pcm shape: [1, 1, 24050]
cargo run --example metavoice --release --features=metal -- --prompt   2.12s user 3.63s system 75% cpu 7.648 total

metavoice.wav.gz

The audio is a bit wavy, which I understand is expected since that part is still in progress :)

groovybits avatar Mar 03 '24 10:03 groovybits

Hey,

Can you please help me? I'm trying it without installing xformers and q=q.half(), and it gets stuck while running, as shown in the screenshot, without any error. I'm using a Mac M2 with CPU.

Screenshot 2024-07-23 at 9 24 15 PM

kaushalag29 avatar Jul 24 '24 02:07 kaushalag29