
I created a working Dockerfile to build it consistently (and on my system)...

Open groovybits opened this issue 1 year ago • 23 comments

Hi, I just created this Dockerfile to make building easier (and to make it possible on my system, a Mac, where Python seems unhappy)...

I am wondering whether this could be included and improved upon. I haven't gotten it running locally yet, but at least it builds for me...

Branch with changes: https://github.com/metavoiceio/metavoice-src/compare/main...groovybits:metavoice-src:docker

Also, which port does it use, and how do I use it? Is there more information on running it locally as an HTTP server, if that is what it does?

Thanks! I am reading up / exploring as I can since this looks amazing :)

Updated: builds in docker compose now, not sure if "working" yet (need to look closer).

groovybits avatar Feb 12 '24 20:02 groovybits

sure thing, will review it tomorrow morning

sidroopdaska avatar Feb 13 '24 00:02 sidroopdaska

I think it's port 58003. Also, you should use pip install -e . for the app package to install correctly.

(Still, personally, I cannot currently make this work because of other errors.)

djmaze avatar Feb 13 '24 00:02 djmaze

Hey @djmaze, can you share some details

  1. GPU being used
  2. Python version
  3. Error you are facing

sidroopdaska avatar Feb 13 '24 11:02 sidroopdaska

I had to increase the RAM allocated to the Docker server quite a bit, so now it loads (I was getting a 137 exit code until I fixed that by increasing the RAM for Docker).
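For reference, exit codes above 128 conventionally mean the process was terminated by a signal (the code minus 128). Below is a minimal sketch, assuming a POSIX host, of decoding such codes; `describe_exit_code` is a hypothetical helper and not part of Docker or metavoice-src.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a shell-style exit code.

    By shell convention, a code above 128 means the process was
    terminated by signal number (code - 128).
    """
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            # Signal number not defined on this platform.
            return f"exit code {code} (unknown signal {code - 128})"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
print(describe_exit_code(139))
```

Code 137 maps to SIGKILL, which is the signal the Linux out-of-memory killer sends, so it is consistent with the fix of giving Docker more memory. Code 139 maps to SIGSEGV, a segmentation fault.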

Here is the output now... (needed to add git to the installed programs then it works...)

Calls to it on port 58003 seem to work overall now (I changed docker compose to map 58003:58003 and set the server to listen on 0.0.0.0 to allow access from the host system). Going to see how to use it now :) Does it actually use the GPU via MPS this way, or does that need something Docker can't do for the Metal GPU? I am still trying to understand that aspect of this...
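As a rough sketch of how the server could be called from the host once port 58003 is mapped: the `/tts` path and the field names below are taken from the request logged later in this thread. That the body is sent as JSON, and the format of the response, are assumptions. `build_tts_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_tts_request(
    text: str,
    host: str = "localhost",
    port: int = 58003,
    speaker_ref_path: str = "https://cdn.themetavoice.xyz/speakers/bria.mp3",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /tts endpoint."""
    payload = {
        "text": text,
        "guidance": [3.0, 1.0],  # a pair, not a single float
        "top_p": 0.95,
        "speaker_ref_path": speaker_ref_path,
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/tts",
        data=json.dumps(payload).encode("utf-8"),  # assumption: JSON body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_tts_request("Test 123")
print(req.full_url)

# Sending it requires the server to be running:
# with urllib.request.urlopen(req) as resp:
#     audio_bytes = resp.read()  # assumption: the response body is audio
```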

Using Mac M2 Ultra with Python 3.11

(.venv) chris@earth metavoice-src % docker compose up --build
[+] Building 146.5s (12/12) FINISHED                                                                                                                  docker:desktop-linux
 => [metavoice-server internal] load build definition from Dockerfile                                                                                                 0.0s
 => => transferring dockerfile: 920B                                                                                                                                  0.0s
 => [metavoice-server internal] load metadata for docker.io/library/python:3.11-slim                                                                                  0.4s
 => [metavoice-server internal] load .dockerignore                                                                                                                    0.0s
 => => transferring context: 2B                                                                                                                                       0.0s
 => [metavoice-server 1/7] FROM docker.io/library/python:3.11-slim@sha256:0b7568850e62c6405d15098486b8fc23d9d6a845b112b866aceaf65fbded2fc1                            0.0s
 => [metavoice-server internal] load build context                                                                                                                    0.8s
 => => transferring context: 3.26MB                                                                                                                                   0.8s
 => CACHED [metavoice-server 2/7] RUN apt-get update && apt-get install -y     ffmpeg     ninja-build     g++     build-essential     libomp-dev     && rm -rf /var/  0.0s
 => CACHED [metavoice-server 3/7] WORKDIR /app                                                                                                                        0.0s
 => [metavoice-server 4/7] COPY . .                                                                                                                                   1.1s
 => [metavoice-server 5/7] RUN pip install --no-cache-dir "torch>=2.1.0"                                                                                              9.5s
 => [metavoice-server 6/7] RUN MAX_JOBS=1 pip install --no-cache-dir -r requirements.txt                                                                            129.1s
 => [metavoice-server 7/7] RUN pip install -e .                                                                                                                       3.6s
 => [metavoice-server] exporting to image                                                                                                                             2.0s
 => => exporting layers                                                                                                                                               2.0s
 => => writing image sha256:f3f4f48ddd7728be8b77eb04b70a81c00cd452dd9f7dbff307fa2ea091e7bb51                                                                          0.0s
 => => naming to docker.io/library/metavoice-src-metavoice-server                                                                                                     0.0s
[+] Running 1/0
 ✔ Container metavoice  Recreated                                                                                                                                     0.0s
Attaching to metavoice
metavoice  | /usr/local/lib/python3.11/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
metavoice  |   from torchaudio.backend.common import AudioMetaData
metavoice  | /app/fam/llm/layers/attn.py:14: UserWarning: flash_attn not installed, make sure to replace attention mechanism with torch_attn
metavoice  |   warnings.warn("flash_attn not installed, make sure to replace attention mechanism with torch_attn")
Fetching 6 files:   0% 0/6 [00:00<?, ?it/s]downloading https://huggingface.co/metavoiceio/metavoice-1B-v0.1/resolve/bf8f51bb9c3c508987b37f3197e85ea93f42475e/second_stage.pt to /.hf-cache/hub/tmpe_usumgi
metavoice  | downloading https://huggingface.co/metavoiceio/metavoice-1B-v0.1/resolve/bf8f51bb9c3c508987b37f3197e85ea93f42475e/speaker_encoder.pt to /.hf-cache/hub/tmpnt4gjkty
metavoice  | downloading https://huggingface.co/metavoiceio/metavoice-1B-v0.1/resolve/bf8f51bb9c3c508987b37f3197e85ea93f42475e/.gitattributes to /.hf-cache/hub/tmpqt19yl5p
metavoice  | downloading https://huggingface.co/metavoiceio/metavoice-1B-v0.1/resolve/bf8f51bb9c3c508987b37f3197e85ea93f42475e/config.json to /.hf-cache/hub/tmputo3939d
metavoice  | downloading https://huggingface.co/metavoiceio/metavoice-1B-v0.1/resolve/bf8f51bb9c3c508987b37f3197e85ea93f42475e/README.md to /.hf-cache/hub/tmpq4thb9ry
.gitattributes: 100% 1.52k/1.52k [00:00<00:00, 15.5MB/s]
config.json: 100% 39.0/39.0 [00:00<00:00, 500kB/s]]
README.md: 100% 2.67k/2.67k [00:00<00:00, 26.2MB/s]
metavoice  | 0% 0.00/2.67k [00:00<?, ?B/s]               downloading https://huggingface.co/metavoiceio/metavoice-1B-v0.1/resolve/bf8f51bb9c3c508987b37f3197e85ea93f42475e/first_stage.pt to /.hf-cache/hub/tmpicnsur8a00, 54.5MB/s]
speaker_encoder.pt: 100% 17.1M/17.1M [00:00<00:00, 25.9MB/s]
second_stage.pt: 100% 57.9M/57.9M [00:00<00:00, 71.8MB/s]
first_stage.pt: 100% 4.97G/4.97G [00:47<00:00, 104MB/s]MB/s]
Fetching 6 files: 100% 6/6 [00:49<00:00,  8.17s/it]
metavoice  | number of parameters: 1239.00M
metavoice  | downloading https://huggingface.co/facebook/encodec_24khz/resolve/main/config.json to /.hf-cache/hub/tmp1itot20c
config.json: 100% 809/809 [00:00<00:00, 7.05MB/s]
metavoice  | loading configuration file config.json from cache at /.hf-cache/hub/models--facebook--encodec_24khz/snapshots/c1dbe2ae3f1de713481a3b3e7c47f357092ee040/config.json
metavoice  | Model config EncodecConfig {
metavoice  |   "_name_or_path": "ArthurZ/encodec_24khz",
metavoice  |   "architectures": [
metavoice  |     "EncodecModel"
metavoice  |   ],
metavoice  |   "audio_channels": 1,
metavoice  |   "chunk_length_s": null,
metavoice  |   "codebook_dim": 128,
metavoice  |   "codebook_size": 1024,
metavoice  |   "compress": 2,
metavoice  |   "dilation_growth_rate": 2,
metavoice  |   "hidden_size": 128,
metavoice  |   "kernel_size": 7,
metavoice  |   "last_kernel_size": 7,
metavoice  |   "model_type": "encodec",
metavoice  |   "norm_type": "weight_norm",
metavoice  |   "normalize": false,
metavoice  |   "num_filters": 32,
metavoice  |   "num_lstm_layers": 2,
metavoice  |   "num_residual_layers": 1,
metavoice  |   "overlap": null,
metavoice  |   "pad_mode": "reflect",
metavoice  |   "residual_kernel_size": 3,
metavoice  |   "sampling_rate": 24000,
metavoice  |   "target_bandwidths": [
metavoice  |     1.5,
metavoice  |     3.0,
metavoice  |     6.0,
metavoice  |     12.0,
metavoice  |     24.0
metavoice  |   ],
metavoice  |   "torch_dtype": "float32",
metavoice  |   "transformers_version": "4.37.2",
metavoice  |   "trim_right_ratio": 1.0,
metavoice  |   "upsampling_ratios": [
metavoice  |     8,
metavoice  |     5,
metavoice  |     4,
metavoice  |     2
metavoice  |   ],
metavoice  |   "use_causal_conv": true,
metavoice  |   "use_conv_shortcut": true
metavoice  | }
metavoice  |
metavoice  | downloading https://huggingface.co/facebook/encodec_24khz/resolve/main/model.safetensors to /.hf-cache/hub/tmpvc3lmcq6
model.safetensors: 100% 93.1M/93.1M [00:01<00:00, 91.1MB/s]
metavoice  | loading weights file model.safetensors from cache at /.hf-cache/hub/models--facebook--encodec_24khz/snapshots/c1dbe2ae3f1de713481a3b3e7c47f357092ee040/model.safetensors
metavoice  | /usr/local/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
metavoice  |   warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
metavoice  | All model checkpoint weights were used when initializing EncodecModel.
metavoice  |
metavoice  | All the weights of EncodecModel were initialized from the model checkpoint at facebook/encodec_24khz.
metavoice  | If your task is similar to the task the model of the checkpoint was trained on, you can already use EncodecModel for predictions without further training.
metavoice  | downloading https://huggingface.co/facebook/multiband-diffusion/resolve/main/mbd_comp_8.pt to /.hf-cache/hub/tmpfwh3_tk2
mbd_comp_8.pt: 100% 4.58G/4.58G [00:44<00:00, 103MB/s]
metavoice  | number of parameters: 14.07M
metavoice  | loading configuration file config.json from cache at /.hf-cache/hub/models--facebook--encodec_24khz/snapshots/c1dbe2ae3f1de713481a3b3e7c47f357092ee040/config.json
metavoice  | Model config EncodecConfig {
metavoice  |   "_name_or_path": "ArthurZ/encodec_24khz",
metavoice  |   "architectures": [
metavoice  |     "EncodecModel"
metavoice  |   ],
metavoice  |   "audio_channels": 1,
metavoice  |   "chunk_length_s": null,
metavoice  |   "codebook_dim": 128,
metavoice  |   "codebook_size": 1024,
metavoice  |   "compress": 2,
metavoice  |   "dilation_growth_rate": 2,
metavoice  |   "hidden_size": 128,
metavoice  |   "kernel_size": 7,
metavoice  |   "last_kernel_size": 7,
metavoice  |   "model_type": "encodec",
metavoice  |   "norm_type": "weight_norm",
metavoice  |   "normalize": false,
metavoice  |   "num_filters": 32,
metavoice  |   "num_lstm_layers": 2,
metavoice  |   "num_residual_layers": 1,
metavoice  |   "overlap": null,
metavoice  |   "pad_mode": "reflect",
metavoice  |   "residual_kernel_size": 3,
metavoice  |   "sampling_rate": 24000,
metavoice  |   "target_bandwidths": [
metavoice  |     1.5,
metavoice  |     3.0,
metavoice  |     6.0,
metavoice  |     12.0,
metavoice  |     24.0
metavoice  |   ],
metavoice  |   "torch_dtype": "float32",
metavoice  |   "transformers_version": "4.37.2",
metavoice  |   "trim_right_ratio": 1.0,
metavoice  |   "upsampling_ratios": [
metavoice  |     8,
metavoice  |     5,
metavoice  |     4,
metavoice  |     2
metavoice  |   ],
metavoice  |   "use_causal_conv": true,
metavoice  |   "use_conv_shortcut": true
metavoice  | }
metavoice  |
metavoice  | loading weights file model.safetensors from cache at /.hf-cache/hub/models--facebook--encodec_24khz/snapshots/c1dbe2ae3f1de713481a3b3e7c47f357092ee040/model.safetensors
metavoice  | All model checkpoint weights were used when initializing EncodecModel.
metavoice  |
metavoice  | All the weights of EncodecModel were initialized from the model checkpoint at facebook/encodec_24khz.
metavoice  | If your task is similar to the task the model of the checkpoint was trained on, you can already use EncodecModel for predictions without further training.
metavoice  | 2024-02-13 15:09:20 | INFO     | DF | Running on torch 2.2.0
metavoice  | 2024-02-13 15:09:20 | INFO     | DF | Running on host f8841432a821
metavoice  | loading weights file model.safetensors from cache at /.hf-cache/hub/models--facebook--encodec_24khz/snapshots/c1dbe2ae3f1de713481a3b3e7c47f357092ee040/model.safetensors
metavoice  | All model checkpoint weights were used when initializing EncodecModel.
metavoice  |
metavoice  | All the weights of EncodecModel were initialized from the model checkpoint at facebook/encodec_24khz.
metavoice  | If your task is similar to the task the model of the checkpoint was trained on, you can already use EncodecModel for predictions without further training.
metavoice  | 2024-02-13 15:33:30 | INFO     | DF | Running on torch 2.2.0
metavoice  | 2024-02-13 15:33:30 | INFO     | DF | Running on host 13e976bff3d8
metavoice  | fatal: not a git repository (or any of the parent directories): .git
metavoice  | 2024-02-13 15:33:30 | INFO     | DF | Loading model settings of DeepFilterNet3
metavoice  | 2024-02-13 15:33:31 | INFO     | DF | Using DeepFilterNet3 model at /root/.cache/DeepFilterNet/DeepFilterNet3
metavoice  | 2024-02-13 15:33:31 | INFO     | DF | Initializing model `deepfilternet3`
metavoice  | 2024-02-13 15:33:31 | INFO     | DF | Found checkpoint /root/.cache/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
metavoice  | 2024-02-13 15:33:31 | INFO     | DF | Running on device cpu
metavoice  | 2024-02-13 15:33:31 | INFO     | DF | Model loaded
metavoice  | INFO:     Started server process [1]
metavoice  | INFO:     Waiting for application startup.
metavoice  | INFO:     Application startup complete.
metavoice  | INFO:     Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)

groovybits avatar Feb 13 '24 14:02 groovybits

Would love for you to make a PR once you have got this exploration over the line! Both containerisation and inference on Mac would be a very useful contribution for the community @groovybits

sidroopdaska avatar Feb 13 '24 17:02 sidroopdaska

Yes that sounds good, I have this branch that I am working off of https://github.com/metavoiceio/metavoice-src/compare/main...groovybits:metavoice-src:docker and will make a PR when it is ready.

(Note: I am on Metal/MPS and not completely sure whether there are caveats ahead :) I am also doing this because on Mac/MPS I can't build xformers; it seems it says they will not work on Mac currently, which is what I see.)

Update: Ok it almost works (I had to install curl, remove the Ava voice...) Yet it doesn't like the sample voice Bria it seems either... (I am checking if anything else seems wrong with setup? I used the default ffmpeg in the debian docker image, is that possibly an issue? I notice the Ava voice isn't in the demo on site so maybe the ones in the ui app.py are not correct?)

Server:

metavoice-server  | INFO:     Uvicorn running on http://0.0.0.0:58003 (Press CTRL+C to quit)
metavoice-server  | INFO:     127.0.0.1:41170 - "GET / HTTP/1.1" 404 Not Found
getting cached speaker ref files:   0% 0/1 [00:00<?, ?it/s]  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
metavoice-server  |                                  Dload  Upload   Total   Spent    Left  Speed
100 3999k  100 3999k    0     0  24.8M      0 --:--:-- --:--:-- --:--:-- 24.8M
metavoice-server  | [src/libmpg123/id3.c:INT123_id3_to_utf8():394] warning: Weird tag size 101 for encoding 1 - I will probably trim too early or something but I think the MP3 is broken.
getting cached speaker ref files: 100% 1/1 [00:02<00:00,  2.60s/it]
calculating speaker embeddings:   0% 0/1 [00:00<?, ?it/s][src/libmpg123/id3.c:INT123_id3_to_utf8():394] warning: Weird tag size 101 for encoding 1 - I will probably trim too early or something but I think the MP3 is broken.
metavoice-server exited with code 139

groovybits avatar Feb 13 '24 17:02 groovybits

please pull from main again, we have pushed a few changes since you last forked it seems

i'll test your branch in a few hours

sidroopdaska avatar Feb 13 '24 19:02 sidroopdaska

FYI: I can now build it natively on my Mac, but when running it I run into this issue. The Docker version still gives the previously mentioned error, so that problem looks Docker-specific. I suspect this one may still be an issue with MPS?


If your task is similar to the task the model of the checkpoint was trained on, you can already use EncodecModel for predictions without further training.
2024-02-13 14:09:01 | INFO     | DF | Running on torch 2.1.0
2024-02-13 14:09:01 | INFO     | DF | Running on host earth.local
2024-02-13 14:09:01 | INFO     | DF | Git commit: eb7338abb, branch: stable
2024-02-13 14:09:01 | INFO     | DF | Loading model settings of DeepFilterNet3
2024-02-13 14:09:01 | INFO     | DF | Using DeepFilterNet3 model at /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3
2024-02-13 14:09:01 | INFO     | DF | Initializing model `deepfilternet3`
2024-02-13 14:09:01 | INFO     | DF | Found checkpoint /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
2024-02-13 14:09:01 | INFO     | DF | Running on device cpu
2024-02-13 14:09:01 | INFO     | DF | Model loaded
INFO:     Started server process [73304]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:58003 (Press CTRL+C to quit)
getting cached speaker ref files:   0%| | 0/1 [00:00<?, ?it/s][src/libmpg123/id3.c:INT123_id3_to_utf8():394] warning: Weird tag size 101 for encoding 1 - I will probably trim too early or something but I think the MP3 is broken.
getting cached speaker ref files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.16it/s]
calculating speaker embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1116.99it/s]
batch:   0%| | 0/1 [00:00<?, ?it/s]
Error processing request {'text': 'Test 123', 'guidance': 3.0, 'top_p': 0.95, 'speaker_ref_path': 'https://cdn.themetavoice.xyz/speakers/bria.mp3'}
Traceback (most recent call last):
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/serving.py", line 105, in text_to_speech
    wav_out_path = sample_utterance(
                   ^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 546, in sample_utterance
    return _sample_utterance_batch(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 477, in _sample_utterance_batch
    b_tokens = first_stage_model(
               ^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 356, in __call__
    return self.causal_sample(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 231, in causal_sample
    y = self.model.generate(
        ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/model.py", line 369, in generate
    return self._causal_sample(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/mixins/causal.py", line 392, in _causal_sample
    if guidance_scale[1] > 1:
       ~~~~~~~~~~~~~~^^^
TypeError: 'float' object is not subscriptable
INFO:     192.168.50.55:51261 - "POST /tts HTTP/1.1" 500 Internal Server Error

Seems the guidance scale part is not as expected, I tried to

groovybits avatar Feb 13 '24 22:02 groovybits

Guidance scale should be (3.0, 1.0)... what is it currently set to?

vatsalaggarwal avatar Feb 13 '24 22:02 vatsalaggarwal

I am using fam/ui/app.py with the defaults it sets, it seems to be trying to use a decimal, yet I can't see how to fix that anywhere.

groovybits avatar Feb 13 '24 23:02 groovybits

The ui sends the "Speaker similarity" value as a single float instead of a tuple. I just changed line 66 in fam/ui/app.py to "guidance": [d_guidance, 1.0], which seems to have made it work for me. (Not sure if it makes sense like this though.)
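The workaround above amounts to coercing a single float into a two-element guidance value before it reaches the sampler. Here is a minimal sketch of that idea, assuming the second element defaults to 1.0 as in the change described; `normalize_guidance` is a hypothetical helper, not a function in metavoice-src.

```python
from typing import Sequence, Tuple, Union


def normalize_guidance(
    guidance: Union[float, Sequence[float]],
    default_second: float = 1.0,  # assumption: mirrors the 1.0 used in the workaround
) -> Tuple[float, float]:
    """Return guidance as a (first, second) pair.

    A bare number is treated as the first element and paired with
    `default_second`; a two-element sequence is passed through.
    """
    if isinstance(guidance, (int, float)):
        return (float(guidance), default_second)
    values = tuple(float(g) for g in guidance)
    if len(values) != 2:
        raise ValueError(f"expected 2 guidance values, got {len(values)}")
    return values


print(normalize_guidance(3.0))
print(normalize_guidance([3.0, 1.0]))
```

Doing this on the server side rather than only in the UI would also protect other clients that send a single float, though whether that is the right place for it is a design decision for the maintainers.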

djmaze avatar Feb 13 '24 23:02 djmaze

@pyetras

vatsalaggarwal avatar Feb 13 '24 23:02 vatsalaggarwal

On another note, choosing any preset voice other than Bria is not working because the other samples are shorter than 30s.

djmaze avatar Feb 13 '24 23:02 djmaze

On another note, choosing any preset voice other than Bria is not working because the other samples are shorter than 30s.

give me a few mins to fix this @djmaze

sidroopdaska avatar Feb 14 '24 00:02 sidroopdaska

Done, please try now

sidroopdaska avatar Feb 14 '24 00:02 sidroopdaska

I'm seeing this now after getting past the earlier issues. This is on an M2 Ultra with MPS; I'm guessing the flash-attn module can't be installed there, so the part of the code that depends on it isn't loaded...

2024-02-13 16:52:44 | INFO     | DF | Running on torch 2.1.0
2024-02-13 16:52:44 | INFO     | DF | Running on host earth.local
2024-02-13 16:52:44 | INFO     | DF | Git commit: eb7338abb, branch: stable
2024-02-13 16:52:44 | INFO     | DF | Loading model settings of DeepFilterNet3
2024-02-13 16:52:44 | INFO     | DF | Using DeepFilterNet3 model at /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3
2024-02-13 16:52:44 | INFO     | DF | Initializing model `deepfilternet3`
2024-02-13 16:52:44 | INFO     | DF | Found checkpoint /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
2024-02-13 16:52:44 | INFO     | DF | Running on device cpu
2024-02-13 16:52:44 | INFO     | DF | Model loaded
INFO:     Started server process [77674]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:58003 (Press CTRL+C to quit)
getting cached speaker ref files:   0%|                                                                                                                                                     | 0/1 [00:00<?, ?it/s][src/libmpg123/id3.c:INT123_id3_to_utf8():394] warning: Weird tag size 101 for encoding 1 - I will probably trim too early or something but I think the MP3 is broken.
getting cached speaker ref files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.21it/s]
calculating speaker embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1099.14it/s]
batch:   0%|                                                                                                                                                                                | 0/1 [00:00<?, ?it/s][hack!!!!] Guidance is on, so we're doubling/tripling batch size!                                                                                                                        | 0/1728 [00:00<?, ?it/s]
tokens:   0%|                                                                                                                                                                            | 0/1728 [00:00<?, ?it/s]
batch:   0%|                                                                                                                                                                                | 0/1 [00:00<?, ?it/s]
Error processing request {'text': 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model by MetaVoice.', 'guidance': [3.0, 1.0], 'top_p': 0.95, 'speaker_ref_path': 'https://cdn.themetavoice.xyz/speakers/bria.mp3'}
Traceback (most recent call last):
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/serving.py", line 105, in text_to_speech
    wav_out_path = sample_utterance(
                   ^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 546, in sample_utterance
    return _sample_utterance_batch(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 477, in _sample_utterance_batch
    b_tokens = first_stage_model(
               ^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 356, in __call__
    return self.causal_sample(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 231, in causal_sample
    y = self.model.generate(
        ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/model.py", line 369, in generate
    return self._causal_sample(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/mixins/causal.py", line 410, in _causal_sample
    batch_idx = self._sample_batch(
                ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/mixins/causal.py", line 264, in _sample_batch
    idx_next = self._sample_next_token(
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/mixins/causal.py", line 85, in _sample_next_token
    list_logits, _ = self(
                     ^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/model.py", line 282, in forward
    x = block(x)
        ^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/layers/combined.py", line 50, in forward
    x = x + self.attn(self.ln_1(x))
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/layers/attn.py", line 291, in forward
    y = self._fd_attention(c_x)
        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/layers/attn.py", line 191, in _fd_attention
    y = flash_attn_with_kvcache(
        ^^^^^^^^^^^^^^^^^^^^^^^
NameError: name 'flash_attn_with_kvcache' is not defined
INFO:     127.0.0.1:57688 - "POST /tts HTTP/1.1" 500 Internal Server Error

groovybits avatar Feb 14 '24 00:02 groovybits

Thanks, I can test again later today. Concerning flash_attn, yeah it seems to me that it is currently required. So I added pip install packaging and pip install flash-attn to my dockerfile which made it work for me. (I should note that I am using CUDA with an RTX card.)
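One way to surface this earlier than a `NameError` during a request is to check at startup whether `flash_attn` can be imported. This is a sketch only: the backend names are borrowed from the warning text in the logs above, and whether the repository exposes a matching switch is an assumption. Both functions are hypothetical.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the flash_attn package can be found for import."""
    return importlib.util.find_spec("flash_attn") is not None


def pick_attention_backend() -> str:
    """Choose a backend label based on what is installed.

    The labels are illustrative; "torch_attn" is the name used in the
    warning printed when flash_attn is missing.
    """
    return "flash_attn" if flash_attn_available() else "torch_attn"


print(pick_attention_backend())
```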

djmaze avatar Feb 14 '24 01:02 djmaze

Done, please try now

Thanks, it works now. (Wondering what looping the reference audio does to the quality though..)

djmaze avatar Feb 14 '24 09:02 djmaze

I also realized that there is already a PR open for docker support in #17. We should probably join forces over there?

djmaze avatar Feb 14 '24 09:02 djmaze

(Wondering what looping the reference audio does to the quality though..)

Haha, I did this as a stop-gap until we release a more extensive palette of preset voices in a couple of days. I haven't seen artefacts being introduced from looping these two voices. If you notice something peculiar, then please do share.
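For readers curious what looping a short reference clip looks like, here is a minimal sketch that repeats a sequence of samples until it reaches a minimum duration. The 30 s threshold comes from the comment above about short samples. The truncation to exactly the minimum length and the function name are assumptions, not necessarily how the actual fix works.

```python
import math
from typing import List, Sequence


def loop_to_min_duration(
    samples: Sequence[float],
    sample_rate: int,
    min_seconds: float = 30.0,
) -> List[float]:
    """Repeat `samples` end to end until they cover `min_seconds`."""
    if not samples:
        raise ValueError("cannot loop an empty clip")
    min_len = math.ceil(min_seconds * sample_rate)
    if len(samples) >= min_len:
        return list(samples)  # already long enough, leave unchanged
    repeats = math.ceil(min_len / len(samples))
    # Repeat whole copies, then trim the tail to the target length.
    return (list(samples) * repeats)[:min_len]


# Toy example: a 3-sample "clip" at 1 Hz looped to 7 seconds.
print(loop_to_min_duration([1, 2, 3], sample_rate=1, min_seconds=7))
```

A hard loop like this can produce a discontinuity at each seam, which is one reason to listen for artefacts as suggested above.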

sidroopdaska avatar Feb 14 '24 11:02 sidroopdaska

I also realized that there is already a PR open for docker support in https://github.com/metavoiceio/metavoice-src/pull/17. We should probably join forces over there?

Yes, please!

sidroopdaska avatar Feb 14 '24 11:02 sidroopdaska

I also realized that there is already a PR open for docker support in #17. We should probably join forces over there?

Yes, please!

Definitely agree, this was more of an exercise in getting it working. I think I now am dealing with metal/mps issues beyond basic dockerization (and seems metal/mps can't pass the GPU through docker which may explain my 139 error in the docker). Outside of the docker on my metal/mps mac I run into the lack of flash_attn packages.
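A quick way to check which accelerator PyTorch can see in a given environment is sketched below. Docker Desktop on macOS runs containers inside a Linux VM, so the MPS check would be expected to return False there. `pick_device` is a hypothetical helper, and the fallback to "cpu" when torch is not installed is an assumption made so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumption: treat a missing torch as CPU-only
    if torch.cuda.is_available():
        return "cuda"
    # The MPS backend exists in recent PyTorch releases; guard for older ones.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

Running this both natively and inside the container makes it easy to confirm whether the GPU is visible in each case.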

groovybits avatar Feb 14 '24 14:02 groovybits

I also realized that there is already a PR open for docker support in #17. We should probably join forces over there?

Yes, please!

Definitely agree, this was more of an exercise in getting it working. I think I now am dealing with metal/mps issues beyond basic dockerization (and seems metal/mps can't pass the GPU through docker which may explain my 139 error in the docker). Outside of the docker on my metal/mps mac I run into the lack of flash_attn packages.

the docker implementation i did should work fine with CUDA GPU-acceleration on any system...

l4b4r4b4b4 avatar Feb 15 '24 10:02 l4b4r4b4b4