I created a working Dockerfile to build it consistently (at least on my system)...
Hi, I created this Dockerfile just now to make building easier (and to make it possible on my system, a Mac, where Python seems unhappy)...
I am wondering if this could be included and improved upon; I haven't gotten it running locally yet, but it at least builds for me...
Branch with changes: https://github.com/metavoiceio/metavoice-src/compare/main...groovybits:metavoice-src:docker
Also, which port does it use, and how do I use it? Or is there more information on how to run it locally as an HTTP server, if that's what this does?
Thanks! I am reading up / exploring as I can since this looks amazing :)
Update: it builds with docker compose now; not sure if it's actually "working" yet (need to look closer).
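For reference, the Dockerfile on that branch boils down to roughly the following (a sketch reconstructed here; the exact file is on the branch linked above):

```dockerfile
# Rough sketch of the Dockerfile from the branch above; the steps correspond
# to the docker compose build output quoted further down in this thread.
FROM python:3.11-slim

# System packages needed to build the Python requirements and to run ffmpeg.
RUN apt-get update && \
    apt-get install -y ffmpeg ninja-build g++ build-essential libomp-dev && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY . .

# Install torch first, then the remaining requirements (MAX_JOBS=1 keeps the
# compile steps from exhausting memory), then the project package itself.
RUN pip install --no-cache-dir "torch>=2.1.0"
RUN MAX_JOBS=1 pip install --no-cache-dir -r requirements.txt
RUN pip install -e .
```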
sure thing, will review it tomorrow morning
I think it's port 58003. Also, you should use `pip install -e .` for the app package to install correctly.
(Still, personally, I cannot currently make this work because of other errors.)
Hey @djmaze, can you share some details:
- GPU being used
- Python version
- Error you are facing
I had to increase the RAM allocated to Docker quite a bit, and now it loads (I was getting a 137 exit code until I fixed that by increasing Docker's memory allocation).
Here is the output now... (I needed to add git to the installed packages, then it works...)
Calls to it on port 58003 seem to work now overall (I changed docker compose to map 58003:58003 and set the server to listen on 0.0.0.0 so it's reachable from the host system). Going to see how to use it now :) Am I right in guessing that it actually uses the GPU via MPS this way, or does this need something Docker can't do for the Metal GPU? I am still trying to understand that aspect of it...
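Roughly, the compose service on my branch now looks like this (a sketch; the exact file is on the branch, and the cache volume is my guess at why the model downloads land under /.hf-cache):

```yaml
# Approximate docker-compose service from my branch; see the branch for the
# exact file. The server itself is patched to listen on 0.0.0.0.
services:
  metavoice-server:
    build: .
    ports:
      - "58003:58003"            # expose the API port to the host
    volumes:
      - ./.hf-cache:/.hf-cache   # keep downloaded checkpoints between runs
```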
Using Mac M2 Ultra with Python 3.11
(.venv) chris@earth metavoice-src % docker compose up --build
[+] Building 146.5s (12/12) FINISHED docker:desktop-linux
=> [metavoice-server internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 920B 0.0s
=> [metavoice-server internal] load metadata for docker.io/library/python:3.11-slim 0.4s
=> [metavoice-server internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [metavoice-server 1/7] FROM docker.io/library/python:3.11-slim@sha256:0b7568850e62c6405d15098486b8fc23d9d6a845b112b866aceaf65fbded2fc1 0.0s
=> [metavoice-server internal] load build context 0.8s
=> => transferring context: 3.26MB 0.8s
=> CACHED [metavoice-server 2/7] RUN apt-get update && apt-get install -y ffmpeg ninja-build g++ build-essential libomp-dev && rm -rf /var/ 0.0s
=> CACHED [metavoice-server 3/7] WORKDIR /app 0.0s
=> [metavoice-server 4/7] COPY . . 1.1s
=> [metavoice-server 5/7] RUN pip install --no-cache-dir "torch>=2.1.0" 9.5s
=> [metavoice-server 6/7] RUN MAX_JOBS=1 pip install --no-cache-dir -r requirements.txt 129.1s
=> [metavoice-server 7/7] RUN pip install -e . 3.6s
=> [metavoice-server] exporting to image 2.0s
=> => exporting layers 2.0s
=> => writing image sha256:f3f4f48ddd7728be8b77eb04b70a81c00cd452dd9f7dbff307fa2ea091e7bb51 0.0s
=> => naming to docker.io/library/metavoice-src-metavoice-server 0.0s
[+] Running 1/0
✔ Container metavoice Recreated 0.0s
Attaching to metavoice
metavoice | /usr/local/lib/python3.11/site-packages/df/io.py:9: UserWarning: `torchaudio.backend.common.AudioMetaData` has been moved to `torchaudio.AudioMetaData`. Please update the import path.
metavoice | from torchaudio.backend.common import AudioMetaData
metavoice | /app/fam/llm/layers/attn.py:14: UserWarning: flash_attn not installed, make sure to replace attention mechanism with torch_attn
metavoice | warnings.warn("flash_attn not installed, make sure to replace attention mechanism with torch_attn")
Fetching 6 files: 0% 0/6 [00:00<?, ?it/s]downloading https://huggingface.co/metavoiceio/metavoice-1B-v0.1/resolve/bf8f51bb9c3c508987b37f3197e85ea93f42475e/second_stage.pt to /.hf-cache/hub/tmpe_usumgi
metavoice | downloading https://huggingface.co/metavoiceio/metavoice-1B-v0.1/resolve/bf8f51bb9c3c508987b37f3197e85ea93f42475e/speaker_encoder.pt to /.hf-cache/hub/tmpnt4gjkty
metavoice | downloading https://huggingface.co/metavoiceio/metavoice-1B-v0.1/resolve/bf8f51bb9c3c508987b37f3197e85ea93f42475e/.gitattributes to /.hf-cache/hub/tmpqt19yl5p
metavoice | downloading https://huggingface.co/metavoiceio/metavoice-1B-v0.1/resolve/bf8f51bb9c3c508987b37f3197e85ea93f42475e/config.json to /.hf-cache/hub/tmputo3939d
metavoice | downloading https://huggingface.co/metavoiceio/metavoice-1B-v0.1/resolve/bf8f51bb9c3c508987b37f3197e85ea93f42475e/README.md to /.hf-cache/hub/tmpq4thb9ry
.gitattributes: 100% 1.52k/1.52k [00:00<00:00, 15.5MB/s]
config.json: 100% 39.0/39.0 [00:00<00:00, 500kB/s]
README.md: 100% 2.67k/2.67k [00:00<00:00, 26.2MB/s]
metavoice | downloading https://huggingface.co/metavoiceio/metavoice-1B-v0.1/resolve/bf8f51bb9c3c508987b37f3197e85ea93f42475e/first_stage.pt to /.hf-cache/hub/tmpicnsur8a
speaker_encoder.pt: 100% 17.1M/17.1M [00:00<00:00, 25.9MB/s]
second_stage.pt: 100% 57.9M/57.9M [00:00<00:00, 71.8MB/s]
first_stage.pt: 100% 4.97G/4.97G [00:47<00:00, 104MB/s]
Fetching 6 files: 100% 6/6 [00:49<00:00, 8.17s/it]
metavoice | number of parameters: 1239.00M
metavoice | downloading https://huggingface.co/facebook/encodec_24khz/resolve/main/config.json to /.hf-cache/hub/tmp1itot20c
config.json: 100% 809/809 [00:00<00:00, 7.05MB/s]
metavoice | loading configuration file config.json from cache at /.hf-cache/hub/models--facebook--encodec_24khz/snapshots/c1dbe2ae3f1de713481a3b3e7c47f357092ee040/config.json
metavoice | Model config EncodecConfig {
metavoice | "_name_or_path": "ArthurZ/encodec_24khz",
metavoice | "architectures": [
metavoice | "EncodecModel"
metavoice | ],
metavoice | "audio_channels": 1,
metavoice | "chunk_length_s": null,
metavoice | "codebook_dim": 128,
metavoice | "codebook_size": 1024,
metavoice | "compress": 2,
metavoice | "dilation_growth_rate": 2,
metavoice | "hidden_size": 128,
metavoice | "kernel_size": 7,
metavoice | "last_kernel_size": 7,
metavoice | "model_type": "encodec",
metavoice | "norm_type": "weight_norm",
metavoice | "normalize": false,
metavoice | "num_filters": 32,
metavoice | "num_lstm_layers": 2,
metavoice | "num_residual_layers": 1,
metavoice | "overlap": null,
metavoice | "pad_mode": "reflect",
metavoice | "residual_kernel_size": 3,
metavoice | "sampling_rate": 24000,
metavoice | "target_bandwidths": [
metavoice | 1.5,
metavoice | 3.0,
metavoice | 6.0,
metavoice | 12.0,
metavoice | 24.0
metavoice | ],
metavoice | "torch_dtype": "float32",
metavoice | "transformers_version": "4.37.2",
metavoice | "trim_right_ratio": 1.0,
metavoice | "upsampling_ratios": [
metavoice | 8,
metavoice | 5,
metavoice | 4,
metavoice | 2
metavoice | ],
metavoice | "use_causal_conv": true,
metavoice | "use_conv_shortcut": true
metavoice | }
metavoice |
metavoice | downloading https://huggingface.co/facebook/encodec_24khz/resolve/main/model.safetensors to /.hf-cache/hub/tmpvc3lmcq6
model.safetensors: 100% 93.1M/93.1M [00:01<00:00, 91.1MB/s]
metavoice | loading weights file model.safetensors from cache at /.hf-cache/hub/models--facebook--encodec_24khz/snapshots/c1dbe2ae3f1de713481a3b3e7c47f357092ee040/model.safetensors
metavoice | /usr/local/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
metavoice | warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
metavoice | All model checkpoint weights were used when initializing EncodecModel.
metavoice |
metavoice | All the weights of EncodecModel were initialized from the model checkpoint at facebook/encodec_24khz.
metavoice | If your task is similar to the task the model of the checkpoint was trained on, you can already use EncodecModel for predictions without further training.
metavoice | downloading https://huggingface.co/facebook/multiband-diffusion/resolve/main/mbd_comp_8.pt to /.hf-cache/hub/tmpfwh3_tk2
mbd_comp_8.pt: 100% 4.58G/4.58G [00:44<00:00, 103MB/s]
metavoice | number of parameters: 14.07M
metavoice | loading configuration file config.json from cache at /.hf-cache/hub/models--facebook--encodec_24khz/snapshots/c1dbe2ae3f1de713481a3b3e7c47f357092ee040/config.json
metavoice | Model config EncodecConfig {
metavoice | "_name_or_path": "ArthurZ/encodec_24khz",
metavoice | "architectures": [
metavoice | "EncodecModel"
metavoice | ],
metavoice | "audio_channels": 1,
metavoice | "chunk_length_s": null,
metavoice | "codebook_dim": 128,
metavoice | "codebook_size": 1024,
metavoice | "compress": 2,
metavoice | "dilation_growth_rate": 2,
metavoice | "hidden_size": 128,
metavoice | "kernel_size": 7,
metavoice | "last_kernel_size": 7,
metavoice | "model_type": "encodec",
metavoice | "norm_type": "weight_norm",
metavoice | "normalize": false,
metavoice | "num_filters": 32,
metavoice | "num_lstm_layers": 2,
metavoice | "num_residual_layers": 1,
metavoice | "overlap": null,
metavoice | "pad_mode": "reflect",
metavoice | "residual_kernel_size": 3,
metavoice | "sampling_rate": 24000,
metavoice | "target_bandwidths": [
metavoice | 1.5,
metavoice | 3.0,
metavoice | 6.0,
metavoice | 12.0,
metavoice | 24.0
metavoice | ],
metavoice | "torch_dtype": "float32",
metavoice | "transformers_version": "4.37.2",
metavoice | "trim_right_ratio": 1.0,
metavoice | "upsampling_ratios": [
metavoice | 8,
metavoice | 5,
metavoice | 4,
metavoice | 2
metavoice | ],
metavoice | "use_causal_conv": true,
metavoice | "use_conv_shortcut": true
metavoice | }
metavoice |
metavoice | loading weights file model.safetensors from cache at /.hf-cache/hub/models--facebook--encodec_24khz/snapshots/c1dbe2ae3f1de713481a3b3e7c47f357092ee040/model.safetensors
metavoice | All model checkpoint weights were used when initializing EncodecModel.
metavoice |
metavoice | All the weights of EncodecModel were initialized from the model checkpoint at facebook/encodec_24khz.
metavoice | If your task is similar to the task the model of the checkpoint was trained on, you can already use EncodecModel for predictions without further training.
metavoice | 2024-02-13 15:09:20 | INFO | DF | Running on torch 2.2.0
metavoice | 2024-02-13 15:09:20 | INFO | DF | Running on host f8841432a821
metavoice | loading weights file model.safetensors from cache at /.hf-cache/hub/models--facebook--encodec_24khz/snapshots/c1dbe2ae3f1de713481a3b3e7c47f357092ee040/model.safetensors
metavoice | All model checkpoint weights were used when initializing EncodecModel.
metavoice |
metavoice | All the weights of EncodecModel were initialized from the model checkpoint at facebook/encodec_24khz.
metavoice | If your task is similar to the task the model of the checkpoint was trained on, you can already use EncodecModel for predictions without further training.
metavoice | 2024-02-13 15:33:30 | INFO | DF | Running on torch 2.2.0
metavoice | 2024-02-13 15:33:30 | INFO | DF | Running on host 13e976bff3d8
metavoice | fatal: not a git repository (or any of the parent directories): .git
metavoice | 2024-02-13 15:33:30 | INFO | DF | Loading model settings of DeepFilterNet3
metavoice | 2024-02-13 15:33:31 | INFO | DF | Using DeepFilterNet3 model at /root/.cache/DeepFilterNet/DeepFilterNet3
metavoice | 2024-02-13 15:33:31 | INFO | DF | Initializing model `deepfilternet3`
metavoice | 2024-02-13 15:33:31 | INFO | DF | Found checkpoint /root/.cache/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
metavoice | 2024-02-13 15:33:31 | INFO | DF | Running on device cpu
metavoice | 2024-02-13 15:33:31 | INFO | DF | Model loaded
metavoice | INFO: Started server process [1]
metavoice | INFO: Waiting for application startup.
metavoice | INFO: Application startup complete.
metavoice | INFO: Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)
Would love for you to make a PR once you have got this exploration over the line! Both containerisation and inference on Mac would be a very useful contribution for the community @groovybits
Yes, that sounds good. I have this branch that I am working off of: https://github.com/metavoiceio/metavoice-src/compare/main...groovybits:metavoice-src:docker and I will make a PR when it is ready.
(Note: I am on Metal/MPS and not completely sure if there are caveats ahead :) I'm also doing this because on Mac MPS I can't build xformers; it seems they currently don't work on Mac, which matches what I see.)
Update: OK, it almost works (I had to install curl and remove the Ava voice...), yet it doesn't seem to like the sample voice Bria either... (I am checking whether anything else is wrong with my setup. I used the default ffmpeg in the Debian Docker image; could that be an issue? I notice the Ava voice isn't in the demo on the site, so maybe the voices listed in the UI app.py are not correct?)
Server:
metavoice-server | INFO: Uvicorn running on http://0.0.0.0:58003 (Press CTRL+C to quit)
metavoice-server | INFO: 127.0.0.1:41170 - "GET / HTTP/1.1" 404 Not Found
getting cached speaker ref files: 0% 0/1 [00:00<?, ?it/s] % Total % Received % Xferd Average Speed Time Time Time Current
metavoice-server | Dload Upload Total Spent Left Speed
100 3999k 100 3999k 0 0 24.8M 0 --:--:-- --:--:-- --:--:-- 24.8M
metavoice-server | [src/libmpg123/id3.c:INT123_id3_to_utf8():394] warning: Weird tag size 101 for encoding 1 - I will probably trim too early or something but I think the MP3 is broken.
getting cached speaker ref files: 100% 1/1 [00:02<00:00, 2.60s/it]
calculating speaker embeddings: 0% 0/1 [00:00<?, ?it/s][src/libmpg123/id3.c:INT123_id3_to_utf8():394] warning: Weird tag size 101 for encoding 1 - I will probably trim too early or something but I think the MP3 is broken.
metavoice-server exited with code 139
please pull from main again, we have pushed a few changes since you last forked it seems
i'll test your branch in a few hours
FYI: I can get it building natively on my Mac now; when running, I hit the issue below. Docker still gives the previously mentioned error, so that problem looks Docker-specific. I suspect this one may still be an MPS issue?
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncodecModel for predictions without further training.
2024-02-13 14:09:01 | INFO | DF | Running on torch 2.1.0
2024-02-13 14:09:01 | INFO | DF | Running on host earth.local
2024-02-13 14:09:01 | INFO | DF | Git commit: eb7338abb, branch: stable
2024-02-13 14:09:01 | INFO | DF | Loading model settings of DeepFilterNet3
2024-02-13 14:09:01 | INFO | DF | Using DeepFilterNet3 model at /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3
2024-02-13 14:09:01 | INFO | DF | Initializing model deepfilternet3
2024-02-13 14:09:01 | INFO | DF | Found checkpoint /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
2024-02-13 14:09:01 | INFO | DF | Running on device cpu
2024-02-13 14:09:01 | INFO | DF | Model loaded
INFO: Started server process [73304]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:58003 (Press CTRL+C to quit)
getting cached speaker ref files: 0%| | 0/1 [00:00<?, ?it/s][src/libmpg123/id3.c:INT123_id3_to_utf8():394] warning: Weird tag size 101 for encoding 1 - I will probably trim too early or something but I think the MP3 is broken.
getting cached speaker ref files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.16it/s]
calculating speaker embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1116.99it/s]
batch: 0%| | 0/1 [00:00<?, ?it/s]
Error processing request {'text': 'Test 123', 'guidance': 3.0, 'top_p': 0.95, 'speaker_ref_path': 'https://cdn.themetavoice.xyz/speakers/bria.mp3'}
Traceback (most recent call last):
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/serving.py", line 105, in text_to_speech
wav_out_path = sample_utterance(
^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 546, in sample_utterance
return _sample_utterance_batch(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 477, in _sample_utterance_batch
b_tokens = first_stage_model(
^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 356, in call
return self.causal_sample(
^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 231, in causal_sample
y = self.model.generate(
^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/model.py", line 369, in generate
return self._causal_sample(
^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/mixins/causal.py", line 392, in _causal_sample
if guidance_scale[1] > 1:
~~~~~~~~~~~~~~^^^
TypeError: 'float' object is not subscriptable
INFO: 192.168.50.55:51261 - "POST /tts HTTP/1.1" 500 Internal Server Error
Seems the guidance scale part is not in the format it expects; I tried to...
Guidance scale should be (3.0, 1.0)... what is it currently set to?
I am using fam/ui/app.py with the defaults it sets; it seems to send a single decimal value, and I can't see how to fix that anywhere.
The UI sends the "Speaker similarity" value as a single float instead of a tuple. I just changed line 66 in `fam/ui/app.py` to `"guidance": [d_guidance, 1.0]`, which seems to have made it work for me. (Not sure if it makes sense like this, though.)
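For anyone else hitting this, the change looks roughly like the snippet below. The payload keys and default values are taken from the server's error output above; the exact surrounding code in app.py may differ, so treat this as a sketch:

```python
# Sketch of the workaround: "guidance" must be a [guidance_scale, 1.0] pair,
# not a bare float. Keys mirror the payload shown in the server error logs.
import json

d_guidance = 3.0  # value the guidance / "Speaker similarity" slider produces

payload = {
    "text": "Test 123",
    "guidance": [d_guidance, 1.0],  # was: "guidance": d_guidance
    "top_p": 0.95,
    "speaker_ref_path": "https://cdn.themetavoice.xyz/speakers/bria.mp3",
}

# fam/ui/app.py then POSTs this payload to the server's /tts endpoint
# (http://localhost:58003/tts when running locally); the actual HTTP call is
# left to whatever app.py already does, since I haven't reproduced it here.
print(json.dumps(payload, indent=2))
```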
@pyetras
On another note, choosing any preset voice other than Bria is not working because the other samples are shorter than 30s.
> On another note, choosing any preset voice other than Bria is not working because the other samples are shorter than 30s.
give me a few mins to fix this @djmaze
Done, please try now
I'm seeing this now after getting past the previous issues. This is on MPS (M2 Ultra); I'm guessing it can't install the flash-attn module there, so the code path that would use it isn't loaded...
2024-02-13 16:52:44 | INFO | DF | Running on torch 2.1.0
2024-02-13 16:52:44 | INFO | DF | Running on host earth.local
2024-02-13 16:52:44 | INFO | DF | Git commit: eb7338abb, branch: stable
2024-02-13 16:52:44 | INFO | DF | Loading model settings of DeepFilterNet3
2024-02-13 16:52:44 | INFO | DF | Using DeepFilterNet3 model at /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3
2024-02-13 16:52:44 | INFO | DF | Initializing model `deepfilternet3`
2024-02-13 16:52:44 | INFO | DF | Found checkpoint /Users/chris/Library/Caches/DeepFilterNet/DeepFilterNet3/checkpoints/model_120.ckpt.best with epoch 120
2024-02-13 16:52:44 | INFO | DF | Running on device cpu
2024-02-13 16:52:44 | INFO | DF | Model loaded
INFO: Started server process [77674]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:58003 (Press CTRL+C to quit)
getting cached speaker ref files: 0%| | 0/1 [00:00<?, ?it/s][src/libmpg123/id3.c:INT123_id3_to_utf8():394] warning: Weird tag size 101 for encoding 1 - I will probably trim too early or something but I think the MP3 is broken.
getting cached speaker ref files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.21it/s]
calculating speaker embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1099.14it/s]
batch: 0%| | 0/1 [00:00<?, ?it/s][hack!!!!] Guidance is on, so we're doubling/tripling batch size! | 0/1728 [00:00<?, ?it/s]
tokens: 0%| | 0/1728 [00:00<?, ?it/s]
batch: 0%| | 0/1 [00:00<?, ?it/s]
Error processing request {'text': 'This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model by MetaVoice.', 'guidance': [3.0, 1.0], 'top_p': 0.95, 'speaker_ref_path': 'https://cdn.themetavoice.xyz/speakers/bria.mp3'}
Traceback (most recent call last):
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/serving.py", line 105, in text_to_speech
wav_out_path = sample_utterance(
^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 546, in sample_utterance
return _sample_utterance_batch(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 477, in _sample_utterance_batch
b_tokens = first_stage_model(
^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 356, in __call__
return self.causal_sample(
^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/sample.py", line 231, in causal_sample
y = self.model.generate(
^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/model.py", line 369, in generate
return self._causal_sample(
^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/mixins/causal.py", line 410, in _causal_sample
batch_idx = self._sample_batch(
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/mixins/causal.py", line 264, in _sample_batch
idx_next = self._sample_next_token(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/mixins/causal.py", line 85, in _sample_next_token
list_logits, _ = self(
^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/model.py", line 282, in forward
x = block(x)
^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/layers/combined.py", line 50, in forward
x = x + self.attn(self.ln_1(x))
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/layers/attn.py", line 291, in forward
y = self._fd_attention(c_x)
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris/code/rsllm/metavoice-src-groovybits/fam/llm/layers/attn.py", line 191, in _fd_attention
y = flash_attn_with_kvcache(
^^^^^^^^^^^^^^^^^^^^^^^
NameError: name 'flash_attn_with_kvcache' is not defined
INFO: 127.0.0.1:57688 - "POST /tts HTTP/1.1" 500 Internal Server Error
Thanks, I can test again later today. Concerning flash_attn, yeah, it seems to me that it is currently required. So I added `pip install packaging` and `pip install flash-attn` to my Dockerfile, which made it work for me. (I should note that I am using CUDA with an RTX card.)
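In case it helps anyone else, the extra lines in my Dockerfile look roughly like this (a sketch; flash-attn only builds and makes sense on a CUDA machine, not on MPS):

```dockerfile
# Rough sketch of the additions to my Dockerfile, placed after torch and the
# requirements are installed (flash-attn needs packaging and torch present at
# build time, and it targets CUDA GPUs only; depending on the base image you
# may also need the CUDA toolkit available for the build).
RUN pip install --no-cache-dir packaging
RUN pip install --no-cache-dir flash-attn
```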
Done, please try now
Thanks, it works now. (Wondering what looping the reference audio does to the quality though..)
I also realized that there is already a PR open for docker support in #17. We should probably join forces over there?
> (Wondering what looping the reference audio does to the quality though..)
Haha, I did this as a stop-gap until we release a more extensive palette of preset voices in a couple of days. I haven't seen artefacts being introduced from looping these two voices. If you notice something peculiar, then please do share.
> I also realized that there is already a PR open for docker support in https://github.com/metavoiceio/metavoice-src/pull/17. We should probably join forces over there?
Yes, please!
> I also realized that there is already a PR open for docker support in #17. We should probably join forces over there?
> Yes, please!
Definitely agree; this was more of an exercise in getting it working. I think I'm now dealing with Metal/MPS issues beyond basic dockerization (and it seems Docker can't pass the Metal GPU through, which may explain my 139 exit code in Docker). Outside of Docker, on my Metal/MPS Mac, I run into the missing flash_attn package.
> I also realized that there is already a PR open for docker support in #17. We should probably join forces over there?
> Yes, please!
> Definitely agree; this was more of an exercise in getting it working. I think I'm now dealing with Metal/MPS issues beyond basic dockerization (and it seems Docker can't pass the Metal GPU through, which may explain my 139 exit code in Docker). Outside of Docker, on my Metal/MPS Mac, I run into the missing flash_attn package.
The Docker implementation I did should work fine with CUDA GPU acceleration on any system...