[Help] [Metis] Voice Conversion Irreproducible
Problem Overview
The example code in models/tts/metis/metis_infer_vc.py is incorrect and cannot be run as-is. Specifically:
- It loads ft.json via load_config, which is unrelated to voice conversion.
- It attempts to load metis_vc.safetensors, which does not exist in the HuggingFace repo. Only the following two files are available:
  - metis_vc_lora_16.safetensors
  - metis_vc_lora_16_adapter.safetensors
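To double-check which VC checkpoint files a given revision of the repo actually contains, a small sketch like the following could be used. The helper names (`vc_checkpoint_files`, `list_remote_vc_files`) are hypothetical; `list_repo_files` comes from `huggingface_hub`, and the `metis_vc/` subfolder is taken from the paths used later in this post.

```python
from typing import Iterable, List


def vc_checkpoint_files(repo_files: Iterable[str], subfolder: str = "metis_vc") -> List[str]:
    """Return the .safetensors files found under `subfolder`, sorted by name."""
    prefix = subfolder.rstrip("/") + "/"
    return sorted(
        f for f in repo_files
        if f.startswith(prefix) and f.endswith(".safetensors")
    )


def list_remote_vc_files(repo_id: str = "amphion/metis", revision: str = "main") -> List[str]:
    """Query the Hub for the file list at `revision` and keep only VC checkpoints.

    Needs network access and the huggingface_hub package; it is imported
    lazily so the pure filter above works without that dependency.
    """
    from huggingface_hub import list_repo_files

    files = list_repo_files(repo_id, repo_type="model", revision=revision)
    return vc_checkpoint_files(files)
```

Calling `list_remote_vc_files()` should show whether `metis_vc.safetensors` is present on `main`; passing a commit hash as `revision` checks an older state of the repo.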
Steps Taken
- Referred to the example usage for TTS here: https://github.com/open-mmlab/Amphion/tree/main/models/tts/metis#2-example-usaage
- Modified the code to a voice conversion (VC) version, as follows:
device = "cuda:0"
metis_cfg = load_config("./models/tts/metis/config/vc.json")
base_ckpt_dir = snapshot_download(
    "amphion/metis",
    repo_type="model",
    local_dir="./models/tts/metis/ckpt",
    allow_patterns=["metis_base/model.safetensors"],
)
lora_ckpt_dir = snapshot_download(
    "amphion/metis",
    repo_type="model",
    local_dir="./models/tts/metis/ckpt",
    allow_patterns=["metis_vc/metis_vc_lora_16.safetensors"],
)
adapter_ckpt_dir = snapshot_download(
    "amphion/metis",
    repo_type="model",
    local_dir="./models/tts/metis/ckpt",
    allow_patterns=["metis_vc/metis_vc_lora_16_adapter.safetensors"],
)
base_ckpt_path = os.path.join(base_ckpt_dir, "metis_base/model.safetensors")
lora_ckpt_path = os.path.join(lora_ckpt_dir, "metis_vc/metis_vc_lora_16.safetensors")
adapter_ckpt_path = os.path.join(adapter_ckpt_dir, "metis_vc/metis_vc_lora_16_adapter.safetensors")
metis = Metis(
    base_ckpt_path=base_ckpt_path,
    lora_ckpt_path=lora_ckpt_path,
    adapter_ckpt_path=adapter_ckpt_path,
    cfg=metis_cfg,
    device=device,
    model_type="vc",
)
prompt_speech_path = "./models/tts/metis/wav/vc/prompt.wav"
source_speech_path = "./models/tts/metis/wav/vc/source.wav"
n_timesteps = 20
cfg = 1.0
gen_speech = metis(
    prompt_speech_path=prompt_speech_path,
    source_speech_path=source_speech_path,
    cfg=cfg,
    n_timesteps=n_timesteps,
    model_type="vc",
)
sf.write("./models/tts/metis/wav/vc/gen.wav", gen_speech, 24000)
- Used the example WAV files in models/tts/metis/wav/vc/.
Expected Outcome
Expected to generate intelligible and high-quality converted speech, similar to the samples on the demo page.
Actual Outcome
The generated audio is very low quality and does not contain any human voice — it's mostly noise. This makes the current VC pipeline irreproducible.
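To put a rough number on "mostly noise", a quick check like the sketch below may help when comparing outputs across checkpoints. It is not part of Metis, and the function names are hypothetical. It computes the RMS level and the zero-crossing rate of a mono signal. Broadband noise tends to have a much higher zero-crossing rate than voiced speech, but no particular threshold is implied here.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a mono signal (0.0 for an empty signal)."""
    if len(samples) == 0:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)
```

With `soundfile` installed, the samples can be loaded via `data, sr = sf.read("./models/tts/metis/wav/vc/gen.wav")` and passed in as `data.tolist()`, then compared against the same statistics for source.wav.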
Environment Information
- Operating System: Ubuntu 20.04.5 LTS
- Python Version: 3.10.16
- Driver & CUDA Version: Driver 470.103.01 & CUDA 11.4
- Error Messages and Logs: No runtime errors, but the model output is unusable.
Additional Context
Please provide the correct inference code used to generate the demo samples at https://metis-demo.github.io/#metis-vc. It would be especially helpful if you could:
- Fix the example script at metis_infer_vc.py
- Clearly specify which checkpoint files are required
- Share the hyperparameters (cfg, n_timesteps, etc.) and audio preprocessing steps used in your demos
Thanks for your work. I am more than excited to use Metis VC once this is resolved.
same
same
same
I was able to get voice conversion working, but only by using a legacy checkpoint that’s not included in the current main branch.
1. Download the legacy checkpoint
The file metis_vc.safetensors still exists in an older commit of the Hugging Face repo: 🔗 Commit 41499cbf5d91b06900fc774f673e70981f24b7ed
After downloading it, place it here:
models/tts/metis/ckpt/metis_vc/metis_vc.safetensors
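Instead of downloading by hand, the file can probably be fetched from that commit with `hf_hub_download` by pinning `revision` to the commit hash. This is a sketch: the in-repo path `metis_vc/metis_vc.safetensors` is an assumption that mirrors the local layout above, and the helper names are hypothetical.

```python
import os

# Commit hash taken from the link above.
LEGACY_REVISION = "41499cbf5d91b06900fc774f673e70981f24b7ed"
# Assumption: the file sits at this path inside the repo at that commit.
LEGACY_FILENAME = "metis_vc/metis_vc.safetensors"


def expected_local_path(local_dir: str, filename: str = LEGACY_FILENAME) -> str:
    """Where the checkpoint should end up once downloaded into `local_dir`."""
    return os.path.join(local_dir, *filename.split("/"))


def download_legacy_checkpoint(local_dir: str = "models/tts/metis/ckpt") -> str:
    """Download the legacy VC checkpoint from the pinned commit.

    Needs network access and the huggingface_hub package.
    Returns the local path of the downloaded file.
    """
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id="amphion/metis",
        filename=LEGACY_FILENAME,
        revision=LEGACY_REVISION,
        repo_type="model",
        local_dir=local_dir,
    )
```

Run from the Amphion repo root, `download_legacy_checkpoint()` should place the file at the path shown above.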
2. Code changes
I skipped snapshot_download and pointed directly to the local checkpoint:
import os

import soundfile as sf

# Metis and load_config are imported as in metis_infer_vc.py.
device = "cuda:0"  # same device as in the original snippet

metis_cfg = load_config("./models/tts/metis/config/ft.json")
ckpt_dir = "models/tts/metis/ckpt"
ckpt_path = os.path.join(ckpt_dir, "metis_vc/metis_vc.safetensors")
metis = Metis(
    ckpt_path=ckpt_path,
    cfg=metis_cfg,
    device=device,
    model_type="vc",
)
prompt_speech_path = "./models/tts/metis/wav/vc/prompt.wav"
source_speech_path = "./models/tts/metis/wav/vc/source.wav"
n_timesteps = 20
cfg = 1.0
gen_speech = metis(
    prompt_speech_path=prompt_speech_path,
    source_speech_path=source_speech_path,
    cfg=cfg,
    n_timesteps=n_timesteps,
    model_type="vc",
)
# Write the converted speech at 24 kHz.
sf.write("./models/tts/metis/wav/vc/gen.wav", gen_speech, 24000)
3. Notes
This checkpoint does produce valid VC audio for me (instead of noise).
That said, it’s probably not the best or most up-to-date checkpoint — it’s from an older commit.
The current main branch only provides LoRA + adapter files, but I couldn’t reproduce VC successfully with those.