Amphion [Help] [Metis] Voice Conversion Irreproducible

Problem Overview

The example code in models/tts/metis/metis_infer_vc.py is incorrect and cannot be run as-is. Specifically:

- It loads ft.json via load_config, which is unrelated to voice conversion.
- It attempts to load metis_vc.safetensors, which does not exist in the Hugging Face repo. Only the following two files are available:
  - metis_vc_lora_16.safetensors
  - metis_vc_lora_16_adapter.safetensors
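The file listing above can be checked programmatically. The following is a minimal sketch, assuming the VC checkpoints live under a `metis_vc/` prefix in the `amphion/metis` repo (as the paths used later in this issue suggest). The helper names are hypothetical; `list_repo_files` is the `huggingface_hub` call for listing a repo's files.

```python
def vc_checkpoints(repo_files):
    """Return the sorted .safetensors files under metis_vc/ from a repo file listing."""
    return sorted(
        f for f in repo_files
        if f.startswith("metis_vc/") and f.endswith(".safetensors")
    )


def list_remote_vc_checkpoints(repo_id="amphion/metis", revision=None):
    """Query the Hub for the VC checkpoint files at a given revision (needs network)."""
    # Imported lazily so the pure helper above works without huggingface_hub installed.
    from huggingface_hub import list_repo_files

    return vc_checkpoints(list_repo_files(repo_id, revision=revision))
```

Calling `list_remote_vc_checkpoints()` with no arguments queries the default branch; passing a commit hash as `revision` shows what was available at that commit.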

Steps Taken

Referred to the example usage for TTS here: https://github.com/open-mmlab/Amphion/tree/main/models/tts/metis#2-example-usaage

Modified the code to a voice conversion (VC) version, as follows:

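import os

import soundfile as sf
from huggingface_hub import snapshot_download

# Amphion-internal imports (paths as in the repository's Metis example; run from the repo root)
from models.tts.metis.metis import Metis
from utils.util import load_config

device = "cuda:0"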
metis_cfg = load_config("./models/tts/metis/config/vc.json")

base_ckpt_dir = snapshot_download(
    "amphion/metis",
    repo_type="model",
    local_dir="./models/tts/metis/ckpt",
    allow_patterns=["metis_base/model.safetensors"],
)
lora_ckpt_dir = snapshot_download(
    "amphion/metis",
    repo_type="model",
    local_dir="./models/tts/metis/ckpt",
    allow_patterns=["metis_vc/metis_vc_lora_16.safetensors"],
)
adapter_ckpt_dir = snapshot_download(
    "amphion/metis",
    repo_type="model",
    local_dir="./models/tts/metis/ckpt",
    allow_patterns=["metis_vc/metis_vc_lora_16_adapter.safetensors"],
)

base_ckpt_path = os.path.join(base_ckpt_dir, "metis_base/model.safetensors")
lora_ckpt_path = os.path.join(lora_ckpt_dir, "metis_vc/metis_vc_lora_16.safetensors")
adapter_ckpt_path = os.path.join(adapter_ckpt_dir, "metis_vc/metis_vc_lora_16_adapter.safetensors")

metis = Metis(
    base_ckpt_path=base_ckpt_path,
    lora_ckpt_path=lora_ckpt_path,
    adapter_ckpt_path=adapter_ckpt_path,
    cfg=metis_cfg,
    device=device,
    model_type="vc",
)

prompt_speech_path = "./models/tts/metis/wav/vc/prompt.wav"
source_speech_path = "./models/tts/metis/wav/vc/source.wav"

n_timesteps = 20
cfg = 1.0

gen_speech = metis(
    prompt_speech_path=prompt_speech_path,
    source_speech_path=source_speech_path,
    cfg=cfg,
    n_timesteps=n_timesteps,
    model_type="vc",
)

sf.write("./models/tts/metis/wav/vc/gen.wav", gen_speech, 24000)

Used the example WAV files in models/tts/metis/wav/vc/.
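One way to narrow down why the base + LoRA + adapter combination produces noise is to compare the tensor names stored in each checkpoint with what the model expects. The sketch below reads only the JSON header of a `.safetensors` file (an 8-byte little-endian length followed by that many bytes of JSON), so it needs no GPU and no Amphion code. The function names are hypothetical, and nothing here is taken from the Metis loader itself.

```python
import json
import struct


def read_safetensors_header(path):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} without loading tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian unsigned 64-bit
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry, not a tensor
    return header


def summarize(path, limit=10):
    """Print the tensor count and the first few tensor names, dtypes, and shapes."""
    header = read_safetensors_header(path)
    print(f"{path}: {len(header)} tensors")
    for name in sorted(header)[:limit]:
        print(f"  {name}  {header[name]['dtype']}  {header[name]['shape']}")
```

Running `summarize` on `metis_base/model.safetensors`, `metis_vc/metis_vc_lora_16.safetensors`, and `metis_vc/metis_vc_lora_16_adapter.safetensors` would show whether the LoRA and adapter tensor names line up with the base model's module names.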

Expected Outcome

Expected to generate intelligible and high-quality converted speech, similar to the samples on the demo page.

Actual Outcome

The generated audio is very low quality and contains no human voice; it is mostly noise. As published, the VC pipeline is therefore not reproducible.
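To make "mostly noise" comparable across runs, a rough numeric check can help. The sketch below computes spectral flatness (geometric mean divided by arithmetic mean of the power spectrum) over the whole signal: white noise scores around 0.5, while tonal or voiced audio scores much lower. This is a generic heuristic chosen for illustration, not a metric used by the Metis authors, and the function name is hypothetical.

```python
import numpy as np


def spectral_flatness(signal):
    """Geometric mean / arithmetic mean of the power spectrum, a value in (0, 1]."""
    x = np.asarray(signal, dtype=np.float64)
    if x.ndim > 1:
        x = x.mean(axis=1)  # down-mix (frames, channels) audio to mono
    power = np.abs(np.fft.rfft(x)) ** 2 + 1e-12  # small offset avoids log(0)
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))
```

For the output of the script above, the audio could be loaded with `audio, sr = sf.read("./models/tts/metis/wav/vc/gen.wav")` and compared against the flatness of `source.wav`; a much higher value for `gen.wav` would be consistent with the noise described here.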

Environment Information

Operating System: Ubuntu 20.04.5 LTS
Python Version: 3.10.16
Driver & CUDA Version: Driver 470.103.01 & CUDA 11.4
Error Messages and Logs: No runtime errors, but the model output is unusable.

Additional Context

Please provide the correct inference code used to generate the demo samples at https://metis-demo.github.io/#metis-vc. It would be especially helpful if you could:

- Fix the example script at metis_infer_vc.py
- Clearly specify which checkpoint files are required
- Share the hyperparameters (cfg, n_timesteps, etc.) and audio preprocessing steps used in your demos

Thanks for your work. I am more than excited to use Metis VC once this is resolved.

May 09 '25 07:05 yileitu

same

May 31 '25 19:05 primepake

same

Jul 11 '25 14:07 VoicePrivacy

same

Jul 16 '25 19:07 anujsinha72094

I was able to get voice conversion working, but only by using a legacy checkpoint that’s not included in the current main branch.

1. Download the legacy checkpoint

The file metis_vc.safetensors still exists in an older commit of the Hugging Face repo: commit 41499cbf5d91b06900fc774f673e70981f24b7ed

After downloading it, place it here: models/tts/metis/ckpt/metis_vc/metis_vc.safetensors
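The download can also be scripted by pinning the revision. This is a sketch under two assumptions: that the file sat at `metis_vc/metis_vc.safetensors` inside the repo at that commit (inferred from the target path above, not verified), and that the helper names below are hypothetical. `hf_hub_download` with `revision` and `local_dir` is the `huggingface_hub` call used.

```python
LEGACY_REVISION = "41499cbf5d91b06900fc774f673e70981f24b7ed"  # commit hash quoted above


def legacy_download_args(local_dir="./models/tts/metis/ckpt"):
    """Build the arguments for fetching the legacy VC checkpoint at the pinned commit."""
    return {
        "repo_id": "amphion/metis",
        "repo_type": "model",
        # Assumption: the file lived at this path inside the repo at that commit.
        "filename": "metis_vc/metis_vc.safetensors",
        "revision": LEGACY_REVISION,
        "local_dir": local_dir,
    }


def download_legacy_checkpoint(local_dir="./models/tts/metis/ckpt"):
    """Download the checkpoint and return its local path (needs network access)."""
    from huggingface_hub import hf_hub_download  # lazy import

    return hf_hub_download(**legacy_download_args(local_dir))
```

With the default `local_dir`, the file should end up at `models/tts/metis/ckpt/metis_vc/metis_vc.safetensors`, which is the location the code in the next step reads from.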

2. Code changes

I skipped snapshot_download and pointed directly to the local checkpoint:

device = "cuda:0"  # assumed; defined as in the script from the original post
metis_cfg = load_config("./models/tts/metis/config/ft.json")

ckpt_dir = "models/tts/metis/ckpt"
ckpt_path = os.path.join(ckpt_dir, "metis_vc/metis_vc.safetensors")

metis = Metis(
    ckpt_path=ckpt_path,
    cfg=metis_cfg,
    device=device,
    model_type="vc",
)

prompt_speech_path = "./models/tts/metis/wav/vc/prompt.wav"
source_speech_path = "./models/tts/metis/wav/vc/source.wav"

n_timesteps = 20
cfg = 1.0

gen_speech = metis(
    prompt_speech_path=prompt_speech_path,
    source_speech_path=source_speech_path,
    cfg=cfg,
    n_timesteps=n_timesteps,
    model_type="vc",
)

sf.write("./models/tts/metis/wav/vc/gen.wav", gen_speech, 24000)

3. Notes

This checkpoint does produce valid VC audio for me (instead of noise).

That said, it’s probably not the best or most up-to-date checkpoint, since it comes from an older commit.

The current main branch only provides LoRA + adapter files, but I couldn’t reproduce VC successfully with those.

Sep 12 '25 06:09 arishov1