[Help] [Metis] Voice Conversion Irreproducible

Open yileitu opened this issue 8 months ago • 4 comments

Problem Overview

The example code in models/tts/metis/metis_infer_vc.py is incorrect and cannot be run as-is. Specifically:

  • It loads ft.json via load_config, which is unrelated to voice conversion.
  • It attempts to load metis_vc.safetensors, which does not exist in the HuggingFace repo. Only the following two files are available:
    • metis_vc_lora_16.safetensors
    • metis_vc_lora_16_adapter.safetensors
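
To double-check which VC checkpoints the repo actually serves, a minimal sketch along these lines may help. The helper names `find_vc_checkpoints` and `list_remote_vc_checkpoints` are made up for illustration, and the filter assumes VC checkpoint paths contain `metis_vc`; only `huggingface_hub.list_repo_files` is a real library call.

```python
def find_vc_checkpoints(repo_files):
    """Return the .safetensors files whose path mentions 'metis_vc', sorted."""
    return sorted(
        f for f in repo_files
        if "metis_vc" in f and f.endswith(".safetensors")
    )


def list_remote_vc_checkpoints(repo_id="amphion/metis", revision=None):
    """Query the Hub for the repo's file list (needs network access)."""
    # Imported lazily so the pure helper above works without huggingface_hub.
    from huggingface_hub import list_repo_files

    files = list_repo_files(repo_id, repo_type="model", revision=revision)
    return find_vc_checkpoints(files)
```

Calling `list_remote_vc_checkpoints()` with no arguments inspects the default branch; passing a commit hash as `revision` inspects an older state of the repo.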

Steps Taken

  1. Referred to the example usage for TTS here: https://github.com/open-mmlab/Amphion/tree/main/models/tts/metis#2-example-usaage

  2. Adapted that code for voice conversion (VC), as follows:

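    import os
    
    import soundfile as sf
    from huggingface_hub import snapshot_download
    
    # Import paths assumed from the repo's Metis example scripts.
    from models.tts.metis.metis import Metis
    from utils.util import load_config
    
    device = "cuda:0"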
    metis_cfg = load_config("./models/tts/metis/config/vc.json")
    
    base_ckpt_dir = snapshot_download(
        "amphion/metis",
        repo_type="model",
        local_dir="./models/tts/metis/ckpt",
        allow_patterns=["metis_base/model.safetensors"],
    )
    lora_ckpt_dir = snapshot_download(
        "amphion/metis",
        repo_type="model",
        local_dir="./models/tts/metis/ckpt",
        allow_patterns=["metis_vc/metis_vc_lora_16.safetensors"],
    )
    adapter_ckpt_dir = snapshot_download(
        "amphion/metis",
        repo_type="model",
        local_dir="./models/tts/metis/ckpt",
        allow_patterns=["metis_vc/metis_vc_lora_16_adapter.safetensors"],
    )
    
    base_ckpt_path = os.path.join(base_ckpt_dir, "metis_base/model.safetensors")
    lora_ckpt_path = os.path.join(lora_ckpt_dir, "metis_vc/metis_vc_lora_16.safetensors")
    adapter_ckpt_path = os.path.join(adapter_ckpt_dir, "metis_vc/metis_vc_lora_16_adapter.safetensors")
    
    metis = Metis(
        base_ckpt_path=base_ckpt_path,
        lora_ckpt_path=lora_ckpt_path,
        adapter_ckpt_path=adapter_ckpt_path,
        cfg=metis_cfg,
        device=device,
        model_type="vc",
    )
    
    prompt_speech_path = "./models/tts/metis/wav/vc/prompt.wav"
    source_speech_path = "./models/tts/metis/wav/vc/source.wav"
    
    n_timesteps = 20
    cfg = 1.0
    
    gen_speech = metis(
        prompt_speech_path=prompt_speech_path,
        source_speech_path=source_speech_path,
        cfg=cfg,
        n_timesteps=n_timesteps,
        model_type="vc",
    )
    
    sf.write("./models/tts/metis/wav/vc/gen.wav", gen_speech, 24000)
    
  3. Used the example WAV files in models/tts/metis/wav/vc/.

Expected Outcome

The script should generate intelligible, high-quality converted speech, similar to the samples on the demo page.

Actual Outcome

The generated audio is mostly noise and contains no recognizable human voice, so the demo results cannot be reproduced with the current VC pipeline.

Environment Information

  • Operating System: Ubuntu 20.04.5 LTS
  • Python Version: 3.10.16
  • Driver & CUDA Version: Driver 470.103.01 & CUDA 11.4
  • Error Messages and Logs: No runtime errors, but the model output is unusable.

Additional Context

Please provide the correct inference code used to generate the demo samples at https://metis-demo.github.io/#metis-vc. It would be especially helpful if you could:

  • Fix the example script at metis_infer_vc.py
  • Clearly specify which checkpoint files are required
  • Share the hyperparameters (cfg, n_timesteps, etc.) and audio preprocessing steps used in your demos

Thanks for your work. I am more than excited to use Metis VC once this is resolved.

yileitu avatar May 09 '25 07:05 yileitu

same

primepake avatar May 31 '25 19:05 primepake

same

VoicePrivacy avatar Jul 11 '25 14:07 VoicePrivacy

same

anujsinha72094 avatar Jul 16 '25 19:07 anujsinha72094

I was able to get voice conversion working, but only by using a legacy checkpoint that’s not included in the current main branch.

1. Download the legacy checkpoint

The file metis_vc.safetensors still exists in an older commit of the Hugging Face repo: commit 41499cbf5d91b06900fc774f673e70981f24b7ed

After downloading it, place it here: models/tts/metis/ckpt/metis_vc/metis_vc.safetensors
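
If you would rather script the download than fetch the file by hand, something like the sketch below might work. It pins `hf_hub_download` to the commit above via `revision`. The in-repo path `metis_vc/metis_vc.safetensors` is an assumption based on the target location, and the helper names are hypothetical.

```python
import os

# Commit hash cited above.
LEGACY_REVISION = "41499cbf5d91b06900fc774f673e70981f24b7ed"
# Assumption: the file sits at this path inside the repo at that commit.
LEGACY_FILENAME = "metis_vc/metis_vc.safetensors"


def expected_local_path(local_dir, filename=LEGACY_FILENAME):
    """Path where the file should end up under local_dir."""
    return os.path.join(local_dir, *filename.split("/"))


def download_legacy_vc_checkpoint(local_dir="./models/tts/metis/ckpt"):
    """Fetch the legacy checkpoint from the pinned commit (needs network)."""
    # Imported lazily so the path helper works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id="amphion/metis",
        filename=LEGACY_FILENAME,
        revision=LEGACY_REVISION,
        repo_type="model",
        local_dir=local_dir,
    )
```

With the default `local_dir`, the file should land at the location given above, which `expected_local_path("./models/tts/metis/ckpt")` also returns.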

2. Code changes

I skipped snapshot_download and pointed directly to the local checkpoint:

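    import os
    
    import soundfile as sf
    
    # Import paths assumed from the repo's Metis example scripts.
    from models.tts.metis.metis import Metis
    from utils.util import load_config
    
    # Not defined in my original snippet; same value as in the issue's code.
    device = "cuda:0"
    metis_cfg = load_config("./models/tts/metis/config/ft.json")
    
    # Local path to the legacy checkpoint downloaded in step 1.
    ckpt_dir = "models/tts/metis/ckpt"
    ckpt_path = os.path.join(ckpt_dir, "metis_vc/metis_vc.safetensors")
    
    metis = Metis(
        ckpt_path=ckpt_path,
        cfg=metis_cfg,
        device=device,
        model_type="vc",
    )
    
    prompt_speech_path = "./models/tts/metis/wav/vc/prompt.wav"
    source_speech_path = "./models/tts/metis/wav/vc/source.wav"
    
    n_timesteps = 20
    cfg = 1.0
    
    gen_speech = metis(
        prompt_speech_path=prompt_speech_path,
        source_speech_path=source_speech_path,
        cfg=cfg,
        n_timesteps=n_timesteps,
        model_type="vc",
    )
    
    # Write the converted speech at 24 kHz.
    sf.write("./models/tts/metis/wav/vc/gen.wav", gen_speech, 24000)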

3. Notes

This checkpoint does produce valid VC audio for me (instead of noise).

That said, it’s probably not the best or most up-to-date checkpoint — it’s from an older commit.

The current main branch only provides LoRA + adapter files, but I couldn’t reproduce VC successfully with those.
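
One way to narrow down why the LoRA + adapter files fail might be to compare the tensor names stored in each checkpoint. The sketch below is only a diagnostic aid, not a fix. The helper names are hypothetical, and the check for the substring "lora" in tensor names is a naming assumption that the Metis checkpoints may not follow.

```python
from collections import Counter


def summarize_keys(keys, depth=1):
    """Count tensor names by their first `depth` dot-separated components."""
    return Counter(".".join(k.split(".")[:depth]) for k in keys)


def count_lora_keys(keys):
    """Count names containing 'lora' (a naming assumption, not confirmed)."""
    return sum("lora" in k.lower() for k in keys)


def read_safetensors_keys(path):
    """List tensor names in a .safetensors file without loading weights."""
    # Imported lazily; requires the `safetensors` package.
    from safetensors import safe_open

    with safe_open(path, framework="pt") as f:
        return list(f.keys())
```

Running `summarize_keys(read_safetensors_keys(path))` on the base, LoRA, and adapter files, and again on the legacy metis_vc.safetensors, should show whether the newer files use prefixes the loading code does not expect.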

arishov1 avatar Sep 12 '25 06:09 arishov1