
[Help]: Requesting some guidance / documentation on choosing appropriate parameters for mssbcqt

Open codename0og opened this issue 1 year ago • 5 comments

Hello! I'd be very glad to get more information on how to adapt the mssbcqt discriminator for 48 kHz audio.

Lately I've been trying to improve the current architecture of RVC (retrieval-based-voice-conversion) by adopting the ms-sb-cqt and ms-stft discriminators. However, from what I can see, they were tested on 24 kHz audio, and the config is presumably meant for that rate as well. Essentially, I'd like some guidance on how to properly choose the CQT params:

        filters=32,
        max_filters=1024,
        filters_scale=1,
        dilations=[1, 2, 4],
        in_channels=1,
        out_channels=1,
        hop_lengths= [512, 256, 256],
        n_octaves=[9, 9, 9],
        bins_per_octaves=[24, 36, 48],  

For more details, this is the current config I use for training pretrained models for rvc:

   },
  "data": {
    "max_wav_value": 32768.0,
    "sampling_rate": 48000,
    "filter_length": 2048,
    "hop_length": 480,
    "win_length": 2048,
    "n_mel_channels": 128,
    "mel_fmin": 0.0,
    "mel_fmax": null
  },
  "model": {
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "kernel_size": 3,
    "p_dropout": 0,
    "resblock": "1",
    "resblock_kernel_sizes": [3,7,11],
    "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
    "upsample_rates": [12,10,2,2],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [24,20,4,4],
    "use_spectral_norm": false,
    "gin_channels": 256,
    "spk_embed_dim": 109
  }
}
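
One thing that can be checked directly from the config above: assuming the RVC generator follows the HiFi-GAN convention that each mel frame is expanded to `hop_length` audio samples, the product of `upsample_rates` should equal `hop_length`. A minimal sketch of that check, with the values copied from the config (the variable names `total_upsampling` and `hop_matches` are illustrative only):

```python
import math

# Values copied from the "data" and "model" sections of the config above.
hop_length = 480
upsample_rates = [12, 10, 2, 2]

# Assumption: HiFi-GAN-style generator, where one mel frame is upsampled
# to exactly hop_length samples, so the rates must multiply to hop_length.
total_upsampling = math.prod(upsample_rates)
hop_matches = total_upsampling == hop_length

print(f"total upsampling = {total_upsampling}, matches hop_length: {hop_matches}")
```

This only confirms that the generator side is internally consistent at 48 kHz; it says nothing about which CQT discriminator parameters are appropriate.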

As an important note: I intend to pair the mssbcqt / msstft combo with the existing MultiPeriodDiscriminator used in RVC. Thank you in advance!

codename0og avatar Nov 17 '24 01:11 codename0og

Bumping up.

codename0og avatar Nov 19 '24 16:11 codename0og

@VocodexElysium Hmm... So, is there any feedback I can count on? (Pardon the @, but it's been 2 weeks.)

codename0og avatar Dec 04 '24 17:12 codename0og

Bump. Still waiting patiently ~

codename0og avatar Jan 01 '25 20:01 codename0og

@codename0og Thanks for your patience! It seems the author @VocodexElysium hasn't responded yet. Some suggestions from me:

  1. Look at the https://github.com/NVIDIA/BigVGAN configurations, since they include a 44k config
  2. Check the long paper: https://arxiv.org/abs/2404.17161

thanks!

jiaqili3 avatar Jan 02 '25 08:01 jiaqili3

I would like to piggyback on this question to ask about the upsampling in the CQT discriminators. I couldn't find anything about it in the paper.

https://github.com/open-mmlab/Amphion/blob/f25ba323234031d6310bcbecbed57bb09fe5fb45/models/vocoders/gan/discriminator/mssbcqtd.py#L39-L41

and

https://github.com/open-mmlab/Amphion/blob/f25ba323234031d6310bcbecbed57bb09fe5fb45/models/vocoders/gan/discriminator/mssbcqtd.py#L110-L115

Why is this necessary? Does anyone have any insight into it?

EDIT: I have figured it out. Without resampling, the CQT analysis parameters result in filterbanks with center frequencies above the Nyquist frequency (half the sample rate).
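
The Nyquist argument can be sketched numerically. The snippet below assumes a lowest center frequency of `fmin = 32.70` Hz (note C1, a common default in CQT implementations; this value is an assumption and is not stated in this thread) and uses the `n_octaves` and `bins_per_octaves` values from the config posted above. The helper names are hypothetical and not part of Amphion.

```python
def top_center_frequency(fmin: float, n_octaves: int, bins_per_octave: int) -> float:
    """Center frequency of the highest CQT bin.

    Bin k of a CQT sits at fmin * 2 ** (k / bins_per_octave).
    """
    n_bins = n_octaves * bins_per_octave
    return fmin * 2 ** ((n_bins - 1) / bins_per_octave)


def fits_below_nyquist(sample_rate: int, fmin: float, n_octaves: int, bins_per_octave: int) -> bool:
    """True if every CQT center frequency lies below sample_rate / 2."""
    return top_center_frequency(fmin, n_octaves, bins_per_octave) < sample_rate / 2


FMIN = 32.70  # assumed default (C1); not confirmed by this thread

for bins_per_octave in (24, 36, 48):
    top = top_center_frequency(FMIN, 9, bins_per_octave)
    print(
        f"bins_per_octave={bins_per_octave}: top bin at {top:.0f} Hz, "
        f"fits at 24 kHz: {fits_below_nyquist(24000, FMIN, 9, bins_per_octave)}, "
        f"fits at 48 kHz: {fits_below_nyquist(48000, FMIN, 9, bins_per_octave)}"
    )
```

Under that assumption, nine octaves reach past 16 kHz, which is above the 12 kHz Nyquist frequency of 24 kHz audio but below the 24 kHz Nyquist frequency of a 48 kHz signal. An actual CQT implementation may apply a stricter check that also accounts for filter bandwidth, so treat this as a rough lower bound on the required sample rate.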

egaznep avatar Apr 01 '25 08:04 egaznep